Passed down wisdom can distort reality: Rather than developing their own contextual understanding, student models rely heavily on their teacher models’ pre-learned conclusions. Whether this limitation can lead to model hallucination is hotly debated by experts.

Brauchler is of the opinion that the quality of student models is tied to that of their teachers, irrespective of how they were trained. In other words, if a teacher model isn’t prone to hallucination, chances are its students won’t be either.

Agreeing with most of that argument, Arun Chandrasekaran, VP Analyst at Gartner, clarifies that student models may indeed suffer from newly introduced hallucinations, depending on their size and purpose.

“Distillation itself does not necessarily increase the rate of hallucinations, but if the student model is significantly smaller, it might lack the capacity to capture all the nuances of the teacher model, potentially leading to more errors or oversimplifications,” Chandrasekaran said.

When a model hallucinates, threat actors can exploit it by crafting adversarial prompts that manipulate its outputs, fueling misinformation campaigns or AI-driven exploits. One example of miscreants weaponizing hallucination is WormGPT, discovered in 2023: an AI system deliberately trained on unverified, potentially biased, and adversarial data so that it would hallucinate legal terminology, business processes, and financial policies, producing convincing but completely fabricated phishing emails and scam content.

Snatch AI made easy: Distilled models also lower the barrier for adversaries attempting model extraction attacks. By extensively querying these models, attackers can approximate their decision boundaries and recreate functionally similar models, often with reduced security constraints.

“Once an adversary has extracted a model, they can potentially modify it to bypass security measures or proprietary guidelines embedded in the original model,” Chandrasekaran said. “This could include altering the model’s behavior to ignore certain inputs or to produce outputs that align with the adversary’s goals.”

Brauchler, however, argues that bypassing an AI model’s proprietary security guardrails is not the primary driver behind model extraction attacks using distilled models. “Model extraction is usually exploited with the intent of capturing a proprietary model’s performance, not with the express purpose of bypassing guardrails,” he said. “There are much less strenuous techniques to avoid AI guardrails.”

Instead of using distillation for extraction, he explained, threat actors may disguise a malicious model as a “crispier,” distilled version of a legitimate one, given that model extraction attacks closely resemble model distillation.

One particular risk arises when proprietary models expose probability distributions (soft labels), because threat actors can apply distillation methodologies to replicate the target model’s functional behavior. Similar attacks can be executed using only output labels, but the absence of probability distributions significantly reduces their effectiveness, Brauchler added.

To sum up, distillation can potentially expose models to extraction, either by serving as a cover for replicating a source model’s behavior in an extraction attack or by enabling post-distillation extraction attempts with security bypasses.
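To make the soft-label point concrete, here is a minimal sketch, assuming a PyTorch setup, of why exposed probability distributions give an attacker a much richer training signal per query than bare output labels. The student network, optimizer, and query batch are hypothetical placeholders, not anything described by Brauchler or drawn from a specific product.

```python
# Minimal sketch (PyTorch assumed) contrasting the signal an attacker gets from
# full probability distributions ("soft labels") versus top-1 labels only.
# Every model/optimizer/query object here is a hypothetical placeholder.
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_probs, temperature=2.0):
    # Classic distillation objective: KL divergence between the teacher's
    # temperature-softened distribution and the student's prediction.
    teacher_soft = F.softmax(torch.log(teacher_probs + 1e-9) / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_soft, reduction="batchmean") * temperature ** 2

def hard_label_loss(student_logits, teacher_probs):
    # If the target API only reveals its top label, the attacker falls back to
    # ordinary cross-entropy on argmax labels, far less information per query.
    return F.cross_entropy(student_logits, teacher_probs.argmax(dim=-1))

def extraction_step(student, optimizer, queries, teacher_probs, soft_labels=True):
    # One optimization step of a replica ("student") model fit to responses
    # harvested by querying the target model.
    optimizer.zero_grad()
    student_logits = student(queries)
    loss_fn = soft_label_loss if soft_labels else hard_label_loss
    loss = loss_fn(student_logits, teacher_probs)
    loss.backward()
    optimizer.step()
    return loss.item()
```

With soft labels, every response reveals how the target weighs all possible outputs; with hard labels alone, an attacker needs far more queries to approximate the same decision boundaries, which is the gap Brauchler describes.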
They may not always have your back: Another downside of distillation is reduced interpretability. Large LLMs benefit from extensive logs and complex decision-making pathways that security teams can analyze for root-cause investigation. Distilled models, however, often lack this granularity, making it harder to diagnose vulnerabilities or trace security incidents.

“In the context of incident response, the lack of detailed logs and parameters in student models can make it harder to perform root cause analysis,” Chandrasekaran said. “Security researchers might find it more difficult to pinpoint the exact conditions or inputs that led to a security incident or to understand how an adversary exploited a vulnerability.”

This opacity complicates defensive strategies and forces security teams to rely on external monitoring techniques rather than internal AI audit trails.
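As a loose illustration of such external monitoring, the sketch below wraps a distilled model’s inference call with an independent audit log. The model.generate(prompt) interface and the log destination are assumptions made for the example, not features of any product named in this article.

```python
# Minimal sketch of an external audit-logging wrapper around a distilled model.
# The model.generate(prompt) interface and the log file path are assumptions;
# the point is that the audit trail lives outside the model, which keeps no
# detailed internal trace of its own.
import hashlib
import json
import logging
import time

audit_log = logging.getLogger("model_audit")
audit_log.setLevel(logging.INFO)
audit_log.addHandler(logging.FileHandler("distilled_model_audit.jsonl"))

def monitored_generate(model, prompt: str) -> str:
    # Call the model and record an external audit event that incident
    # responders can replay later for root-cause analysis.
    started = time.time()
    output = model.generate(prompt)  # hypothetical inference call
    audit_log.info(json.dumps({
        "ts": started,
        "latency_s": round(time.time() - started, 3),
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
        "output_chars": len(output),
    }))
    return output
```

Hashing rather than storing raw prompts is one possible design choice that keeps sensitive inputs out of the log while still letting responders correlate events across incidents.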
Fighting the AI curse: While the security risks from distilled models are pressing, the broader risk remains the nascent state of AI security itself, which is a key driver of all of these vulnerabilities.

“AI guardrails remain soft defense-in-depth controls, not security boundaries,” Brauchler noted. “And as systems move toward agentic contexts, the AI engineering industry will quickly discover that relying on guardrails will result in deep, impactful security vulnerabilities in critical systems, as NCC Group has already observed across multiple application environments.”

Only when developers change the way they think about AI application architectures will we be able to move toward designing systems with trust-based access controls in mind, he added.

First seen on csoonline.com
Jump to article: www.csoonline.com/article/3951626/llms-are-now-available-in-snack-size-but-digest-with-care.html