The limitations of model fine-tuning and RAG

By Jignesh Patel

The hype and awe around generative AI have waned to some extent. “Generalist” large language models (LLMs) like GPT-4, Gemini (formerly Bard), and Llama whip up smart-sounding sentences, but their thin domain expertise, hallucinations, lack of emotional intelligence, and obliviousness to current events can lead to terrible surprises. Generative AI exceeded our expectations until we needed it to be dependable, not just amusing.

In response, domain-specific LLMs have emerged, aiming to provide more credible answers. These LLM “specialists” include LEGAL-BERT for law, BloombergGPT for finance, and Google Research’s Med-PaLM for medicine. The open question in AI is how best to create and deploy these specialists. The answer may have ramifications for the generative AI business, which so far is frothy with valuations but dry of profit due to the monumental costs of developing both generalist and specialist LLMs.

To specialize LLMs, AI developers often rely on two key techniques: fine-tuning and retrieval-augmented generation (RAG). Each has limitations that have made it difficult to develop specialist LLMs at a reasonable cost. However, these limitations have informed new techniques that may change how we specialize LLMs in the near future.

Specialization is expensive

Today, the overall best performing LLMs are generalists, and the best specialists begin as generalists and then undergo fine-tuning. The process is akin to putting a humanities major through a STEM graduate degree. And like graduate programs, fine-tuning is time-consuming and expensive. It remains a choke point in generative AI development because few companies have the resources and know-how to build high-parameter generalists from scratch.

Think of an LLM as a big ball of numbers that encapsulates relationships between words, phrases, and sentences. The bigger that ball of numbers and the corpus of text behind it, the better the LLM tends to perform. Thus, an LLM with 1 trillion parameters generally outperforms a 70-billion-parameter model on coherence and accuracy.

To fine-tune a specialist, we either adjust the ball of numbers or add a set of complementary numbers. For instance, to turn a generalist LLM into a legal specialist, we could feed it legal documents along with correct and incorrect answers about those documents. The fine-tuned LLM would be better at summarizing legal documents and answering questions about them.
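
To make the "add a set of complementary numbers" idea concrete, here is a minimal sketch using low-rank adapters (LoRA), one common parameter-efficient way to fine-tune. It assumes the Hugging Face transformers and peft libraries; the model name and hyperparameters are illustrative, not recommendations.

```python
# A minimal sketch of "adding a set of complementary numbers" via LoRA adapters.
# Assumes the Hugging Face transformers and peft libraries; the model name and
# hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach small low-rank matrices alongside the frozen base weights.
adapter_cfg = LoraConfig(
    r=8,                                  # rank of the added matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to augment
    task_type="CAUSAL_LM",
)
specialist = get_peft_model(base, adapter_cfg)

# Only the adapter weights train; the original "ball of numbers" stays fixed.
specialist.print_trainable_parameters()
# Training on domain documents (e.g., legal Q&A pairs) would follow here.
```

Training on the legal corpus then updates only the small adapter matrices, which is why fine-tuning in this style costs less than retraining the full ball of numbers, though it is still far from free.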

Because one fine-tuning project with Nvidia GPUs can cost hundreds of thousands of dollars, specialist LLMs are rarely fine-tuned more than once a week or month. As a result, they’re rarely current with the latest knowledge and events in their field.

If there were a shortcut to specialization, thousands of enterprises could enter the LLM space, leading to more competition and innovation. And if that shortcut made specialization faster and less expensive, perhaps specialist LLMs could be updated continuously. RAG is almost that shortcut, but it, too, has limitations.

Learning from RAG

LLMs are always a step behind the present. Prompt an LLM about recent events it did not see during training, and it will either refuse to answer or hallucinate. If I surprised a class of undergraduate computer science majors with exam questions about an unfamiliar topic, the result would be similar. Some wouldn't answer, and some would fabricate reasonable-sounding answers. However, if I gave the students a primer on that new subject within the exam itself, they might learn enough to answer correctly.

That is RAG in a nutshell. We enter a prompt and then give the LLM additional, relevant information with examples of right and wrong answers to augment what it will generate. The LLM won’t be as knowledgeable as a fine-tuned peer, but RAG can get an LLM up to speed at a much lower cost than fine-tuning.
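
As a rough illustration, a bare-bones RAG pipeline retrieves the passages most similar to the question and prepends them to the prompt. The sketch below assumes the sentence-transformers library; the embedding model, the tiny corpus, and the final generate call are illustrative placeholders.

```python
# A minimal RAG sketch: retrieve the passages most similar to the question and
# prepend them to the prompt. The corpus and model choices are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["Clause 4.2 caps liability at fees paid...",
          "Either party may terminate with 30 days notice...",
          "Governing law is the State of Delaware..."]
doc_vecs = encoder.encode(corpus, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    q = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                  # cosine similarity (vectors are normalized)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

question = "How can the contract be terminated?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# answer = some_llm.generate(prompt)       # hypothetical call to whichever LLM you use
```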

Still, several factors limit what LLMs can learn via RAG. The first factor is the token allowance. With the undergrads, I could introduce only so much new information into a timed exam without overwhelming them. Similarly, LLMs have a context window, generally between 4,000 and 32,000 tokens per prompt, which caps how much an LLM can learn on the fly. The cost of invoking an LLM also scales with the number of tokens, so being economical with the token budget is important for controlling cost.
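
One way to respect that budget is to count tokens before packing retrieved passages into the prompt. This sketch assumes the tiktoken tokenizer; the budget figure is illustrative, and the passages are assumed to arrive already ranked by relevance.

```python
# A minimal sketch of staying inside a token budget when packing retrieved
# passages into a prompt. The budget figure is illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
BUDGET = 4000  # tokens left for context after the question and instructions

def pack(passages: list[str], budget: int = BUDGET) -> str:
    chosen, used = [], 0
    for p in passages:                    # passages assumed pre-sorted by relevance
        cost = len(enc.encode(p))
        if used + cost > budget:
            break                         # skip anything that would blow the budget
        chosen.append(p)
        used += cost
    return "\n".join(chosen)
```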

The second limiting factor is the order in which RAG examples are presented to the LLM. The earlier a concept is introduced in the example, the more attention the LLM pays to it in general. While a system could reorder retrieval augmentation prompts automatically, token limits would still apply, potentially forcing the system to cut or downplay important facts. To address that risk, we could prompt the LLM with information ordered in three or four different ways to see if the response is consistent. At that point, though, we get diminishing returns on our time and computational resources.
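
A sketch of that consistency check, assuming a hypothetical call_llm callable that wraps whichever model API is in use:

```python
# Probe order sensitivity: ask the same question several times with the retrieved
# passages shuffled differently, then measure agreement across the answers.
import random
from collections import Counter
from typing import Callable

def order_consistency(question: str, passages: list[str],
                      call_llm: Callable[[str], str], trials: int = 3) -> tuple[str, float]:
    answers = []
    for seed in range(trials):
        ordering = passages[:]
        random.Random(seed).shuffle(ordering)   # a different ordering each trial
        prompt = "\n".join(ordering) + f"\n\nQuestion: {question}"
        answers.append(call_llm(prompt))        # hypothetical LLM call
    top, count = Counter(answers).most_common(1)[0]
    return top, count / trials                  # most common answer and agreement rate
```

Comparing raw strings is a simplification; in practice you would normalize or semantically compare the answers before counting agreement.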

The third challenge is to execute retrieval augmentation such that it doesn’t diminish the user experience. If an application is latency sensitive, RAG tends to make latency worse. Fine-tuning, by comparison, has minimal effect on latency. It’s the difference between already knowing the information versus reading about it and then devising an answer.

One option is to combine techniques: Fine-tune an LLM first and then use RAG to update its knowledge or to reference private information (e.g., enterprise IP) that can't be included in a publicly available model. Whereas fine-tuning permanently changes the model's weights, RAG supplies knowledge only for the duration of a prompt, which prevents one user's preferences and reference material from rewiring the entire model in unintended ways.
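
Put together, the combined approach loads a permanently fine-tuned specialist and hands it temporary, per-prompt context. This sketch assumes the Hugging Face transformers and peft libraries, reuses the retrieve() helper from the RAG sketch above, and treats the model name and adapter path as illustrative.

```python
# Load a fine-tuned specialist (a saved LoRA adapter on a base model) and feed it
# retrieved private context at inference time. Paths and names are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
specialist = PeftModel.from_pretrained(base, "./legal-adapter")   # permanent specialization
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

question = "Does clause 4.2 cap our liability?"
context = "\n".join(retrieve(question))   # temporary, per-prompt knowledge (see RAG sketch)
prompt = f"Context:\n{context}\n\nQuestion: {question}"

inputs = tokenizer(prompt, return_tensors="pt")
output = specialist.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```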

Testing the limitations of fine-tuning and RAG has helped us refine the open question in AI: How do we specialize LLMs at a lower cost and higher speed without sacrificing performance to token limits, prompt-ordering issues, and latency sensitivity?

Council of specialists

We know that a choke point in generative AI is the cost-effective development of specialist LLMs that provide reliable, expert-level answers in specific domains. Fine-tuning and RAG get us there, but at too high a cost. Let's consider a potential solution, then: What if we skipped (most of) generalist training, specialized multiple lower-parameter LLMs, and then applied RAG?

In essence, we’d take a class of liberal arts students, cut their undergrad program from four years to one, and send them to get related graduate degrees. We’d then run our questions by some or all of the specialists. This council of specialists would be less computationally expensive to create and run.

The idea, in human terms, is that five lawyers with five years of experience each are more dependable than one lawyer with 50 years of experience. We’d trust that the council, though less experienced, has probably generated a correct answer if there’s widespread agreement among its members.

We are beginning to see experiments in which multiple specialist LLMs collaborate on the same prompt, and so far they've worked quite well. For instance, Mistral AI's Mixtral uses a sparse mixture-of-experts (SMoE) architecture with eight expert subnetworks in each layer. A router sends each token through just two of those experts, so the model holds 46.7 billion total parameters but uses only about 12.9 billion per token.
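
The arithmetic comes from top-2 routing. The toy PyTorch sketch below shows the mechanism at the level of a single layer; the dimensions and expert sizes are tiny and illustrative, and a production SMoE model repeats a layer like this throughout the network.

```python
# A toy sketch of top-2 sparse mixture-of-experts routing. Dimensions and expert
# count are illustrative; real SMoE layers sit inside every transformer block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoE(nn.Module):
    def __init__(self, dim: int = 64, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, dim)
        weights = F.softmax(self.router(x), dim=-1)         # router score per expert
        top_w, top_idx = weights.topk(2, dim=-1)            # keep only 2 experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)     # renormalize the pair
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(5, 64)
print(TopTwoMoE()(tokens).shape)   # each token touched only 2 of the 8 expert MLPs
```

Only two of the eight expert MLPs run for any given token, which is why the active parameter count stays far below the total.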

Councils also reduce the randomness inherent in using a single LLM. The probability that one LLM hallucinates is relatively high, but the odds that five LLMs hallucinate on the same answer are lower. We can still add RAG to share new information. If the council approach ultimately works, smaller enterprises could afford to develop specialized LLMs that outmatch fine-tuned specialists and still learn on the fly using RAG.
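
A council could be wired together with something as simple as a quorum vote. In this sketch the specialists dictionary and its callables are hypothetical stand-ins for however the individual models are hosted.

```python
# A sketch of the council vote: pose one prompt to several specialists and accept
# the most common answer only if a quorum agrees. Model names are hypothetical.
from collections import Counter
from typing import Callable

def council_answer(prompt: str, specialists: dict[str, Callable[[str], str]],
                   quorum: float = 0.6) -> str | None:
    votes = Counter(ask(prompt) for ask in specialists.values())
    answer, count = votes.most_common(1)[0]
    if count / len(specialists) >= quorum:
        return answer       # broad agreement: unlikely that every member hallucinated
    return None             # no consensus: retry with RAG context or defer to a human
```

Exact-match voting is a simplification; a real system would compare answers semantically or let a separate judge model reconcile near-duplicates before declaring consensus.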

For human students, early specialization can be problematic. Generalist knowledge is often essential to grasp advanced material and put it into a broader context. The specialist LLMs, however, wouldn’t have civic, moral, and familial responsibilities like human beings. We can specialize them young without stressing about the resulting deficiencies.

One or many

Today, the best approach to training a specialist LLM is to fine-tune a generalist. RAG can temporarily increase the knowledge of an LLM, but because of token limitations, that added knowledge is shallow.

Soon, we may skip generalist training and develop councils of more specialized, more computing-efficient LLMs enhanced by RAG. No longer will we depend on generalist LLMs with extraordinary abilities to fabricate knowledge. Instead, we’ll get something like the collective knowledge of several well-trained, young scholars.

While we should be careful about anthropomorphizing LLMs—or ascribing machine-like qualities to humans—some parallels are worth noting. Counting on one person, news source, or forum for our knowledge would be risky, just as depending on one LLM for accurate answers is risky.

Conversely, brainstorming with 50 people, reading 50 news sources, or checking 50 forums introduces too much noise (and labor). Same with LLMs. There is likely a sweet spot between one generalist and too many specialists. Where it sits, we don’t know yet, but RAG will be even more useful once we find that balance.

Dr. Jignesh Patel is a co-founder of DataChat and professor at Carnegie Mellon University.

Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.

© InfoWorld