Topic 9: What is Speculative RAG

We explore the recent Speculative RAG idea and where it excels, discuss the limitations of other RAG systems, and close with a list of useful resources.

New day, new RAG research! We've already dedicated many articles to the Retrieval-Augmented Generation (RAG) approach and its various modifications, but honestly, it's a topic that never gets old. As one of the most popular methods for enhancing LLMs with external knowledge and retrieving the information you actually need, RAG is continually evolving. Researchers are actively working on modifications to address its limitations, such as long processing times and difficulties with handling extended contexts. But with so much research out there, it can be challenging to know what to focus on. That's where we come in. Speculative RAG is one of the latest approaches, aiming to balance efficiency and effectiveness by using two types of language models in its architecture. The researchers behind it have come up with an intriguing idea: leveraging a small language model within a larger framework. Let’s explore how they arrived at this concept of Speculative RAG.

In today’s episode, we will cover:

  • Overview of existing RAG systems and their limitations

  • Here comes Speculative RAG

  • How does Speculative RAG work?

  • How good is Speculative RAG?

  • Where Speculative RAG excels

  • Not without limitations…

  • Conclusion

  • Bonus: Resources

Overview of existing RAG systems and their limitations

As you probably know, RAG systems combine large language models (LLMs) with external information sources to answer queries. These systems aim to enhance the accuracy and relevance of responses by incorporating data retrieved from external databases.
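
To ground this, here is a minimal sketch of such a vanilla RAG loop in Python. The `retrieve` and `generate` functions are hypothetical stand-ins for a vector-store lookup and an LLM call, not any particular library's API:

```python
# Minimal sketch of a vanilla RAG loop. `retrieve` and `generate` are
# toy placeholders, not a real retriever or LLM client.

def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder: return the k documents most relevant to the query."""
    corpus = ["Doc about topic A.", "Doc about topic B.", "Doc about topic C."]
    return corpus[:k]

def generate(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM and return its completion."""
    return f"(model answer conditioned on a {len(prompt)}-character prompt)"

def rag_answer(query: str) -> str:
    docs = retrieve(query)
    # Every retrieved document is concatenated into one prompt, so the
    # prompt grows linearly with k: the root of the long-context problem.
    context = "\n\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

print(rag_answer("What is Speculative RAG?"))
```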

However, existing RAG approaches have limitations: 

  • The original RAG incorporates all retrieved documents directly into the prompt, increasing its length and slowing response times. In other words, it struggles with long context windows.

  • LongRAG deals with the long context window (we covered it here) but struggles with the limitations of current long-context embedding models, the inefficiencies of black-box LLMs for handling extended inputs, and its reliance on Wikipedia-specific grouping methods, which limit its generalizability.

  • Graph RAG, a popular method from one of our previous episodes, organizes data into a graph structure, representing text data and its interrelations. Graph-based models may struggle with non-stationary data where the relationships between variables change over time.

  • Self-Reflective RAG requires specialized instruction-tuning of general-purpose LMs so they can generate specific tags for self-reflection. This additional tuning can be complex and resource-intensive.

  • Corrective RAG uses an external retrieval evaluator to refine the quality of retrieved documents (see the toy sketch after this list). However, it focuses solely on improving the contextual information and doesn't enhance the model's reasoning capabilities.
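
To make that last pattern concrete, here is a toy sketch of Corrective RAG's evaluator-filter step. The word-overlap scorer below is a deliberately crude stand-in for the trained retrieval evaluator; note that only the context is refined, while the generator's reasoning is left untouched:

```python
# Toy sketch of the evaluator-filter pattern: an external evaluator
# scores each retrieved document, and only sufficiently relevant ones
# reach the generator. The scorer is a crude word-overlap stand-in.

def evaluate_relevance(query: str, doc: str) -> float:
    """Toy evaluator: fraction of query words that appear in the document."""
    words = query.lower().split()
    return sum(w in doc.lower() for w in words) / len(words)

def corrective_filter(query: str, docs: list[str], threshold: float = 0.5) -> list[str]:
    # Documents below the threshold are discarded; the model's own
    # reasoning over the surviving context is unchanged.
    return [d for d in docs if evaluate_relevance(query, d) >= threshold]

docs = ["Speculative RAG uses a drafter model.", "Unrelated text about cooking."]
print(corrective_filter("speculative RAG drafter model", docs))
```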

Image credit: Speculative RAG paper

These methods often struggle to balance efficiency and effectiveness. It is difficult for them to handle long contexts and ensure diverse perspectives while maintaining speed and accuracy, all at the same time.

Here comes Speculative RAG

The researchers from the University of California, San Diego, and Google came up with a thought: what if a smaller, specialized language model could efficiently draft multiple answers from different document subsets, and then a larger, generalist language model could simply verify these drafts? How would this improve the RAG approach?
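
Conceptually, the resulting pipeline looks something like the sketch below. This is a hedged illustration, not the paper's exact algorithm: `drafter_generate` and `verifier_score` are placeholders for the small specialist LM and the large generalist LM, and the random score merely stands in for the verifier's probability-based scoring.

```python
# Hedged sketch of the Speculative RAG idea: a small "drafter" LM writes
# one candidate answer per document subset, and a larger "verifier" LM
# scores the drafts and picks the best. The model calls and scoring rule
# are illustrative stand-ins, not the paper's exact procedure.

import random

def drafter_generate(query: str, docs: list[str]) -> str:
    """Placeholder for the small specialist LM: draft an answer from one subset."""
    return f"Draft grounded in {len(docs)} doc(s) for: {query}"

def verifier_score(query: str, draft: str) -> float:
    """Placeholder for the large generalist LM: rate a draft's reliability.
    (The paper derives this score from the verifier's token probabilities.)"""
    return random.random()  # stand-in score

def speculative_rag(query: str, retrieved: list[str], n_subsets: int = 3) -> str:
    # Partition retrieved documents into subsets so each draft is grounded
    # in a different slice of the evidence (diverse perspectives).
    subsets = [retrieved[i::n_subsets] for i in range(n_subsets)]
    # The small drafter produces one candidate per subset; these calls are
    # independent and could run in parallel.
    drafts = [drafter_generate(query, s) for s in subsets]
    # The large verifier only reads the short drafts, not every retrieved
    # document, which is where the efficiency gain comes from.
    _, best_draft = max((verifier_score(query, d), d) for d in drafts)
    return best_draft

docs = [f"document {i}" for i in range(6)]
print(speculative_rag("What is Speculative RAG?", docs))
```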

The rest of this article, with detailed explanations and the best library of relevant resources, is available to our Premium users only.
