In the previous article, we established that RAG works like a reading comprehension test. The LLM is the Student—highly capable of understanding language but needing the right study notes to answer the question.

This brings us to the second, equally important character in our system: The Librarian.

If the Student (LLM) is going to answer the question correctly, the Librarian (Retriever) must find the exact page of the textbook containing the answer. If the Librarian hands over the wrong page, the Student fails, no matter how smart they are.

In this article, we will move beyond high-level concepts and explore the technical implementation of this Librarian: how to organize the books (Chunking) and how to find the right page (Hybrid Search).

Step 0: The Art of “Chunking”

Before we can even search through our data, we have to decide how to store it. Many developers make the mistake of ingesting whole documents (PDFs, PPTs) as single units. This leads to immediate failure:

  • Too Large: If you retrieve a 50-page document, the LLM will be overwhelmed with noise (the “Lost in the Middle” phenomenon).

  • Too Small: If you retrieve a single sentence, you miss the necessary context to answer the question.

We need a “Goldilocks” zone—small enough to be precise, but large enough to contain the answer. This process is called Chunking.

The “Sliding Window” Strategy

The biggest risk in chunking is cutting a sentence in half or separating a question from its answer.

  • Bad Chunking: Chunk 1 ends with “The revenue was…” and Chunk 2 starts with “$5 million.” The meaning is lost.
  • Sliding Window: We create overlaps. Chunk 1 covers sentences A-B-C. Chunk 2 covers sentences B-C-D.

This redundancy ensures that the semantic meaning is never sliced in half at the edge of a chunk.
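
To make the idea concrete, here is a minimal Python sketch of a character-based sliding window. Production pipelines usually split on sentence or token boundaries instead, and the chunk size and overlap values below are illustrative, not recommendations.

```python
# A minimal character-based sliding-window chunker.
# chunk_size and overlap are illustrative values.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by (chunk_size - overlap) so consecutive chunks
        # share an overlap-sized window, preserving meaning at the edges.
        start += chunk_size - overlap
    return chunks

document = open("manual.txt").read()  # hypothetical source document
chunks = chunk_text(document)
```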

Step 1: The Three Librarians (Search Algorithms)

Once our data is chunked, how do we find what we need? Research in this area has evolved significantly. We can think of the different search methods as different types of librarians.

1. The Keyword Approach (BM25)

This is the traditional search we’ve used for decades (BM25). It matches exact words.

  • Pros: Precise. If you search for a specific Product ID “XJ-900”, it finds exactly that.
  • Cons: It doesn’t understand meaning. If you search “How do I start the car?”, it might miss a document titled “Ignition Guide” because the word “start” isn’t in the title.
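
As a rough illustration, keyword scoring with the open-source rank_bm25 package looks like this; the sample chunks and query are invented for the example.

```python
# Exact-term matching with BM25 (pip install rank-bm25).
from rank_bm25 import BM25Okapi

chunks = [
    "Product XJ-900 requires firmware version 2.1 or later.",
    "Leasing terms differ from buying terms in several legal respects.",
    "Ignition Guide: turn the key to position II and hold until the engine fires.",
]

# BM25 operates on tokens; a lowercase whitespace split is enough for a demo.
tokenized_chunks = [c.lower().split() for c in chunks]
bm25 = BM25Okapi(tokenized_chunks)

query = "xj-900 firmware"
print(bm25.get_scores(query.split()))
# The chunk containing the exact ID "XJ-900" scores highest.
# A query like "start the car" would largely miss the Ignition Guide,
# because the literal words "start" and "car" never appear in it.
```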

2. The Semantic Approach (Embeddings)

This is the modern approach using Embeddings. It matches meaning rather than words.

  • Pros: It understands synonyms. It knows “start” and “ignition” are related.
  • Cons: It can be “fuzzy.” It struggles with specific jargon, acronyms, or exact matches (like names or serial numbers) that a general model hasn’t seen before. For example, in the automotive industry, “leasing” and “buying” have distinct legal meanings that a general model might conflate.
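
A comparable sketch for semantic search, assuming the sentence-transformers library; the model name is just one common choice, and any embedding model works the same way.

```python
# Meaning-based matching with embeddings (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

chunks = [
    "Ignition Guide: turn the key to position II and hold until the engine fires.",
    "Product XJ-900 requires firmware version 2.1 or later.",
]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

query = "How do I start the car?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and each chunk.
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
print(scores)  # the Ignition Guide ranks first even though it never says "start"
```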

3. The Hybrid Approach

Since both methods have blind spots, the best practice is to combine them. This is Hybrid Search. However, implementing it introduces a difficult math problem that trips up many developers.

You might think you can just add the scores together, but you can’t.

  • Vector Search (Cosine Similarity) returns a score between 0 and 1 (e.g., 0.85).
  • Keyword Search (BM25) returns a score that can theoretically range from 0 to infinity (e.g., 4.2 or 15.5).

If you simply add them (0.85 + 4.2), the Keyword score completely dominates. You need a way to normalize these two completely different scoring systems.

The Solution: Reciprocal Rank Fusion (RRF)

The industry standard for solving this is an algorithm called Reciprocal Rank Fusion (RRF).

Instead of looking at the raw scores, RRF looks at the rank position.

  • It doesn’t care that the Vector score was 0.85; it cares that it was the #1 result.
  • It doesn’t care that the Keyword score was 4.2; it cares that it was the #1 result.

RRF takes the rank from List A and the rank from List B and fuses them into a new, unified ranking. This ensures that a document appearing at the top of both lists rises to the top of the final answer.
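
Because RRF only needs rank positions, it fits in a few lines. The constant k (commonly set to 60, following the original RRF paper) keeps any single top rank from dominating; the document IDs below are placeholders.

```python
# Reciprocal Rank Fusion: combine ranked lists using positions, not raw scores.
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_results = ["doc_a", "doc_c", "doc_b"]    # ranked by cosine similarity
keyword_results = ["doc_b", "doc_a", "doc_d"]   # ranked by BM25

print(reciprocal_rank_fusion([vector_results, keyword_results]))
# doc_a wins: it sits near the top of both lists, which is exactly what we want.
```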

The Final Polish: SPLADE (The Translator)

Even with Hybrid Search, we still face the Vocabulary Mismatch problem. People use different words to describe the same topic.

Traditional Keyword search tries to fix this with “synonym dictionaries,” but these are hard to maintain.

Enter SPLADE (Sparse Lexical and Expansion model). Think of SPLADE as a “Translator” that sits between the user and the search engine. It uses a BERT-style language model to automatically “expand” the query into weighted terms.

If the user types “car,” SPLADE might automatically expand that query to include “vehicle,” “automobile,” and “sedan” before running the search. It bridges the gap between the user’s casual language and the technical documents in your database.
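
As a sketch only: SPLADE models are typically run through Hugging Face transformers, turning a query into a weighted bag of vocabulary terms. The checkpoint name below is an assumption (one of the publicly released SPLADE models); substitute whatever sparse model you actually deploy.

```python
# SPLADE-style query expansion, sketched with Hugging Face transformers.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("car", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)

# SPLADE weighting: log(1 + ReLU(logit)), max-pooled over the query tokens.
weights = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)

# The non-zero entries are the expanded query: the original word plus
# related terms the model adds, each with a weight usable by keyword search.
nonzero_ids = torch.nonzero(weights).squeeze(1).tolist()
expanded = {tokenizer.decode([i]): round(weights[i].item(), 2) for i in nonzero_ids}
print(sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)[:10])
```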

By combining Chunking (Storage), Hybrid Search (Retrieval), and RRF (Ranking), we build a Librarian capable of finding the exact page the Student needs.