Introduction

Objective
Use Llama 2.0, Langchain and ChromaDB to create a Retrieval Augmented Generation (RAG) system. This will allow us to ask questions about our documents (that were not included in the training data), without fine-tunning the Large Language Model (LLM). When using RAG, if you are given a question, you first do a retrieval step to fetch any relevant documents from a special database, a vector database where these documents were indexed.
Definitions
- LLM – Large Language Model
- Llama 2.0 – LLM from Meta
- Langchain – a framework designed to simplify the creation of applications using LLMs
- Vector database – a database that organizes data through high-dimmensional vectors
- ChromaDB – vector database
- RAG – Retrieval Augmented Generation (see below more details about RAGs)
Model details
- Model: Llama 2
- Variation: 7b-chat-hf (7b: 7B dimm. hf: HuggingFace build)
- Version: V1
- Framework: PyTorch
LlaMA 2 model is pretrained and fine-tuned with 2 Trillion tokens and 7 to 70 Billion parameters which makes it one of the powerful open source models. It is a highly improvement over LlaMA 1 model.
What is a Retrieval Augmented Generation (RAG) system?
Large Language Models (LLMs) has proven their ability to understand context and provide accurate answers to various NLP tasks, including summarization, Q&A, when prompted. While being able to provide very good answers to questions about information that they were trained with, they tend to hallucinate when the topic is about information that they do “not know”, i.e. was not included in their training data. Retrieval Augmented Generation combines external resources with LLMs. The main two components of a RAG are so a retriever and a generator.
The retriever part can be described as a system that can encode our data so that it can be easily retrieved the relevant parts of it upon querying it. The encoding is done using text embeddings, i.e. a model trained to create a vector representation of the information. The best option for implementing a retriever is a vector database. As vector database, there are multiple options, both open source or commercial products. Few examples are ChromaDB, Mevius, FAISS, Pinecone, Weaviate. Our option in this Notebook will be a local instance of ChromaDB (persistent).
For the generator part, the obvious option is a LLM. In this Notebook we will use a quantized LLaMA v2 model, from the Kaggle Models collection.
The orchestration of the retriever and generator will be done using Langchain. A specialized function from Langchain allows us to create the receiver-generator in one line of code.
More about this
Do you want to learn more? Look into the References section for blog posts and in More work on the same topic for Notebooks about the technologies used here.
Installations, imports, utils
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
Initialize model, tokenizer, query pipeline
linkcode
Define the model, the device, and the bitsandbytes configuration.
model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=bfloat16
)
Prepare the model and the tokenizer
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
config=model_config,
quantization_config=bnb_config,
device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")
Define the query pipeline.
time_1 = time()
query_pipeline = transformers.pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
torch_dtype=torch.float16,
device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")
Prepare pipeline: 1.77 sec.
linkcode
We define a function for testing the pipeline.
def test_model(tokenizer, pipeline, prompt_to_test):
"""
Perform a query
print the result
Args:
tokenizer: the tokenizer
pipeline: the pipeline
prompt_to_test: the prompt
Returns
None
"""
# adapted from https://huggingface.co/blog/llama2#using-transformers
time_1 = time()
sequences = pipeline(
prompt_to_test,
do_sample=True,
top_k=10,
num_return_sequences=1,
eos_token_id=tokenizer.eos_token_id,
max_length=200,)
time_2 = time()
print(f"Test inference: {round(time_2-time_1, 3)} sec.")
for seq in sequences:
print(f"Result: {seq['generated_text']}")
Test the query pipeline
We test the pipeline with a query about the meaning of State of the Union (SOTU).
test_model(tokenizer,
query_pipeline,
"Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")
Retrieval Augmented Generation
Check the model with a HuggingFace pipeline
We check the model with a HF pipeline, using a query about the meaning of State of the Union (SOTU).
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")
Ingestion of data using Text loder
We will ingest the newest presidential address, from Jan 2023.
loader = TextLoader("/kaggle/input/president-bidens-state-of-the-union-2023/biden-sotu-2023-planned-official.txt",
encoding="utf8")
documents = loader.load()
Split data in chunks
We split data in chunks using a recursive character text splitter.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
Creating Embeddings and Storing in Vector Store
Create the embedding using Sentence Transformer and Hugging Face embedding.
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")
Initialize Chain
retriever = vectordb.as_retriever()
qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
verbose=True
)
Test the Retrieval-Augmented Generation
We define a test function, that will run the query and time it.
def test_rag(qa, query):
print(f"Query: {query}\n")
time_1 = time()
result = qa.run(query)
time_2 = time()
print(f"Inference time: {round(time_2-time_1, 3)} sec.")
print("\nResult: ", result)
Let’s check few queries.
query = "What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words."
test_rag(qa, query)
Query: What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words.
> Entering new RetrievalQA chain...
Batches: 100%
1/1 [00:00<00:00, 42.99it/s]
> Finished chain.
Inference time: 11.964 sec.
Result: The State of the Union in 2023 focused on several key topics, including the nation's economic strength, the competition with China, and the need to come together as a nation to face the challenges ahead. The President emphasized the importance of American innovation, industries, and military modernization to ensure the country's safety and stability. The President also highlighted the nation's resilience and optimism, urging Americans to see each other as fellow citizens and to work together to overcome the challenges facing the country.
query = "What is the nation economic status? Summarize. Keep it under 200 words."
test_rag(qa, query)
Query: What is the nation economic status? Summarize. Keep it under 200 words. > Entering new RetrievalQA chain...
Batches: 100%
1/1 [00:00<00:00, 44.27it/s]
> Finished chain.
Inference time: 10.907 sec.
Result: The nation's economic status is strong, with a low unemployment rate of 3.4%, near record lows for Black and Hispanic workers, and fastest growth in 40 years in manufacturing jobs. The president highlights the progress made in creating good-paying jobs, exporting American products, and reducing inflation. However, the president acknowledges there is still more work to be done to fully recover from the pandemic and Putin's war.
Document sources
Let’s check the documents sources, for the last query run.
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
doc_details = doc.to_json()['kwargs']
print("Source: ", doc_details['metadata']['source'])
print("Text: ", doc_details['page_content'], "\n")
Batches: 100%
1/1 [00:00<00:00, 44.59it/s]
Query: What is the nation economic status? Summarize. Keep it under 200 words. Retrieved documents: 4 Source: /kaggle/input/president-bidens-state-of-the-union-2023/biden-sotu-2023-planned-official.txt Text: forward. Of never giving up. A story that is unique among all nations. We are the only country that has emerged from every crisis stronger than when we entered it. That is what we are doing again. Two years ago, our economy was reeling. As I stand here tonight, we have created a record 12 million new jobs, more jobs created in two years than any president has ever created in four years. Two years ago, COVID had shut down our businesses, closed our schools, and robbed us of so much. Today, COVID no longer controls our lives. And two years ago, our democracy faced its greatest threat since the Civil War. Today, though bruised, our democracy remains unbowed and unbroken. As we gather here tonight, we are writing the next chapter in the great American story, a story of progress and resilience. When world leaders ask me to define America, I define our country in one word: Possibilities. You know, we’re often told that Democrats and Republicans can’t work together. But over these past two Source: /kaggle/input/president-bidens-state-of-the-union-2023/biden-sotu-2023-planned-official.txt Text: on the state of the union. And here is my report. Because the soul of this nation is strong, because the backbone of this nation is strong, because the people of this nation are strong, the State of the Union is strong. As I stand here tonight, I have never been more optimistic about the future of America. We just have to remember who we are. We are the United States of America and there is nothing, nothingbeyond our capacity if we do it together. May God bless you all. May God protect our troops. Source: /kaggle/input/president-bidens-state-of-the-union-2023/biden-sotu-2023-planned-official.txt Text: Americans, we meet tonight at an inflection point. One of those moments that only a few generations ever face, where the decisions we make now will decide the course of this nation and of the world for decades to come. We are not bystanders to history. We are not powerless before the forces that confront us. It is within our power, of We the People. We are facing the test of our time and the time for choosing is at hand. We must be the nation we have always been at our best. Optimistic. Hopeful. Forward-looking. A nation that embraces, light over darkness, hope over fear, unity over division. Stability over chaos. We must see each other not as enemies, but as fellow Americans. We are a good people, the only nation in the world built on an idea. That all of us, every one of us, is created equal in the image of God. A nation that stands as a beacon to the world. A nation in a new age of possibilities. So I have come here to fulfil my constitutional duty to report on the state of the Source: /kaggle/input/president-bidens-state-of-the-union-2023/biden-sotu-2023-planned-official.txt Text: about being able to look your kid in the eye and say, “Honey –it’s going to be OK,” and mean it. So, let’s look at the results. Unemployment rate at 3.4%, a 50-year low. Near record low unemployment for Black and Hispanic workers. We’ve already created 800,000 good-paying manufacturing jobs, the fastest growth in 40 years. Where is it written that America can’t lead the world in manufacturing again? For too many decades, we imported products and exported jobs. Now, thanks to all we’ve done, we’re exporting American products and creating American jobs. Inflation has been a global problem because of the pandemic that disrupted supply chains and Putin’s war that disrupted energy and food supplies. But we’re better positioned than any country on Earth. We have more to do, but here at home, inflation is coming down. Here at home, gas prices are down $1.50 a gallon since their peak. Food inflation is coming down. Inflation has fallen every month for the last six months while take home pay
Conclusions
We used Langchain, ChromaDB and Llama 2 as a LLM to build a Retrieval Augmented Generation solution. For testing, we were using the latest State of the Union address from Jan 2023.
More work on the same topic
You can find more details about how to use a LLM with Kaggle. Few interesting topics are treated in:
- https://www.kaggle.com/code/gpreda/test-llama-2-quantized-with-llama-cpp (quantizing LLama 2 model using llama.cpp)
- https://www.kaggle.com/code/gpreda/fast-test-of-llama-v2-pre-quantized-with-llama-cpp (quantized Llamam 2 model using llama.cpp)
- https://www.kaggle.com/code/gpreda/test-of-llama-2-quantized-with-llama-cpp-on-cpu (quantized model using llama.cpp – running on CPU)
- https://www.kaggle.com/code/gpreda/explore-enron-emails-with-langchain-and-llama-v2 (Explore Enron Emails with Langchain and Llama v2)
References
[1] Murtuza Kazmi, Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data, https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476
[2] Patrick Lewis, Ethan Perez, et. al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, https://browse.arxiv.org/pdf/2005.11401.pdf
[3] Minhajul Hoque, Retrieval Augmented Generation: Grounding AI Responses in Factual Data, https://medium.com/@minh.hoque/retrieval-augmented-generation-grounding-ai-responses-in-factual-data-b7855c059322
[4] Fangrui Liu , Discover the Performance Gain with Retrieval Augmented Generation, https://thenewstack.io/discover-the-performance-gain-with-retrieval-augmented-generation/
[5] Andrew, How to use Retrieval-Augmented Generation (RAG) with Llama 2, https://agi-sphere.com/retrieval-augmented-generation-llama2/
[6] Yogendra Sisodia, Retrieval Augmented Generation Using Llama2 And Falcon, https://medium.com/@scholarly360/retrieval-augmented-generation-using-llama2-and-falcon-ed26c7b14670