Want to build your own PDF chatbot and keep control over every part of it? This tutorial walks through building one without libraries that hide the pipeline behind heavy abstraction. You write the preprocessing, retrieval, and re-ranking steps yourself, so you decide how the text is chunked, how many results are retrieved, and how the search works. If you also host the models yourself, your documents never leave your own infrastructure. We will use two models, falcon-7b and falcon-40b, to generate the answers.
Introduction
Chatbots have become increasingly popular in recent years, providing automated responses and assistance in various domains. However, many existing chatbot libraries come with high-level abstractions that limit customization and control. To address this limitation, we will explore a method to create a PDF chatbot with fine-grained control.
Prerequisites
Before we dive into the code, make sure you have the following prerequisites:
1. Python: Make sure you have Python installed on your system.
2. PDFMiner: Install the PDFMiner library, which provides tools for extracting text from PDF documents.
pip install pdfminer.six
3. Sentence Transformers: Install the Sentence Transformers library, which offers an easy way to compute embeddings for sentences and paragraphs.
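pip install sentence-transformers
4. Text Generation: Install the `text-generation` client library, which the code uses to send prompts to the Falcon models. You also need endpoints serving falcon-7b and falcon-40b; their URLs go into `CLIENT_7B` and `CLIENT_40B` in the code below.
pip install text-generation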
Code Implementation
Let’s start by setting up our code. Open your preferred Python editor and create a new file. Copy and paste the following code into the file:
    import argparse

    from pdfminer.high_level import extract_text
    from sentence_transformers import SentenceTransformer, CrossEncoder, util
    from text_generation import Client

    PREPROMPT = (
        "Below are a series of dialogues between various people and an AI assistant. "
        "The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, "
        "and humble-but-knowledgeable. The assistant is happy to help with almost anything, "
        "and will do its best to understand exactly what is needed. It also tries to avoid "
        "giving false or misleading information, and it caveats when it isn't entirely sure "
        "about the right answer. That said, the assistant is practical and really does its best, "
        "and doesn't let caution get too much in the way of being useful.\n"
    )
    PROMPT = """Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Don't make up new terms which are not available in the context.

    {context}"""
    END_7B = "\n{query}"
    END_40B = "\nUser: {query}\nFalcon:"

    PARAMETERS = {
        "temperature": 0.9,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "truncate": 1000,
        "max_new_tokens": 1024,
        "seed": 42,
        # The first entry is blank in the source; set it to your model's end-of-text token.
        "stop_sequences": ["", "</s>"],
    }

    CLIENT_7B = Client("http://")  # Fill this part
    CLIENT_40B = Client("https://")  # Fill this part


    def parse_args():
        parser = argparse.ArgumentParser()
        parser.add_argument("--fname", type=str, required=True)
        parser.add_argument("--top-k", type=int, default=32)
        parser.add_argument("--window-size", type=int, default=128)
        parser.add_argument("--step-size", type=int, default=100)
        return parser.parse_args()


    def embed(fname, window_size, step_size):
        # Extract the text and collapse all whitespace into single spaces.
        text = extract_text(fname)
        text = " ".join(text.split())
        text_tokens = text.split()

        # Slide a window of window_size words over the text, step_size words at a time.
        sentences = []
        for i in range(0, len(text_tokens), step_size):
            window = text_tokens[i: i + window_size]
            if len(window) < window_size:
                break  # drop the trailing partial window
            sentences.append(window)
        paragraphs = [" ".join(s) for s in sentences]

        model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
        model.max_seq_length = 512
        cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

        embeddings = model.encode(
            paragraphs,
            show_progress_bar=True,
            convert_to_tensor=True,
        )
        return model, cross_encoder, embeddings, paragraphs


    def search(query, model, cross_encoder, embeddings, paragraphs, top_k):
        query_embeddings = model.encode(query, convert_to_tensor=True)
        # Match the device of the corpus embeddings instead of assuming a GPU.
        query_embeddings = query_embeddings.to(embeddings.device)
        hits = util.semantic_search(
            query_embeddings,
            embeddings,
            top_k=top_k,
        )[0]

        # Re-rank the retrieved candidates with the cross-encoder.
        cross_input = [[query, paragraphs[hit["corpus_id"]]] for hit in hits]
        cross_scores = cross_encoder.predict(cross_input)
        for hit, score in zip(hits, cross_scores):
            hit["cross_score"] = score

        results = []
        hits = sorted(hits, key=lambda x: x["cross_score"], reverse=True)
        for hit in hits[:5]:  # keep the five best passages after re-ranking
            results.append(paragraphs[hit["corpus_id"]].replace("\n", " "))
        return results


    if __name__ == "__main__":
        args = parse_args()
        model, cross_encoder, embeddings, paragraphs = embed(
            args.fname,
            args.window_size,
            args.step_size,
        )
        print(embeddings.shape)

        while True:
            print("\n")
            query = input("Enter query: ")
            results = search(
                query,
                model,
                cross_encoder,
                embeddings,
                paragraphs,
                top_k=args.top_k,
            )

            query_7b = PREPROMPT + PROMPT.format(context="\n".join(results))
            query_7b += END_7B.format(query=query)

            query_40b = PREPROMPT + PROMPT.format(context="\n".join(results))
            query_40b += END_40B.format(query=query)

            text = ""
            for response in CLIENT_7B.generate_stream(query_7b, **PARAMETERS):
                if not response.token.special:
                    text += response.token.text
            print("\n***7b response***")
            print(text)

            text = ""
            for response in CLIENT_40B.generate_stream(query_40b, **PARAMETERS):
                if not response.token.special:
                    text += response.token.text
            print("\n***40b response***")
            print(text)
Explanation
Let’s go through the code to understand how it works.
1. Importing Required Libraries
We begin by importing the libraries the chatbot needs: the `argparse` module for command-line argument parsing, `extract_text` from `pdfminer.high_level` to extract text from a PDF, `SentenceTransformer`, `CrossEncoder`, and `util` from the `sentence_transformers` library for computing embeddings and running semantic search, and `Client` from `text_generation` for sending prompts to the Falcon endpoints.
    import argparse

    from pdfminer.high_level import extract_text
    from sentence_transformers import SentenceTransformer, CrossEncoder, util
    from text_generation import Client
2. Setting Up Constants and Parameters
Next, we define the constants and parameters used throughout the code:
– `PREPROMPT`: A preamble placed at the start of every prompt that describes how the assistant should behave.
– `PROMPT`: A string template that tells the model to answer only from the supplied context; `{context}` is replaced with the retrieved passages.
– `END_7B` and `END_40B`: Templates appended to the end of the prompt to insert the user’s query, formatted for the falcon-7b and falcon-40b models, respectively.
– `PARAMETERS`: A dictionary of text-generation settings: temperature, top-p, repetition penalty, top-k, truncate length, maximum number of new tokens, seed, and stop sequences.
– `CLIENT_7B` and `CLIENT_40B`: Instances of the `Client` class from the `text_generation` module, one per model. Each must be initialized with the URL of an endpoint serving that model.
    PREPROMPT = "Below are a series of dialogues..."  # Add the full introductory text
    PROMPT = """Use the following pieces of context..."""  # Add the prompt template
    END_7B = "\n{query}"
    END_40B = "\nUser: {query}\nFalcon:"

    PARAMETERS = {
        "temperature": 0.9,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "truncate": 1000,
        "max_new_tokens": 1024,
        "seed": 42,
        # The first entry is blank in the source; set it to your model's end-of-text token.
        "stop_sequences": ["", "</s>"],
    }

    CLIENT_7B = Client("http://")  # Fill in the appropriate URL
    CLIENT_40B = Client("https://")  # Fill in the appropriate URL
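To see how these templates fit together, here is a minimal sketch of the prompt assembly. The constants are shortened stand-ins for the full strings, and `build_prompt` is a hypothetical helper that does not appear in the tutorial’s code; it only mirrors the concatenation done in the main loop.

```python
# Shortened stand-ins for the tutorial's constants (illustrative values only).
PREPROMPT = "Below are a series of dialogues between various people and an AI assistant.\n"
PROMPT = "Use the following pieces of context to answer the question at the end.\n\n{context}"
END_7B = "\n{query}"
END_40B = "\nUser: {query}\nFalcon:"


def build_prompt(context_chunks, query, end_template):
    """Fill the context slot with the retrieved chunks, then append the query suffix."""
    context = "\n".join(context_chunks)
    return PREPROMPT + PROMPT.format(context=context) + end_template.format(query=query)


chunks = ["Falcon is a family of language models.", "It comes in 7B and 40B sizes."]
prompt_40b = build_prompt(chunks, "What sizes does Falcon come in?", END_40B)
print(prompt_40b)
```

The only difference between the two models’ prompts is the suffix: the 40b variant frames the query as a `User:` turn and leaves `Falcon:` open for the model to complete.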
3. Command-Line Argument Parsing
The `parse_args()` function parses the command-line arguments passed to the script. It requires `--fname`, the path to the PDF file we want to extract text from, and accepts the optional arguments `--top-k` (default 32), `--window-size` (default 128), and `--step-size` (default 100).
    def parse_args():
        parser = argparse.ArgumentParser()
        parser.add_argument("--fname", type=str, required=True)
        parser.add_argument("--top-k", type=int, default=32)
        parser.add_argument("--window-size", type=int, default=128)
        parser.add_argument("--step-size", type=int, default=100)
        return parser.parse_args()
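As a quick illustration of how these arguments behave, the sketch below builds the same parser and feeds it an explicit argument list instead of reading the real command line. The file names `chatbot.py` and `paper.pdf` are placeholders, and `build_parser` is a hypothetical helper introduced only for this example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--fname", type=str, required=True)
    parser.add_argument("--top-k", type=int, default=32)
    parser.add_argument("--window-size", type=int, default=128)
    parser.add_argument("--step-size", type=int, default=100)
    return parser


# Passing a list to parse_args() simulates a command line such as:
#   python chatbot.py --fname paper.pdf --top-k 16
args = build_parser().parse_args(["--fname", "paper.pdf", "--top-k", "16"])

# argparse converts hyphens in option names to underscores on the result object.
print(args.fname, args.top_k, args.window_size, args.step_size)
```

Options that are not given on the command line keep their defaults, so only `--fname` is needed for a first run.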
4. PDF Text Extraction and Embedding
The `embed()` function takes the path to a PDF file, a window size, and a step size. It calls `extract_text()` from `pdfminer.high_level` to pull the text out of the PDF, collapses all whitespace, and splits the text into words. It then slides a window of `window_size` words across the text, advancing `step_size` words each time, so with the defaults (128 and 100) consecutive chunks overlap by 28 words. A trailing window shorter than `window_size` is dropped, which means a document with fewer than `window_size` words produces no chunks at all.
Next, the function loads two models from the Sentence Transformers library: a Sentence Transformer model (`all-mpnet-base-v2`) that computes an embedding for each chunk, and a Cross Encoder model (`ms-marco-MiniLM-L-6-v2`) that is used later to re-rank the search results.
The function returns the Sentence Transformer model, the Cross Encoder model, the chunk embeddings, and the chunks themselves (called `paragraphs` in the code) for further processing.
    def embed(fname, window_size, step_size):
        # Extract the text and collapse all whitespace into single spaces.
        text = extract_text(fname)
        text = " ".join(text.split())
        text_tokens = text.split()

        # Slide a window of window_size words over the text, step_size words at a time.
        sentences = []
        for i in range(0, len(text_tokens), step_size):
            window = text_tokens[i: i + window_size]
            if len(window) < window_size:
                break  # drop the trailing partial window
            sentences.append(window)
        paragraphs = [" ".join(s) for s in sentences]

        model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
        model.max_seq_length = 512
        cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

        embeddings = model.encode(
            paragraphs,
            show_progress_bar=True,
            convert_to_tensor=True,
        )
        return model, cross_encoder, embeddings, paragraphs
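The chunking step is easier to follow on a tiny input. The sketch below isolates the windowing loop into a hypothetical helper, `sliding_windows`, and runs it on ten single-letter “words” with a window of 4 and a step of 3. These values are chosen only for illustration.

```python
def sliding_windows(tokens, window_size, step_size):
    """Return overlapping word windows; a trailing partial window is dropped."""
    windows = []
    for start in range(0, len(tokens), step_size):
        window = tokens[start:start + window_size]
        if len(window) < window_size:
            break
        windows.append(" ".join(window))
    return windows


words = "a b c d e f g h i j".split()
chunks = sliding_windows(words, window_size=4, step_size=3)
print(chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

Each chunk shares its last word with the start of the next one, because the window (4) is one word longer than the step (3). A larger overlap makes it less likely that a relevant sentence is split across two chunks, at the cost of more chunks to embed.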
5. Semantic Search
The `search()` function takes a query, the Sentence Transformer model, the Cross Encoder model, the embeddings, the paragraphs, and a top-k value as inputs. It works in two stages. First, it embeds the query and uses the `semantic_search()` function from the Sentence Transformers library to find the `top_k` paragraphs whose embeddings are most similar to it. Second, it scores each (query, paragraph) pair with the Cross Encoder and sorts the candidates by that score.
The function returns the five highest-scoring paragraphs after re-ranking. Note that `top_k` only controls how many candidates reach the Cross Encoder; the final count of five is hard-coded.
    def search(query, model, cross_encoder, embeddings, paragraphs, top_k):
        query_embeddings = model.encode(query, convert_to_tensor=True)
        # Match the device of the corpus embeddings instead of assuming a GPU.
        query_embeddings = query_embeddings.to(embeddings.device)
        hits = util.semantic_search(
            query_embeddings,
            embeddings,
            top_k=top_k,
        )[0]

        # Re-rank the retrieved candidates with the cross-encoder.
        cross_input = [[query, paragraphs[hit["corpus_id"]]] for hit in hits]
        cross_scores = cross_encoder.predict(cross_input)
        for hit, score in zip(hits, cross_scores):
            hit["cross_score"] = score

        results = []
        hits = sorted(hits, key=lambda x: x["cross_score"], reverse=True)
        for hit in hits[:5]:  # keep the five best passages after re-ranking
            results.append(paragraphs[hit["corpus_id"]].replace("\n", " "))
        return results
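The retrieve-then-re-rank pattern can be sketched without any models. In the example below, the two scoring functions are toy stand-ins (simple word overlap) for the embedding similarity and the cross-encoder score; they are assumptions made for illustration and are not what the tutorial’s code computes. Only the two-stage control flow is the same.

```python
def retrieve_then_rerank(query, paragraphs, cheap_score, precise_score, top_k, final_k):
    """Keep the top_k paragraphs by a cheap score, then reorder them by a precise score."""
    # Stage 1: score every paragraph cheaply and keep the best top_k candidates.
    candidates = sorted(paragraphs, key=lambda p: cheap_score(query, p), reverse=True)[:top_k]
    # Stage 2: rescore only the candidates and keep the best final_k.
    reranked = sorted(candidates, key=lambda p: precise_score(query, p), reverse=True)
    return reranked[:final_k]


def word_overlap(query, paragraph):
    """Toy first-stage score: number of distinct words shared with the query."""
    return len(set(query.lower().split()) & set(paragraph.lower().split()))


def overlap_ratio(query, paragraph):
    """Toy second-stage score: shared words divided by paragraph length."""
    words = paragraph.lower().split()
    return word_overlap(query, paragraph) / len(words) if words else 0.0


paragraphs = [
    "falcon models sizes and many other unrelated words about falcon models here",
    "falcon models",
    "pdf extraction",
]
results = retrieve_then_rerank(
    "falcon models sizes", paragraphs, word_overlap, overlap_ratio, top_k=2, final_k=2
)
print(results)
```

The first stage ranks the long paragraph highest because it shares the most words with the query, but the second stage moves the short, focused paragraph to the top. The third paragraph never reaches the second stage, which is why the expensive scorer only has to run on `top_k` items.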
6. Main Execution
In the main execution part, we parse the command-line arguments, extract the PDF text and create embeddings using the `embed()` function. We then enter an infinite loop where the user can enter queries.
For each query, we call the `search()` function to retrieve the top results based on semantic search. We then construct prompts for the falcon-7b and falcon-40b models using the retrieved results and the user’s query.
Finally, we generate responses using the falcon-7b and falcon-40b models by making requests to the respective clients (`CLIENT_7B` and `CLIENT_40B`) and print the responses.
    if __name__ == "__main__":
        args = parse_args()
        model, cross_encoder, embeddings, paragraphs = embed(
            args.fname,
            args.window_size,
            args.step_size,
        )
        print(embeddings.shape)

        while True:
            print("\n")
            query = input("Enter query: ")
            results = search(
                query,
                model,
                cross_encoder,
                embeddings,
                paragraphs,
                top_k=args.top_k,
            )

            query_7b = PREPROMPT + PROMPT.format(context="\n".join(results))
            query_7b += END_7B.format(query=query)

            query_40b = PREPROMPT + PROMPT.format(context="\n".join(results))
            query_40b += END_40B.format(query=query)

            text = ""
            for response in CLIENT_7B.generate_stream(query_7b, **PARAMETERS):
                if not response.token.special:
                    text += response.token.text
            print("\n***7b response***")
            print(text)

            text = ""
            for response in CLIENT_40B.generate_stream(query_40b, **PARAMETERS):
                if not response.token.special:
                    text += response.token.text
            print("\n***40b response***")
            print(text)
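The two streaming loops do the same thing: they accumulate token text and skip special tokens such as end-of-text markers. The sketch below shows that logic in isolation. `Token`, `StreamResponse`, and `fake_stream` are hypothetical stand-ins that mimic only the `token.text` and `token.special` attributes used in the tutorial’s code, so the example runs without a model endpoint.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    special: bool


@dataclass
class StreamResponse:
    token: Token


def collect_stream(responses):
    """Concatenate the text of every non-special token in a response stream."""
    return "".join(r.token.text for r in responses if not r.token.special)


def fake_stream():
    # Stand-in for CLIENT_7B.generate_stream(query_7b, **PARAMETERS).
    yield StreamResponse(Token("Hello", False))
    yield StreamResponse(Token(",", False))
    yield StreamResponse(Token(" world", False))
    yield StreamResponse(Token("<|endoftext|>", True))


answer = collect_stream(fake_stream())
print(answer)  # Hello, world
```

Factoring the loop into a helper like this would also remove the duplication between the 7b and 40b branches of the main loop.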
Conclusion
In this tutorial, we have learned how to create our own PDF chatbot without relying on libraries with excessive abstraction. Because we wrote the text extraction, chunking, search, and prompt construction ourselves, we have full control over the bot’s behavior and can customize aspects such as preprocessing, the number of top results, and the search functionality. With self-hosted falcon-7b and falcon-40b endpoints, this approach also keeps our data private and lets us tailor the chatbot to our specific requirements. Feel free to experiment with different models, parameters, and prompt structures to further enhance the capabilities of your PDF chatbot. Happy coding!
Watch this video for reference – https://www.youtube.com/watch?v=hSQY4N1u3v0