Querying in Human Language over any set of Documents, leveraging LangChain, LLMs and Vector Databases


Rather than diving straight into the tech and what I built, let me first introduce the problem and what this system aims to address.

Here's how I've framed it:

How would we, as normal human beings, ask someone for something?

The comparison I've tried to draw is between traditional search, built on textual similarity algorithms (and, to an extent, semantic and contextual similarity algorithms), and the system I have built.


Groceries in Hinglish

Query: "Chai ke saath khane waala nashta"

Traditional Search Response: Nothing

How the system being built would respond:
Ji, aapki chai ke saath nashta ki demand sun kar mujhe samajh aa gaya ki aap kya dhundh rahe hain.
Humare yahan aapko milenge Bliss Tree ke Kodo Millet Cookies, jo ki aapki chai ke saath perfect match honge. Ye cookies millets se bane hote hain jo ki humare health ke liye bahut acche hote hain. Ye diabetes ko control karne mein help karte hain, blood pressure ko lower karte hain aur heart attacks ka risk bhi kam karte hain. Ye cookies aapko sirf Rs.40 mein milenge.

Aur ek option hai aapke liye Unibic ke Ragi Cookies. Ye cookies bhi aapki chai ke saath bahut acche match karte hain. Inme ragi hota hai jo ki fibre aur protein se bharpur hota hai. Ye aapko Rs.30 mein milenge.

Aur agar aap chai ki jagah kuch alag try karna chahte hain, toh humare paas hai Emperia ka Kashmiri Kahwa Tea. Ye green tea Indian spices, nuts aur saffron ke saath blend hoti hai. Ye aapko post meals ke liye perfect drink provide karegi. Ye aapko sirf Rs.159 mein mil jayega. Toh aap humse in products ko order kar sakte hain, aur hum aapko ghar tak delivery kar denge. Dhanyawaad!


Legal Documents

Query: "I need to understand the implications of force majeure in my rental agreement."

Traditional Search Response: Returns documents containing the exact phrase "force majeure."

How the system being built would respond: "As per Indian contract law, the force majeure clause in your rental agreement refers to unforeseen events like floods, earthquakes, or pandemics, which can relieve both you and the landlord from performing contractual duties without penalty. This clause is detailed in section 56 of the Indian Contract Act, 1872, and can be found under 'Clause 18: Force Majeure' in your specific agreement. It outlines the procedures for notification and steps to be taken if such an event occurs."


Financial Documents

Query: "What are the tax exemptions available for a new startup?"

Traditional Search Response: Lists documents with the keyword "tax exemptions."

How the system being built would respond: "New startups can benefit from tax exemptions under the Small Business Act for the first three years, especially if they're in the tech sector. Refer to the document 'Tax Code 2021, Section 12A' for a detailed list of eligible exemptions."


Groceries

Query: "I'm looking for a dairy-free, high-protein breakfast option."

Traditional Search Response: May show all breakfast items that are tagged dairy-free or high-protein.

How the system being built would respond: "The 'Almond Butter Protein Bar' is an excellent choice for a dairy-free, high-protein breakfast. Each bar contains 10 grams of plant-based protein and is made with natural ingredients without any dairy products."


Cosmetics

Query: "I want a moisturizer that's good for dry climates and sensitive skin."

Traditional Search Response: Returns products with tags like "moisturizer," "dry climate," or "sensitive skin."

How the system being built would respond: "Our 'Hydra-Soothe Cream' is specifically formulated for dry climates and sensitive skin, featuring hyaluronic acid and chamomile extract to provide deep hydration without irritation. It's recommended by dermatologists for similar conditions."


Electronics

Query: "Which laptop is best suited for graphic design work?"

Traditional Search Response: Lists laptops with keywords like "graphic design" or "high performance."

How the system being built would respond: "The 'ArtGraph X-Pro' laptop is favored among graphic designers for its 4K OLED display and NVIDIA RTX 3070 GPU, which provide exceptional color accuracy and rendering speeds for design software. It also comes with a color calibration report."

I hope I have your interest now.



Let's Start

How did I design this?

My use case focuses on answering questions over specific documents, relying solely on the information within those documents to generate accurate and context-aware answers.

LangChain is a powerful framework for developing applications driven by language models, while Qdrant, which is written in Rust, is an efficient vector database for building high-performance vector search applications.

By combining the power of semantic search with the impressive capabilities of LLMs like GPT, I will demonstrate how to build a state-of-the-art query engine that can answer human-language questions over any set of documents: PDF, DOCX, JSON, CSV, HTML, and audio/video transcriptions.


Design

Components

  • Document Loader :
    We use document loaders to load data from a source as Documents. A Document is a piece of text with associated metadata. For example, there are document loaders for loading a simple .txt file, the text contents of any web page, or even the transcript of a YouTube video.

      from langchain.document_loaders import PyPDFLoader
    
      loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
      pages = loader.load_and_split()
    
      pages[0]
    
      ##Output 
      # Document(page_content='LayoutParser : A Uni\x0ced Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\nfmelissadell,jacob carlson g@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model con\x0cgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\ne\x0borts to improve reusability and simplify deep learning (DL) model\ndevelopment in disciplines like natural language processing and computer\nvision, none of them are optimized for challenges in the domain of DIA.\nThis represents a major gap in the existing toolkit, as DIA is central to\nacademic research across a wide range of disciplines in the social sciences\nand humanities. This paper introduces LayoutParser , an open-source\nlibrary for streamlining the usage of DL in DIA research and applica-\ntions. The core LayoutParser library comes with a set of simple and\nintuitive interfaces for applying and customizing DL models for layout de-\ntection, character recognition, and many other document processing tasks.\nTo promote extensibility, LayoutParser also incorporates a community\nplatform for sharing both pre-trained models and full document digiti-\nzation pipelines. 
We demonstrate that LayoutParser is helpful for both\nlightweight and large-scale digitization pipelines in real-word use cases.\nThe library is publicly available at https://layout-parser.github.io .\nKeywords: Document Image Analysis ·Deep Learning ·Layout Analysis\n·Character Recognition ·Open Source library ·Toolkit.\n1 Introduction\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of\ndocument image analysis (DIA) tasks including document image classi\x0ccation [ 11,arXiv:2103.15348v2  [cs.CV]  21 Jun 2021', metadata={'source': 'example_data/layout-parser-paper.pdf', 'page': 0})
    
  • Text Splitter
    When we want to deal with long pieces of text, it is necessary to split that text into chunks. There is a lot of potential complexity here. Ideally, we want to keep semantically related pieces of text together, and what "semantically related" means can depend on the type of text.

    At a high level, text splitters work as follows:

    1. Split the text up into small, semantically meaningful chunks (often sentences).

    2. Start combining these small chunks into a larger chunk until we reach a certain size (as measured by some function).

    3. Once we reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which we can customize our text splitter:

  1. How the text is split

  2. How the chunk size is measured

I have used token splitters.

Language models have a token limit, which we should not exceed. When splitting text into chunks, it is therefore a good idea to count the number of tokens, and to count them with the same tokenizer the language model uses.

I have used tiktoken.

tiktoken is a fast BPE tokenizer created by OpenAI.

We can use it to estimate the tokens used, and it is more accurate for OpenAI models.

  1. How the text is split: by character passed in.

  2. How the chunk size is measured: by tiktoken tokenizer.

    # This is a long document we can split up.
    with open("../../../state_of_the_union.txt") as f:
        state_of_the_union = f.read()
    from langchain.text_splitter import CharacterTextSplitter
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=100, chunk_overlap=0
    )
    texts = text_splitter.split_text(state_of_the_union)
    print(texts[0])    

    #Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  
    #Last year COVID-19 kept us apart. This year we are finally together again. 
    #Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 
    #With a duty to one another to the American people to the Constitution.
  • Embedders

    I have used the embedding model text-embedding-ada-002
    to generate vector embeddings.

    I've provided an extremely detailed explanation of how these work in my article here:

    How Vector Embeddings and Vector Databases works?

      from langchain.embeddings import OpenAIEmbeddings

      # OPEN_AI_EMBEDDING_MODEL_ADA_002 = "text-embedding-ada-002"
      embedder = OpenAIEmbeddings(model=OPEN_AI_EMBEDDING_MODEL_ADA_002,
                                  openai_api_key=OPENAI_AI_API_KEY)

      text_vector = embedder.embed_query(query_string)
    
  • Vector Database
    I am using Qdrant as the vector DB. It's written in Rust and outperforms almost every other vector database available in terms of query time and the indexing algorithms supported.
    I've provided an extremely detailed explanation of how these work in my article here:

    How Vector Embeddings and Vector Databases works?

  • LLM for sanitizing queries : Since I query the vector database directly, I need to sanitize queries for spell correction, query streamlining, stopword removal, and cultural and contextual relevance to make them more effective.

    The model I am using : gpt-3.5-turbo
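    As a rough sketch of this sanitization step (the prompt wording and function name below are my illustration, not the exact implementation), the raw query is wrapped in a system instruction and sent to gpt-3.5-turbo; only the message construction is shown as runnable code, with the API call indicated in comments.

```python
# Hypothetical system prompt for the sanitization call; the real wording may differ.
SANITIZE_SYSTEM_PROMPT = (
    "Correct spelling, remove stopwords, and rewrite the user's query into a "
    "concise search query, preserving cultural and contextual cues."
)

def build_sanitize_messages(raw_query: str) -> list[dict]:
    """Build the chat messages for the gpt-3.5-turbo sanitization call."""
    return [
        {"role": "system", "content": SANITIZE_SYSTEM_PROMPT},
        {"role": "user", "content": raw_query},
    ]

messages = build_sanitize_messages("chai ke sath khane wala nashta")
# The actual call would look roughly like (requires an API key):
# response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
# clean_query = response["choices"][0]["message"]["content"]
print(messages[1]["content"])
```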

  • LLM with set Context : generates the final answer grounded solely in the retrieved document chunks.

    The model I am using : gpt-4-0613
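As a sketch of this answering step (the prompt wording and function name are my illustration, not the author's exact implementation), the chunks retrieved from Qdrant are stuffed into the prompt as numbered context, and gpt-4-0613 is instructed to answer only from them, which is how the sample responses can cite items as [1], [2], and so on.

```python
def build_answer_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Assemble a context-grounded prompt; chunks are numbered so the model
    can cite them as [1], [2], ... in its answer."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the user's question using ONLY the context below. "
        "Cite items by their [number].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

prompt = build_answer_prompt(
    "snacks to have with tea",
    ["Bliss Tree Kodo Millet Cookies, Rs.40", "Unibic Ragi Cookies, Rs.30"],
)
# The call itself would look roughly like (requires an API key):
# answer = openai.ChatCompletion.create(
#     model="gpt-4-0613",
#     messages=[{"role": "user", "content": prompt}],
# )
```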


What does the End Product look like ?


I used FastAPI to serve requests locally and wrap the system as an API; this is how I make the queries and get the results.

I used the BigBasket dataset of 28,000 items, indexed it into the vector DB, and made queries over it.

Results :

Query : Daaru ke saath Chakhna

Response :

Arre Bhaiya, aapko daaru ke saath chakhna chahiye, toh humare paas bahut acche options hai.

Ek toh hai humara "Chips - Pita, Parmesan And Mediterranean", ye ekdam crispy aur tasty hai, Parmesan cheese aur Mediterranean herbs ke saath. Aur sabse acchi baat ye hai ki ye low carbs aur low fat me bhi available hai. Aap isko Rs.40 me le sakte hain.

Dusra option hai "Potato Crisps Pizza flavour" , ye bhi ek bahut hi accha option hoga aapke liye. Isme pizza ka flavour hai, toh aapko lag hi nahi raha hoga ki aap chips kha rahe hain, balki pizza kha rahe hain. Ye bhi aapko sirf Rs.91.08 me mil jayega.

Aur haan, agar aapko meetha pasand hai, toh humare paas Chocolate - Rum & Raisin bhi hai. Ye ek premium chocolate hai rum and raisin flavour ke saath, jo aapko sirf Rs.100 me mil jayega.

Toh boliye, aapko kya chahiye?


Query : Morning Breakfast

Response :

For your 'Morning Breakfast' requirement,

I would recommend the following items from our inventory.

Oats : This is a versatile and healthy breakfast option. You can cook it with milk or water and add fruits, nuts or seeds for extra nutrition. Oats are a great source of soluble fibre and can keep you feeling full for longer, preventing unhealthy snacking.

Fruit N Fibre Muesli - Strawberry : This is a delicious and nutritious ready-to-eat breakfast option. The muesli contains British oats, wheat flakes, almonds and raisins, all blended with a delicious strawberry crush. It's a great way to start your day on a sweet note.

Multi Grain Idli Dosa Batter : If you prefer a traditional Indian breakfast, this multi grain idli dosa batter is an excellent choice. It's easy to cook and provides a healthy, tasty breakfast in no time. The batter contains sprouted ragi and brown rice, which are good sources of dietary fibre and support gut health.


Query : flower smell soap

Response :

Based on your request for a "flower smell soap" :

I would recommend our "& Moms Bathing Soap - Jasmine" [1]. This soap is made with minimal chemicals and is free from parabens, harsh chemical residues, artificial dyes, and talc. It also has a fragrance infused with 100% natural jasmine essence, which gives it a lovely floral smell. It is priced at Rs.108.0.

However, if you prefer a citrus fragrance, we also have "Citrus Blast Orange Lemon Soap Free Body Wash" [2] and "Lemon & Tea Tree Oil Soap" [3]. The Citrus Blast body wash is a soap-free formula with Orange and Lemon extracts, which is rich in Vitamin C and priced at Rs.225.0. The Lemon & Tea Tree Oil Soap, priced at Rs.360.0, gives a freshness that lasts longer with a fragrance of Lemon and the benefits of tea tree oil [3].


Query : snacks to have with tea

Response :

Based on your query, here are the most relevant items from our inventory that you can have with tea:

'Cookies - Free, With Chocolate & Cranberry'[1]: These sugar-free cookies from Poland are a perfect accompaniment for your tea time. They contain 10% sugar-free chocolate, 9% dried cranberries and pasteurised egg mass, giving your taste buds just the right uncensored taste.

'Cookies - Kodo Millet'[2]: Bliss Tree Cookies are another great option for your tea time. They contain the goodness of millets which are easy to digest and can minimize the risk of diabetes. They are delicious, healthy, pure and GMO-free.

'Cookies - My Healthy Ragi'[3]: Unibic Ragi Cookies are a delicious and healthy snack, packed with the goodness of ragi, which is high in fibre and protein. They are an ideal tea time snack, which can also be relished on-the-go or as a midday snack.


Query : Momos dip

Response :

Based on your request for 'Momos dip' :

The most relevant items from our inventory are:

Mango Chilli Sauce [4]: This sauce is made from sun-ripened mangoes, blended with red chillies, and it can be an excellent dip for momos. It adds a sweet and spicy flavor to your dish. It is an essential dipping sauce for various dishes including momos.

Please note that the other items in the references provided are not directly related to 'Momos dip'. However, if you're looking for a variety of sauces and dips, you can check our category 'Gourmet & World Food' under the sub-category 'Sauces, Spreads & Dips' on our website for more options.


Next, I used a company's multipage document on Safety and Training for their Warehouse workers,

to index over Vector DB and make queries over it

Results :

Query : what to do if electrical cords are damaged

Response :

"In the event of damaged electrical cords, you should adhere to the Electrical Safety Procedures outlined in the Blinkit Dark Stores Safety Training Video Script.
Specifically, under the section 'Awareness of Electrical Hazards', it's stated that all employees should be aware of potential electrical hazards, which include damaged power cords. If you notice a damaged cord, do not attempt to repair it yourself. Only qualified personnel should handle electrical equipment and perform installation, repairs, or maintenance. In line with 'Reporting Electrical Hazards', you should report the damaged cord immediately to your supervisor or the maintenance team. This reporting is crucial to prevent accidents and ensure a safe working environment. Remember, safety is our top priority as per the Dark Store Safety Charter. Always stay vigilant and prioritize your safety and the safety of your colleagues in all electrical-related tasks."

Query : aag kaise bujhaaye

Response :

Agar aapko aag bujhani hai, toh Blinkit Dark Stores Safety Training Video Script me diye gaye PASS technique ka use karein.

Sabse pehle, safety pin ko kheenchein taaki extinguisher unlock ho jaye. Dusre, nozzle ko aag ke base par aim karein jahaan fuel source hai.

Teesre, lever ko dabayein taaki extinguishing agent release ho. Aakhri me, extinguisher ko side se side tak sweep karein taaki puri aag area cover ho jaye.

Agar aag aapke control ke bahar hai, toh turant evacuate ho jayein sabse najdik ke fire exit ke through aur alarm bajakar dusron ko alert karein. Jab aap ek safe location par pahuch jayein, to emergency ko store manager ko report karein


Concluding

I have built a search platform where I can query any set of documents in human language and get a very well-articulated response. However, there are tradeoffs to everything, and here the tradeoff is time: it takes around 7-10 seconds to generate a proper response.

So I intend to use it as a Smart Query Assistant that can serve any number of use cases across B2C and B2B industries.
