EduAss: Your Personal GenAI Search Engine for Local Files

Powered by LangChain, Qdrant, and Gemini


In a world overflowing with information, finding the precise knowledge you need within your own documents can feel like an impossible task. While Large Language Models (LLMs) like ChatGPT have revolutionized information access, they often stumble when asked to process and understand our local files.

This is where EduAss, a local file search engine powered by the cutting-edge Gemini API, steps in to empower your quest for knowledge. EduAss is built upon the principles of Retrieval Augmented Generation (RAG), a powerful approach that combines the strengths of information retrieval and generative AI.

How RAG Works.

  1. Knowledge Base. Your valuable documents (PDFs, text files, Word documents, and even PowerPoint presentations) form the core knowledge base.

  2. Indexing & Embedding. EduAss intelligently indexes your documents, breaking them down into smaller chunks and transforming them into numerical representations called "embeddings." These embeddings capture the semantic meaning of your text.

  3. Query Understanding. When you ask a question, EduAss uses the same embedding techniques to understand the meaning and intent behind your query.

  4. Relevant Retrieval. The system then compares your question's embedding to the document embeddings, pinpointing the most relevant information within your files.

  5. Generative AI Magic. Finally, EduAss leverages the Gemini API to synthesize a concise and accurate answer, drawing directly from the retrieved information and providing clear citations back to the source documents.
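
The retrieval steps (3 and 4) can be sketched in a few lines. This is a toy illustration, not EduAss's actual code: the three-dimensional vectors are hand-made stand-ins for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helper names chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hand-made vectors standing in for real embeddings (assumption for illustration).
chunks = {
    "photosynthesis notes": [0.9, 0.1, 0.0],
    "tax report": [0.0, 0.2, 0.9],
    "plant biology slides": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # pretend embedding of "How do plants make energy?"

print(top_k(query, chunks))
```

In a real system the vectors have hundreds of dimensions and a vector database performs the ranking, but the idea is the same: chunks whose vectors point in a similar direction to the query vector are treated as relevant.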

Imagine effortlessly querying your research papers, textbooks, or project reports using natural language and receiving precise answers with references to the source document. No more endless scrolling or keyword stuffing - EduAss brings the power of semantic search right to your digital doorstep.

Implementation Details

EduAss leverages a potent combination of technologies:

  1. Gemini API. Google's powerful text generation API forms the backbone of our question-answering system.

  2. LangChain. This framework simplifies the interaction with LLMs and streamlines the development process.

  3. Qdrant. A high-performance vector database, Qdrant allows us to store and query document embeddings efficiently.

  4. Semantic Indexing. By converting documents into numerical representations (embeddings), EduAss enables semantic search, understanding the meaning behind your queries.

Let's dive into the code and break down the implementation step-by-step.

1. Folder Structure.

We begin by creating directories to store indexed documents and vector embeddings.

import os

def create_directories():
    documents_index_path = os.path.join(os.path.expanduser("~"), "documents_index")
    local_qdrant_path = os.path.join(os.path.expanduser("~"), "local_qdrant")

    if not os.path.exists(documents_index_path):
        os.makedirs(documents_index_path)
    if not os.path.exists(local_qdrant_path):
        os.makedirs(local_qdrant_path)

    return documents_index_path, local_qdrant_path

def create_env_file():
    # Get the parent directory of the current script
    parent_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))

    env_file_path = os.path.join(parent_dir, ".env")

    if not os.path.exists(env_file_path):
        with open(env_file_path, 'w') as f:
            pass

    return env_file_path

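This code snippet creates two folders in your home directory: "documents_index" to store uploaded documents and "local_qdrant" to house the Qdrant database. The second helper, create_env_file, creates an empty .env file one level above the script's directory if none exists yet.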

2. Document Indexing

Next, we'll index our documents. This process involves:

  • File Handling. Reading and processing various file formats like PDF, TXT, DOCX, and PPTX.

  • Text Splitting. Breaking down large documents into smaller chunks for efficient embedding generation.

  • Embedding Generation. Transforming text chunks into numerical vectors using Google's GoogleGenerativeAIEmbeddings.

  • Storing Embeddings. Saving the embeddings along with metadata (like file path) in the Qdrant database.

import PyPDF2
from os import listdir
from os.path import isfile, join, isdir
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_qdrant import Qdrant
import sys
from langchain_text_splitters import TokenTextSplitter
from pptx import Presentation
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
import docx
from dotenv import load_dotenv
import os

load_dotenv()

def get_files(dir):
    file_list = []
    for f in listdir(dir):
        if isfile(join(dir,f)):
            file_list.append(join(dir,f))
        elif isdir(join(dir,f)):
            file_list= file_list + get_files(join(dir,f))
    return file_list

def getTextFromWord(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

def getTextFromPPTX(filename):
    prs = Presentation(filename)
    fullText = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                fullText.append(shape.text)
    return '\n'.join(fullText)

class Document:
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

def main_indexing(mypath):
    model_name = "models/embedding-001"
    hf = GoogleGenerativeAIEmbeddings(
        model=model_name
    )
    onlyfiles = get_files(mypath)
    file_content = ""
    qdrant = None

    local_qdrant_path = os.path.join(os.path.expanduser("~"), "local_qdrant")

    for file in onlyfiles:
        file_content = ""
        if file.endswith(".pdf"):
            print("indexing "+file)
            reader = PyPDF2.PdfReader(file)
            for i in range(0,len(reader.pages)):
                file_content = file_content + " "+reader.pages[i].extract_text()
        elif file.endswith(".txt"):
            print("indexing " + file)
            with open(file,'r') as f:
                file_content = f.read()
        elif file.endswith(".docx"):
            print("indexing " + file)
            file_content = getTextFromWord(file)
        elif file.endswith(".pptx"):
            print("indexing " + file)
            file_content = getTextFromPPTX(file)
        else:
            continue
        text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)
        texts = text_splitter.split_text(file_content)
        metadata = {"path":file}
        documents = [Document(text,metadata) for text in texts]
        collection_name = "MyCollection"
        if qdrant is None:
            qdrant = Qdrant.from_documents(
                                    documents,
                                    hf,
                                    path= local_qdrant_path,
                                    collection_name=collection_name
                                    )
        else:
            qdrant.add_documents(documents)
    print(onlyfiles)
    print("Finished indexing!")

This comprehensive code snippet handles file processing, text splitting, embedding generation, and storage in the Qdrant database.
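
To make the chunk_size and chunk_overlap parameters more concrete, here is a minimal sketch of overlapping windows. It is only an approximation of the idea: whitespace-separated words stand in for the tokens a real tokenizer would produce, and `split_with_overlap` is a hypothetical helper, not part of LangChain.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

# Words stand in for tokens here (assumption for illustration).
words = "one two three four five six seven eight nine ten".split()
for chunk in split_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Each window repeats the last token of the previous one, so a sentence that straddles a chunk boundary still appears with some of its surrounding context in at least one chunk.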

3. Retrieval and Answer Generation

When you pose a question, EduAss springs into action:

  • Query Embedding: Your question is converted into a numerical vector using the same GoogleGenerativeAIEmbeddings model.

  • Similarity Search: Qdrant efficiently retrieves the most relevant document chunks based on the similarity between your question's embedding and the stored document embeddings.

  • Answer Synthesis: The retrieved chunks and your question are fed to the Gemini API, which crafts a concise and informative answer, complete with references to the source documents.

import os
import qdrant_client
from langchain_qdrant import Qdrant
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Same location the indexing step writes to
LOCAL_QDRANT_PATH = os.path.join(os.path.expanduser("~"), "local_qdrant")

def search(query):
    model_name = "models/embedding-001"
    hf = GoogleGenerativeAIEmbeddings(model=model_name)
    client = qdrant_client.QdrantClient(path=LOCAL_QDRANT_PATH)
    collection_name = "MyCollection"
    qdrant = Qdrant(client, collection_name, hf)

    found_docs = qdrant.similarity_search(query=query, k=10)
    list_res = []
    # Give each result a distinct id
    for i, res in enumerate(found_docs):
        list_res.append({"id":i,"path":res.metadata.get("path"),"content":res.page_content})
    return list_res

def retrieve_and_answer(query):
    output_parser = StrOutputParser()

    model_name = "models/embedding-001"
    hf = GoogleGenerativeAIEmbeddings(model=model_name)
    client = qdrant_client.QdrantClient(path=LOCAL_QDRANT_PATH)
    collection_name = "MyCollection"
    qdrant = Qdrant(client, collection_name, hf)

    found_docs = qdrant.similarity_search(query=query, k=10)

    i = 0
    list_res = []
    context = ""
    mappings = {}
    for res in found_docs:
        context = context + str(i)+"\n"+res.page_content+"\n\n"
        mappings[i] = res.metadata.get("path")
        list_res.append({"id":i,"path":res.metadata.get("path"),"content":res.page_content})
        i = i +1

    model = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0, convert_system_message_to_human=True)
    prompt = """Answer the user’s question using the documents given in the context. In the context are documents that should contain an answer. Please always reference the document ID (in square brackets, for example [0],[1]) of the document that was used to make a claim. Use as many citations and documents as it is necessary to answer a question.
            'Documents:\n{context}\n\nQuestion: {query}'"""

    prompt = ChatPromptTemplate.from_template(template=prompt)
    chain = (
        RunnablePassthrough()
        | prompt
        | model
        | output_parser
    )
    results = chain.invoke( {"context":context, "query":query})
    return results,list_res

This code showcases how EduAss retrieves relevant information and uses the Gemini API to generate insightful answers with proper citations.
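
The context-building loop is the piece that makes citations possible, so it is worth isolating. The sketch below mirrors that loop without calling Qdrant or Gemini; the sample chunks and file paths are made up, and `build_context` is a hypothetical name used only here.

```python
def build_context(chunks):
    """Number each retrieved chunk and remember which file it came from."""
    context_parts = []
    id_to_path = {}
    for doc_id, chunk in enumerate(chunks):
        # The ID on its own line is what the model is asked to cite as [ID].
        context_parts.append(f"{doc_id}\n{chunk['content']}\n\n")
        id_to_path[doc_id] = chunk["path"]
    return "".join(context_parts), id_to_path

# Made-up retrieval results (assumption for illustration).
retrieved = [
    {"path": "/docs/biology.pdf", "content": "Plants convert light into chemical energy."},
    {"path": "/docs/notes.txt", "content": "Chlorophyll absorbs mostly red and blue light."},
]

context, id_to_path = build_context(retrieved)
print(context)
print(id_to_path)
```

Because every chunk carries a numeric ID in the prompt, a citation such as [1] in the model's answer can be mapped back to a file path after generation.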

4. API Endpoint

To make EduAss accessible, we create a simple API using FastAPI:

from fastapi import FastAPI
from pydantic import BaseModel
from retriver import retrieve_and_answer,search

app = FastAPI()

class Query(BaseModel):
    query: str

@app.post("/answer")
async def answer_query(query: Query):
    results,list_res = retrieve_and_answer(query.query)
    return {"answer": results,"context":list_res}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="localhost", port=8000)

This API endpoint receives your questions and returns the generated answer alongside relevant document context.
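
For readers who want to call the endpoint without the UI, a client could look roughly like this. It is a sketch using only the standard library; `build_answer_request` and `ask` are hypothetical helpers, and the URL assumes the API is running locally on port 8000 as configured above.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/answer"  # host and port from the uvicorn call above

def build_answer_request(question, url=API_URL):
    """Build (but do not send) a POST request for the /answer endpoint."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )

def ask(question):
    """Send the question to a running EduAss API and return the parsed JSON reply."""
    with urllib.request.urlopen(build_answer_request(question)) as response:
        return json.loads(response.read().decode("utf-8"))

# Inspect the request without needing a running server.
request = build_answer_request("What is RAG?")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the API running, `ask("What is RAG?")` should return a dictionary with the "answer" and "context" keys produced by the endpoint.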

5. User Interface

Finally, a user-friendly Streamlit interface ties it all together:

import re
import streamlit as st
import requests
import json
import os   
from indexing1 import main_indexing as main_index

def list_document_titles(documents_index_path):
    files = [f for f in os.listdir(documents_index_path) if os.path.isfile(os.path.join(documents_index_path, f))]
    titles = [os.path.splitext(file)[0] for file in files] 
    return titles

st.title('_:blue[EDUASS Local GenAI Search]_ :sunglasses:')

st.sidebar.title("Configuration")
google_api_key = st.sidebar.text_input("Gemini API Key", "", type="password")
if st.sidebar.button("Submit"):
    with open('.env', 'w') as f:
        f.write(f"GOOGLE_API_KEY={google_api_key}")

uploaded_file = st.sidebar.file_uploader("Upload Documents", type=['txt', 'pdf', 'docx', 'pptx'])

documents_index_path = os.path.join(os.path.expanduser("~"), "documents_index")

if uploaded_file is not None:
    with open(os.path.join(documents_index_path, uploaded_file.name), 'wb') as f:
        f.write(uploaded_file.getbuffer())
    st.sidebar.success('File \'{}\' uploaded successfully.'.format(uploaded_file.name))

    if st.sidebar.button("Index Documents"):
        st.sidebar.text("Indexing in progress...")
        main_index(documents_index_path)
        st.sidebar.text("Indexing completed.")

document_titles = list_document_titles(documents_index_path)
for title in document_titles:
    st.sidebar.text(title)   

question = st.text_input("Ask a question based on your local files", "")
if st.button("Ask a question"):
    st.write("The current question is \"", question+"\"")
    url = "http://localhost:8000/answer"

    payload = json.dumps({
      "query": question
    })
    headers = {
      'Accept': 'application/json',
      'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)

    answer = json.loads(response.text)["answer"]
    rege = re.compile(r"\[Document [0-9]+\]|\[[0-9]+\]")
    m = rege.findall(answer)
    num = []
    for n in m:
        num = num + [int(s) for s in re.findall(r'\b\d+\b', n)]

    st.markdown(answer)
    documents = json.loads(response.text)['context']
    show_docs = []
    for n in num:
        for doc in documents:
            if int(doc['id']) == n:
                show_docs.append(doc)
    a = 1244
    for doc in show_docs:
        with st.expander(str(doc['id'])+" - "+doc['path']):
            st.write(doc['content'])
            with open(doc['path'], 'rb') as f:
                st.download_button("Download file", f, file_name=os.path.basename(doc['path']), key=a)
                a = a + 1

This user interface allows you to upload files, trigger indexing, ask questions, and view answers alongside their source documents.
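
The citation handling in the UI boils down to pulling document IDs out of the generated answer. A compact sketch of that step follows; `cited_ids` is a hypothetical helper, and unlike the UI code above it drops repeated IDs so each source would be shown only once.

```python
import re

# Matches both "[3]" and "[Document 3]" and captures the number.
CITATION_PATTERN = re.compile(r"\[(?:Document )?(\d+)\]")

def cited_ids(answer):
    """Return the document IDs cited in the answer, in order of first appearance."""
    seen = []
    for match in CITATION_PATTERN.finditer(answer):
        doc_id = int(match.group(1))
        if doc_id not in seen:
            seen.append(doc_id)
    return seen

# Made-up answer text (assumption for illustration).
answer = "Plants use sunlight [0]. Chlorophyll absorbs red light [Document 2], as noted in [0]."
print(cited_ids(answer))
```

The resulting IDs can then be matched against the "context" list returned by the API to decide which source chunks to display.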

6. Automation Script

The following script streamlines the setup and execution of EduAss.

import os
import subprocess
import sys
from create_folders import create_directories

def setup_application():
    current_dir = os.path.dirname(os.path.realpath(__file__))
    documents_index_path, local_qdrant_path = create_directories()

    venv_dir = os.path.join(current_dir, ".venv")
    subprocess.run([sys.executable, "-m", "venv", venv_dir])

    # Virtual environments keep executables in "Scripts" on Windows and "bin" elsewhere
    is_windows = os.name == 'nt'
    bin_dir = os.path.join(venv_dir, "Scripts" if is_windows else "bin")
    exe_suffix = ".exe" if is_windows else ""

    pip_path = os.path.join(bin_dir, "pip" + exe_suffix)
    requirements_path = os.path.join(current_dir, "requirements.txt")
    subprocess.run([pip_path, "install", "-r", requirements_path])

    python_path = os.path.join(bin_dir, "python" + exe_suffix)
    api_path = os.path.join(current_dir, "main.py")
    api_process = subprocess.Popen([python_path, api_path])

    streamlit_path = os.path.join(bin_dir, "streamlit" + exe_suffix)
    ui_path = os.path.join(current_dir, "user_interface.py")
    if os.name == 'nt':
        streamlit_process = subprocess.Popen(["start", "cmd", "/k", streamlit_path, "run", ui_path], shell=True)
    else:
        streamlit_process = subprocess.Popen(["gnome-terminal", "--", streamlit_path, "run", ui_path])

    api_process.wait()
    streamlit_process.wait()

if __name__ == "__main__":
    setup_application()

This script takes care of virtual environment setup, dependency installation, and running both the API and Streamlit application.

Setup & Usage.

The project is available on GitHub at https://github.com/Osen761/Eduass.

  1. Clone the repository. git clone https://github.com/Osen761/Eduass.git

  2. Navigate to the project directory. cd Eduass

  3. Install dependencies. python3 setup.py

  4. Obtain a Gemini API key from Google AI Studio.

  5. Run the application. python3 setup.py

  6. Access the Streamlit UI in your web browser.

  7. Upload your documents, click "Index Documents", and start asking questions!

Use Cases. EduAss in Action

EduAss's ability to unlock insights from your local files opens up a world of possibilities across various domains. Here are just a few examples:

1. Education

  • Students.

    • Problem. Struggling to find specific information within a mountain of research papers for a term paper.

    • EduAss Solution. Upload all your papers and ask, "What are the main arguments for and against using AI in education?" EduAss will provide a concise summary, citing the relevant papers.

  • Researchers.

    • Problem. Sifting through years of academic publications to identify trends in a specific research area.

    • EduAss Solution. Upload your literature database and ask, "How has the use of virtual reality in medical training evolved over the past decade?" EduAss will pinpoint key papers and highlight significant developments.

2. Business

  • Marketing Teams.

    • Problem. Need to quickly analyze customer feedback scattered across multiple reports and presentations.

    • EduAss Solution. Upload all relevant files and ask, "What are the most common complaints about our latest product launch?" EduAss will summarize the feedback, identifying areas for improvement.

  • Sales Professionals.

    • Problem. Preparing for a client meeting and needing to quickly extract relevant information from past proposals and contracts.

    • EduAss Solution. Upload the client's folder and ask, "What were the key deliverables and timelines from our last project with Company X?" EduAss will provide the key details, ensuring a well-prepared meeting.

3. Healthcare

  • Doctors and Clinicians.

    • Problem. Accessing a patient's complete medical history, including physician notes and test results, to make informed decisions.

    • EduAss Solution. With proper security and privacy measures in place, EduAss can allow authorized healthcare professionals to query a patient's file. For example, "What were the results of the patient's last three cholesterol tests?"

  • Medical Researchers.

    • Problem. Analyzing clinical trial data and research papers to identify potential side effects of a new drug.

    • EduAss Solution. Upload all relevant data and ask, "Were there any reported cases of insomnia in patients who received Drug Y?" EduAss will quickly pinpoint and summarize any mentions of insomnia in the data.

4. Legal

  • Lawyers.

    • Problem. Reviewing thousands of pages of legal documents to prepare for a court case.

    • EduAss Solution. Upload the case files and ask, "Did the defendant have any prior knowledge of the incident?" EduAss can highlight relevant clauses, precedents, and testimonies that support or refute the claim.

  • Legal Researchers.

    • Problem. Searching for specific legal precedents and rulings across a vast library of legal texts.

    • EduAss Solution. Upload legal databases and ask, "Are there any precedents for using the 'fair use' doctrine in cases involving AI-generated content?" EduAss can swiftly provide relevant case citations and summaries.

5. Personal Productivity

  • Writers & Researchers.

    • Problem. Managing and referencing a large collection of notes, articles, and research materials.

    • EduAss Solution. Upload your entire research library and ask targeted questions, like "Find all the quotes I saved about artificial intelligence and ethics."

These are just a few examples of how EduAss can transform the way we interact with our local information. Its potential applications are vast and adaptable to countless industries and individual needs.

Conclusion

EduAss puts GenAI-powered search directly in your hands, letting you navigate and extract knowledge from your personal document library. EduAss was developed and tested on Ubuntu. It is intended to be cross-platform, but the launcher script opens the UI through gnome-terminal on non-Windows systems, so other platforms may need small adjustments. Your feedback is invaluable: we encourage you to try EduAss on your preferred system and share your experience.

Let's collaborate to make this tool even better! Join the project on GitHub and contribute your ideas, code, or bug reports.

This is just the beginning! We are passionate about pushing the boundaries of AI-powered search and will be releasing new projects every week. Follow us and subscribe to stay updated on our latest innovations. Let's unlock the full potential of information, together!