๐Ÿ” RAGify under the Scope๏ƒ

As we already presented in the Introduction, RAGify is a powerful tool that enhances the way you interact with PDF documents. It combines the strengths of retrieval systems and generative models to provide more informed, accurate, and contextually relevant outputs.

Install Dependencies๏ƒ

  • Install first Ollama. server in your machine.

  • In a new cmd run the commands bellow to install some models.

ollama pull hf.co/nomic-ai/nomic-embed-text-v1.5-GGUF:F32
ollama pull llama3.2:3b
ollama pull llama3.1:8b
ollama pull qwen:7b
  • Then in a new Conda env or venv install some python libraries with :

pip install -r requirements.txt

Implimenation of RAGify๏ƒ

utilitis.py๏ƒ

Countain usefull functions for non-arabic files.

chromadb.api.client.SharedSystemClient.clear_system_cache()
# This line for avoiding some errors when using ChromaDB with Streamlit

URL = "http://localhost:11434"
persist_directory="CHROMA_2"


embed_model = OllamaEmbeddings(
    model="hf.co/nomic-ai/nomic-embed-text-v1.5-GGUF:F32",
    base_url=URL,
    show_progress =True
)

The code clears the system cache of the ChromaDB client and sets up the local server URL and persistence directory. It then initializes the OllamaEmbeddings model using the nomic-embed-text-v1.5-GGUF model for text embeddings, with progress display enabled.

Example GIF
  1. Extract_pdf_content Extracts text from all pages of a PDF file.

    def Extract_pdf_content(pdf_file):
        """
        Extracts the content of each page in a PDF file and returns a list of pages.
        """
        reader = PdfReader(pdf_file)
        pages = []
        for page in reader.pages:
            pages.append(page.extract_text())
        return pages
    

    Description:

    • Reads the PDF file and extracts text content from each page.

    • Returns a list of text strings, where each string corresponds to a page.

  2. Proccess_Files Reads and processes multiple PDF files, updating the progress in a Streamlit app.

    def Proccess_Files(Files):
        if Files :
            st.title("๐Ÿ“„ Reading Files ...")
            progress_percentage = 0
            Documents = []
    
            total_files = len(Files)
            progress_bar = st.progress(0)
    
            for file_index, file in enumerate(Files):
                Pages_Contents = Extract_pdf_content(file)
                file_name = file.name
                for index, Page in enumerate(Pages_Contents):
                    document = Document(
                        page_content=Page,
                        metadata={"source": file_name, "PageNum": index + 1}
                    )
                    Documents.append(document)
                progress_percentage = int(((file_index + 1) / total_files) * 100)
                progress_bar.progress(progress_percentage, text=f"{progress_percentage} %")
    
            if progress_percentage == 100:
                st.success("โœ… Files processing completed!")
                st.session_state['Documents'] = Documents
            return Documents
        return None
    

    Description:

    • Uses the Extract_pdf_content function to process PDFs.

    • Updates progress dynamically in a Streamlit UI.

    • Stores processed documents in st.session_state for later use.

  3. Chunking Splits document text into manageable chunks for processing.

    def Chunking(documents):
        if documents :
            st.title("โœ‚๏ธ Chunking documents ...")
            text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000,chunk_overlap=600)
            Chunks = text_splitter.split_documents(documents)
            st.write("#### Number of Chunks is :",len(Chunks))
            if Chunks :
                st.success("โœ… Chunking completed!")
                st.session_state['Chunks'] = Chunks
            return Chunks
        return None
    

    Description:

    • Uses RecursiveCharacterTextSplitter to divide text into smaller chunks of size 2000 with an overlap of 600 characters.

    • Displays progress and stores the chunks in st.session_state.

  4. Create_Database Creates a Chroma vector database from text chunks.

    def Create_Database(Chunks):
        if Chunks :
            st.title("๐Ÿ—„๏ธ Creating ChromaDB ...")
            vector_store = Chroma.from_documents(Chunks, embed_model, persist_directory=persist_directory)
            st.success("โœ… ChromaDB is ready!")
            st.session_state['Vector_store'] = vector_store
    

    Description:

    • Converts document chunks into vector representations using embeddings and stores them in ChromaDB.

    • Stores the vector database in st.session_state.

  5. Retrieve Retrieves the most relevant chunks for a given question.

    def Retrieve(Question):
        db = Chroma(persist_directory=persist_directory, embedding_function=embed_model)
        results = db.similarity_search_with_relevance_scores(Question, k=5)
        context_text = "\n\n---\n\n".join([chunk.page_content for chunk, _score in results])
        prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
        prompt = prompt_template.format(context=context_text, question=Question)
        return prompt, context_text
    

    Description:

    • Searches the ChromaDB for the top 5 relevant chunks for the input question.

    • Formats the results into a prompt template for the language model.

  6. Run_Pipeline Runs the retrieval and generation pipeline for a question.

    def Run_Pipeline(question, LLM_Name):
        prompt, _ = Retrieve(question)
        st.write("### ๐Ÿงพ Prompt")
        st.text_area(label="", value=prompt, height=200)
    
        llm = Ollama(model=LLM_Name, base_url=URL)
        response = llm.invoke(prompt)
        return response
    

    Description:

    • Combines the retrieval step with the LLM to generate answers for a user query.

    • Displays the generated prompt and retrieves the final response.

  7. RunLLM Runs the LLM directly with a user-provided question.

    def RunLLM(question, LLM_Name):
        llm = Ollama(model=LLM_Name, base_url=URL)
        response = llm.invoke(question)
        return response
    

    Description:

    • Directly queries the LLM without retrieval for a simpler use case.

utilitis1.py๏ƒ

For arabic files.

chromadb.api.client.SharedSystemClient.clear_system_cache()
# This line for avoiding some errors when using ChromaDB with Streamlit

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
URL = "http://localhost:11434"
persist_directory="CHROMA_Arabic"

This line initializes the embedding model using the HuggingFace paraphrase-multilingual-mpnet-base-v2 model for multilingual text embeddings, sets the local server URL, and defines the directory for storing the Chroma database.

  1. Extract_pdf_content_1 Extracts text from all pages of an Arabic PDF file.

    def Extract_pdf_content_1(pdf_file):
        """
        Extracts the content of each page in a PDF file and returns a list of pages.
        """
        reader = PdfReader(pdf_file)
        pages = []
        for page in reader.pages:
            pages.append(page.extract_text())
        return pages
    

    Description:

    • Reads the Arabic PDF file and extracts text content from each page.

    • Returns a list of strings, each representing the content of a single page.

  2. Proccess_Files_1 Processes multiple Arabic PDF files and tracks progress in Streamlit.

    def Proccess_Files_1(Files):
        if Files :
            st.title("๐Ÿ“„ ู‚ุฑุงุกุฉ ุงู„ู…ู„ูุงุช ...")
            progress_percentage = 0
            Documents = []
    
            total_files = len(Files)
            progress_bar = st.progress(0)
    
            for file_index, file in enumerate(Files):
                Pages_Contents = Extract_pdf_content_1(file)
                file_name = file.name
                for index, Page in enumerate(Pages_Contents):
                    document = Document(
                        page_content=Page,
                        metadata={"source": file_name, "PageNum": index + 1}
                    )
                    Documents.append(document)
                progress_percentage = int(((file_index + 1) / total_files) * 100)
                progress_bar.progress(progress_percentage, text=f"{progress_percentage} %")
    
            if progress_percentage == 100:
                st.success("โœ… ุชู… ุงู„ุงู†ุชู‡ุงุก ู…ู† ู…ุนุงู„ุฌุฉ ุงู„ู…ู„ูุงุช!")
                st.session_state['Documents_1'] = Documents
    
            print(Documents)
            return Documents
    

    Description:

    • Uses Extract_pdf_content_1 to extract text from each PDF.

    • Displays a progress bar and stores processed documents in st.session_state.

  3. Chunking_1 Splits Arabic document text into smaller chunks for better processing.

    def Chunking_1(documents):
        if documents :
            st.title("โœ‚๏ธ ุชู‚ุณูŠู… ุงู„ูˆุซุงุฆู‚ ...")
            text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=600)
            Chunks = text_splitter.split_documents(documents)
            st.write("#### Number of Chunks is :", len(Chunks))
            if Chunks :
                st.success("โœ… ุชู… ุชู‚ุณูŠู… ุงู„ูˆุซุงุฆู‚ ุจู†ุฌุงุญ!")
                st.session_state['Chunks_1'] = Chunks
            return Chunks
        return None
    

    Description:

    • Uses RecursiveCharacterTextSplitter to split the Arabic document text into chunks of size 2000 with an overlap of 600 characters.

    • Stores the chunks in st.session_state.

  4. Create_Database_1 Creates a Chroma vector database for Arabic document chunks.

    def Create_Database_1(Chunks):
        if Chunks :
           st.title("๐Ÿ—„๏ธ ุฅู†ุดุงุก ู‚ุงุนุฏุฉ ุจูŠุงู†ุงุช ChromaDB ...")
           vector_store = Chroma.from_documents(Chunks, embedding_model, persist_directory=persist_directory)
           st.success("โœ… ู‚ุงุนุฏุฉ ุจูŠุงู†ุงุช  ุฌุงู‡ุฒุฉ!")
           st.session_state['Vector_store_1'] = vector_store
    

    Description:

    • Converts document chunks into vector embeddings using the HuggingFaceEmbeddings model.

    • Stores these embeddings in a ChromaDB instance.

  5. Retrieve_1 Retrieves the most relevant Arabic text chunks for a given question.

    def Retrieve_1(Question):
        db = Chroma(persist_directory=persist_directory, embedding_function=embedding_model)
        results = db.similarity_search_with_relevance_scores(Question, k=5)
        context_text = "\n\n---\n\n".join([chunk.page_content for chunk, _score in results])
        prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
        prompt = prompt_template.format(context=context_text, question=Question)
        return prompt, context_text
    

    Description:

    • Searches the ChromaDB for the top 5 relevant chunks based on the input question.

    • Formats the results into a custom Arabic prompt template for further processing.

  6. Run_Pipeline_1 Runs the entire pipeline to retrieve and answer a question using an Arabic LLM.

    def Run_Pipeline_1(question, LLM_Name):
        prompt, _ = Retrieve_1(question)
        st.write("### ๐Ÿงพ ุงู„ุทู„ุจ")
        st.text_area(label="", value=prompt, height=200)
    
        llm = Ollama(model=LLM_Name, base_url=URL)
        response = llm.invoke(prompt)
        return response
    

    Description:

    • Combines the retrieval step with the LLM for generating responses to user queries.

    • Displays the generated prompt and retrieves the final response.

RAGify Demo Video๏ƒ

Here is a video of the RAGify pipeline in action: