Backend

Vectorizing Document Chunks

Vectorizing document chunks is a crucial step in preparing text for storage and retrieval in a vector database (vDB). This process involves converting text data into numerical vectors that efficiently represent the semantic meaning of the words or phrases. These vectors are then stored in the Qdrant vDB along with specific metadata to facilitate fast and accurate search and analysis capabilities.

Steps for Vectorizing Documents:

  1. Pre-processing: Clean the document chunks by removing stop words, punctuation, and any irrelevant characters. This step enhances the quality of the vectorization process.

  2. Tokenization: Split the cleaned text into individual words or tokens. This helps in transforming the text into a structured form.

  3. Embedding: Use an embedding technique such as Word2Vec, GloVe, or BERT to convert tokens into numerical vectors. Each vector captures the semantic meaning of the corresponding word or phrase.

    1. The particular embedding model can be configured through an environment file.

    2. Currently, chatgpt.enclave.io uses Sentence-BERT with the distiluse-base-multilingual-cased-v1 model.
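The steps above can be sketched in plain Python. Note that the embedding function here is a deterministic hash-based stand-in used purely for illustration; the actual system encodes cleaned text with the Sentence-BERT model named above, and the stop-word list is a tiny sample, not the real one.

```python
import hashlib
import re

# Minimal stop-word list for illustration; a real pipeline uses a full list.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "over"}

def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def embed(tokens: list[str], dim: int = 8) -> list[float]:
    """Toy stand-in for an embedding: hash each token into a fixed-size
    vector. The real system calls Sentence-BERT
    (distiluse-base-multilingual-cased-v1) on the cleaned text instead."""
    vector = [0.0] * dim
    for token in tokens:
        digest = hashlib.sha256(token.encode()).digest()
        for i in range(dim):
            vector[i] += digest[i] / 255.0
    return vector

chunk = "The quick brown fox jumps over the lazy dog."
tokens = preprocess(chunk)   # pre-processing + tokenization
vector = embed(tokens)       # embedding (toy stand-in)
```

The key point is the shape of the pipeline: raw chunk in, cleaned tokens through, fixed-length numerical vector out, ready to be stored with its metadata.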

Qdrant: Storing in the Vector Database (vDB):

After vectorization, the document vectors are saved in the vDB with specific metadata. The metadata includes information, such as the document ID, that aids in document retrieval and management.

Metadata Structure:

  • Document ID: Unique identifier for each document chunk.

  • Vector: The numerical vector representing the document chunk.
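A stored point therefore pairs the vector with its metadata payload. The sketch below builds such a point as a plain dictionary; with the qdrant-client library this would be a `PointStruct` passed to `upsert()` (the document ID shown is illustrative).

```python
import uuid

def make_point(document_id: str, vector: list[float]) -> dict:
    """Assemble a Qdrant-style point: a point ID, the embedding vector,
    and a payload carrying the metadata used for retrieval."""
    return {
        "id": str(uuid.uuid4()),                  # point ID in the collection
        "vector": vector,                         # embedding of the chunk
        "payload": {"document_id": document_id},  # metadata for retrieval
    }

point = make_point("doc-42-chunk-3", [0.1, 0.2, 0.3])
```

Keeping the document ID in the payload is what lets a similarity hit on the vector be traced back to the originating document chunk.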

Benefits:

  • Efficient Search: vDBs can perform fast similarity searches among vectors, enabling quick retrieval of relevant document chunks based on their semantic content.

  • Scalability: Vector databases are designed to handle large volumes of data, making them suitable for storing and querying extensive collections of document vectors.

  • Semantic Analysis: By analyzing the vectors, it's possible to gain insights into the themes, trends, and patterns within the text data.

Semantic search transforms the way we interact with data in ChatGPTFirewall by leveraging the nuanced understanding of language. Unlike traditional search methods that rely on keyword matching, semantic search utilizes the vectorized representations of document chunks to comprehend the query's intent and the context of the words. This approach allows for a more natural and intuitive search experience, enabling users to find relevant information even if the exact keywords aren't used in their query. Semantic search thus ensures that the results are not only based on the presence of specific terms but also on the overall meaning of the query, enhancing the efficiency and accuracy of information retrieval.

Further enhancing its utility, semantic search acts as a sophisticated filter that refines the scope of information and potential answers available for a user's question or query. By analyzing the semantic content of data stored in vector form, ChatGPTFirewall is able to narrow down the context before querying the large language model (LLM). This pre-selection process significantly reduces the amount of data sent to the LLM, making the retrieval of relevant information more efficient and contextually appropriate. This streamlined interaction between semantic search and the LLM not only improves response times but also enhances the overall user experience by delivering precise information that aligns with the user’s needs and intentions.
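The pre-selection step described above can be illustrated with a small stdlib-only sketch: rank stored chunks by cosine similarity to the query vector and keep only the top few as LLM context. In production this ranking is performed by a Qdrant similarity search rather than in application code; the vectors and IDs below are toy values.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def preselect(query_vec: list[float], stored: list[dict], top_k: int = 2) -> list[dict]:
    """Keep only the top_k most similar chunks as context for the LLM --
    the pre-selection step that shrinks what is sent to the model."""
    ranked = sorted(stored, key=lambda p: cosine(query_vec, p["vector"]), reverse=True)
    return ranked[:top_k]

stored = [
    {"document_id": "a", "vector": [1.0, 0.0]},
    {"document_id": "b", "vector": [0.9, 0.1]},
    {"document_id": "c", "vector": [0.0, 1.0]},
]
context = preselect([1.0, 0.05], stored)  # chunks a and b are most similar
```

Because similarity is computed on meaning-bearing vectors rather than keywords, chunk "b" is retrieved even though it shares no exact token with the query representation.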

Named Entity Recognition

One of the core functionalities of ChatGPTFirewall is named entity recognition. This feature enables users to selectively mask personal data before transmitting it to platforms such as ChatGPT. Currently, we automatically identify and replace names and locations in text with pseudonymized entities.
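The masking idea can be sketched as follows. To stay self-contained, this sketch uses a fixed entity lookup in place of the NER model that the real system uses to detect names and locations automatically; the entities and labels are illustrative.

```python
# Toy "detector": a fixed lookup of entities and their types. The real
# system finds names and locations automatically with an NER model.
KNOWN_ENTITIES = {"Alice": "PERSON", "Berlin": "LOCATION"}

def pseudonymize(text: str) -> tuple[str, dict]:
    """Replace each detected entity with a placeholder and return the
    replacement map needed to restore the original text later."""
    replacement_map = {}
    counters = {}
    for entity, label in KNOWN_ENTITIES.items():
        if entity in text:
            counters[label] = counters.get(label, 0) + 1
            placeholder = f"{label}_{counters[label]}"
            replacement_map[placeholder] = entity
            text = text.replace(entity, placeholder)
    return text, replacement_map

masked, mapping = pseudonymize("Alice lives in Berlin.")
# masked text contains PERSON_1 and LOCATION_1 instead of the real values
```

Only the masked text leaves the system; the replacement map stays local so responses can be de-pseudonymized on the way back.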

For a more detailed explanation of this feature, please refer to Chapter Named Entity Recognition.

Integrate with Django for Cohesive Functionality

To consolidate these techniques, we employ Django along with the Django Rest Framework module. With Django, we establish a database schema to store workflow-specific data, facilitating semantic search on the vDB.

The User model captures information about users, such as language settings, Auth0 data, and API calls. Users are associated with Rooms and Documents, linked through the roomDocuments table. The Room table archives chat history, encompassing not only user messages but also those of other participants, delineated by ChatGPT roles (system, user, and assistant).

An additional layer is the management of anonymized entities within each room. These entities form the basis of a replacement map that keeps pseudonymization consistent across the entire chat history of a room. This approach strengthens data privacy while preserving the coherence of conversations.
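The per-room consistency requirement can be sketched with a small illustrative class (the class and method names are hypothetical, not the actual schema): the same entity must always map to the same placeholder, no matter in which message of the room it appears.

```python
class RoomAnonymizer:
    """Keep one replacement map per room so the same entity always gets
    the same placeholder across the whole chat history (illustrative)."""

    def __init__(self):
        self.placeholders: dict[str, str] = {}  # entity text -> placeholder

    def placeholder_for(self, entity: str, label: str) -> str:
        if entity not in self.placeholders:
            # Number placeholders per label: PERSON_1, PERSON_2, ...
            n = sum(1 for p in self.placeholders.values() if p.startswith(label)) + 1
            self.placeholders[entity] = f"{label}_{n}"
        return self.placeholders[entity]

room = RoomAnonymizer()
first = room.placeholder_for("Alice", "PERSON")   # new entity, message 1
second = room.placeholder_for("Alice", "PERSON")  # same entity, message 2
third = room.placeholder_for("Bob", "PERSON")     # new entity, new placeholder
```

In the real system this map is persisted with the room rather than held in memory, so consistency survives across sessions.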

ER-Diagram

[Class diagram of the database schema]
