How Do We Pick Our Embedding Model?

At Rememberizer.ai, we are committed to delivering accurate and efficient knowledge embedding services. To that end, we evaluated a range of vector embedding models to identify the one that best fits our requirements.

Our Datasets

The evaluation process spanned two distinct datasets: Dataset A, a collection of AI-generated text unlikely to have been encountered during the training of existing embedding models, and Dataset B, a specialized corpus of US patents known for its complexity and domain-specific terminology. The inclusion of Dataset A allowed us to assess the models' performance on novel, synthetic data, while Dataset B provided a robust test for handling intricate, technical language.

Examples:

Query: penguins, hamsters, teapots, Belarus, chaos

Target document:

```

In a distant land where penguins governed hamsters and teapots dictated foreign policy, Belarus was a silent juggler, dancing amidst the chaos of tangled alliances. Penguins flapped uselessly, debating lunch at 1997 o'clock—"Should it be cappuccino or cooperation camouflaged as coffee beans?"

"Dandelions don't dream of NATO," chimed the faucet gurgling in existential protest. Still, the teapot waffled, keens spatial coordinates precisely predicated on paradox. The cosmic ballet shoes refueling ladders for translucent gizmos tasked with peace or maybe hiccups. Yet altercations ensued when metaphysical hamsters, fuel potent and post-polar, cast suspicions on hedgehogs posturing for electoral profits. Belarus hid their guitar pick, awaiting the irrelevant term zero—which, said clinking primate relics triturating cognitive chaff, is wonderful bubbles manifest.

```

Query: An electrical switched-mode power converter and its operating method.

Target document:

```

Publication Number: 20240146201

Invention Title: AN ELECTRICAL SWITCHED MODE POWER CONVERTER AND OPERATIVE PROCEDURE THEREOF

Abstract: An electrical switched-mode power converter (

Applicant: Differential Power, SL

Inventors:

- Cobos Marquez, Jose Antonio

```

Benchmarking Process

For each embedding model under evaluation, we embedded both the documents and the search queries. We then calculated the recall@k metric, with k ranging from 1 to 10. This let us assess each model's ability to retrieve relevant results within the top k search results.
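
As a rough illustration of this procedure, the sketch below ranks documents by cosine similarity and computes recall@k for every k up to a chosen maximum. It is an illustrative sketch rather than the benchmarking code itself: the toy two-dimensional vectors stand in for real model embeddings, the function names are hypothetical, cosine similarity is an assumed choice of similarity measure, and each query is assumed to have exactly one target document, as in the examples above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )

def recall_curve(query_vecs, doc_vecs, targets, max_k=10):
    """Fraction of queries whose target appears in the top k, for k = 1..max_k."""
    hits = [0] * max_k
    for query_id, query_vec in query_vecs.items():
        ranking = rank_documents(query_vec, doc_vecs)
        position = ranking.index(targets[query_id])  # 0-based rank of the target
        # A target at rank `position` counts as a hit for every k > position.
        for k in range(position + 1, max_k + 1):
            hits[k - 1] += 1
    return [h / len(query_vecs) for h in hits]

# Toy stand-ins for embeddings (assumption: real vectors come from the model).
doc_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
query_vecs = {"q1": [0.9, 0.1], "q2": [0.6, 0.5]}
targets = {"q1": "doc_a", "q2": "doc_b"}

curve = recall_curve(query_vecs, doc_vecs, targets, max_k=3)
```

With these toy vectors, the target for q1 ranks first and the target for q2 ranks third, so the curve is 0.5 at k=1 and k=2 and 1.0 at k=3.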

We ran all experiments in a controlled environment to keep the results consistent and comparable. Every model was run in 16-bit floating-point (FP16) precision on an NVIDIA GeForce RTX 4070 GPU. The models themselves were sourced from the Hugging Face Hub, a widely used platform for state-of-the-art natural language processing models.
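
To give a feel for what 16-bit precision means for embedding values, the sketch below rounds two small vectors to half precision and compares their cosine similarity before and after. This is only an illustration under stated assumptions: it uses Python's standard `struct` module (format code `"e"`) on the CPU rather than the GPU setup described above, and the vectors and the helper name `to_float16` are made up for the example.

```python
import math
import struct

def to_float16(values):
    """Round each value to the nearest IEEE 754 half-precision number."""
    return [struct.unpack("e", struct.pack("e", v))[0] for v in values]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embedding values, chosen only for illustration.
doc = [0.12345678, -0.87654321, 0.33333333, 0.5]
query = [0.1, -0.9, 0.3, 0.45]

full = cosine(query, doc)                          # double-precision inputs
half = cosine(to_float16(query), to_float16(doc))  # inputs rounded to 16 bits
drift = abs(full - half)                           # change caused by rounding
```

Half precision keeps roughly three significant decimal digits per value, so individual components shift slightly (0.1 is no longer stored exactly) while the similarity between the two vectors barely moves.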

Evaluating the Results

The charts below show the Recall@K metric for several models on each dataset.

In this context, an embedding model converts text data into a numerical representation in a high-dimensional space such that similar pieces of text are close to each other. To evaluate the quality of these embeddings, we often need to check how well the model can retrieve relevant texts from a dataset based on their embeddings.

Here's how Recall@K works in this setup:

1. Embedding Generation: Each piece of text in the dataset is converted into an embedding using the model.
2. Query and Retrieval: For a given query text, its embedding is calculated. The system then retrieves the top K most similar text items from the dataset based on their embeddings.
3. Relevance Check: The retrieved items are checked against a ground truth to see how many of them are actually relevant to the query.
4. Recall Calculation: Recall@K is then computed as the number of relevant items retrieved within the top K results divided by the total number of relevant items in the dataset.

For example, suppose we have a dataset where each piece of text has known relevant counterparts. If for a particular query text, there are 10 relevant texts in the dataset and the model retrieves 3 relevant texts within the top 5 results (K=5), the Recall@5 would be 3/10 = 0.3 or 30%.
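
The same arithmetic can be written as a short function. This is a minimal sketch: the name `recall_at_k` and the item ids are hypothetical, and it assumes the retrieved list is already sorted from most to least similar and contains no duplicates.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of all relevant items that appear among the first k retrieved items."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = retrieved[:k]
    found = sum(1 for item in top_k if item in relevant)
    return found / len(relevant)

# Ten relevant texts exist for this query (hypothetical ids).
relevant = {f"rel_{i}" for i in range(10)}

# Ranked search results; three of the first five are relevant.
retrieved = ["rel_0", "other_1", "rel_4", "other_2", "rel_7", "rel_2", "other_3"]

score = recall_at_k(retrieved, relevant, k=5)  # 3 of 10 relevant items found
```

Here `score` is 0.3, matching the 30% in the example. Note that the denominator is the total number of relevant items, not K, so Recall@5 cannot exceed 0.5 when ten relevant texts exist.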

This metric helps in understanding how well the embedding model captures the semantic meaning of the text and places similar texts close to each other in the embedding space. A high Recall@K indicates that the model is effective in embedding the text such that relevant items are easily retrievable within the top K results. This can be particularly useful in applications like document retrieval, question answering, and recommendation systems, where finding relevant text quickly is crucial.

Chart: Recall@k results for the AI-generated dataset (Dataset A)

Chart: Recall@k results for the US patents dataset (Dataset B)

To maintain a focus on models with practical applicability, we filtered out those with very low recall values, as recall is a crucial metric for ensuring accurate knowledge embedding. The remaining models were then evaluated within a zoomed-in recall range of 0.5 to 1 on the y-axis, allowing for a more granular comparison.

Throughout this process, one model consistently stood out: intfloat/e5-large-v2 from Microsoft. This model demonstrated superior performance across both datasets, outperforming our current models and delivering results on par with industry-leading models from OpenAI. Its ability to handle diverse and complex datasets, including the novel AI-generated text in Dataset A, with precision and efficiency is a testament to its robustness and potential for enhancing our knowledge embedding capabilities.

The charts show the recall performance of the evaluated models, with the standout model emerging as a clear frontrunner. Its strong performance on Dataset A highlights its adaptability to unseen data, a critical factor in knowledge management.

While quantitative metrics are essential, we also considered the real-world implications of adopting this top-performing model. Its superior performance translates to improved accuracy and efficiency in our knowledge embedding service, enabling us to deliver more valuable insights to our users, even when dealing with novel or synthetic data.

We are excited to integrate the standout model into our system and anticipate significant improvements in our ability to transform unstructured data into structured insights, regardless of its origin or complexity. This decision represents a milestone in our ongoing pursuit of excellence and our commitment to leveraging cutting-edge technology to provide top-tier knowledge management solutions.

As we embark on this new chapter with the top-performing model, we invite you to join us on this journey of innovation and discovery. Stay tuned for updates as we continue to push the boundaries of what's possible in AI-driven knowledge management, even in the face of novel and challenging data.