Embeddings are numerical representations of real-world objects, such as text, images, or audio. They are generated by machine learning models as vectors (arrays of numbers), where the distance between two vectors can be interpreted as the degree of similarity between the objects they represent. While each embedding model may assign its own meaning to each dimension, there is no guarantee that different embedding models assign the same meaning to the same dimensions.
For example, the words "cat", "dog", and "apple" might be embedded into the following vectors:
cat -> (1, -1, 2)
dog -> (1.5, -1.5, 1.8)
apple -> (-1, 2, 0)
These vectors are made up to keep the example simple. Real vectors are much larger; see the Dimension section for details.
Visualizing these vectors as points in 3D space, we can see that "cat" and "dog" sit close together, while "apple" is positioned further away.
Figure 1. Vectors as points in a 3D space
By embedding words and contexts into vectors, we enable systems to assess how related two embedded items are to each other via vector comparison.
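To make the comparison concrete, here is a minimal sketch that measures the Euclidean distance between the made-up vectors above (plain Python, no embedding model involved):

import math

def euclidean_distance(a, b):
    # Straight-line distance between two points; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat = (1, -1, 2)
dog = (1.5, -1.5, 1.8)
apple = (-1, 2, 0)

print(euclidean_distance(cat, dog))    # ~0.73 -> "cat" and "dog" are close
print(euclidean_distance(cat, apple))  # ~4.12 -> "apple" is further away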
The dimension of embeddings refers to the length of the vector representing the object.
In the previous example, we embedded each word into a 3-dimensional vector. However, a 3-dimensional embedding inevitably leads to a massive loss of information. In reality, word embeddings typically require hundreds or thousands of dimensions to capture the nuances of language.
For example, OpenAI's text-embedding-ada-002 model embeds text into 1,536-dimensional vectors:
Figure 2. Using text-embedding-ada-002 to embed the sentence “I have a calico cat.”
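As a rough sketch, the embedding in the figure can be reproduced with the OpenAI Python client, assuming the openai package is installed and an API key is available in the environment:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="I have a calico cat.",
)

vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions for text-embedding-ada-002
print(vector[:5])   # first few values of the embedding vector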
In short, an embedding is a vector that represents a real-world object. The distance between these vectors indicates the similarity between the objects.
Embedding models are subject to a crucial limitation: the token limit, where a token can be a word, punctuation mark, or subword part. This constraint defines the maximum amount of text a model can process in a single input. For instance, the Amazon Titan Text Embeddings models can handle up to 8,192 tokens.
When input text exceeds the limit, the model typically truncates it, discarding the remaining information. This can lead to a loss of context and diminished embedding quality, as crucial details might be omitted.
To address this, several strategies can help mitigate the impact, such as splitting long documents into smaller chunks, summarizing text before embedding it, or trimming less relevant content.
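For example, here is a minimal chunking sketch that splits text on a fixed word budget as a stand-in for real token counting (a production pipeline would use the model's own tokenizer to count tokens exactly):

def chunk_text(text, max_tokens=512, overlap=50):
    # Split text into overlapping chunks so each piece stays under the
    # model's token limit. Words are used as a rough proxy for tokens.
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

# Each chunk is then embedded separately and stored as its own vector.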
Vector databases are optimized for storing embeddings, enabling fast retrieval and similarity search. By calculating the similarity between the query vector and the other vectors in the database, the system returns the vectors with the highest similarity, indicating the most relevant content.
The following diagram illustrates a vector database search. A query vector for the text 'favorite sport' is compared to a set of stored vectors, each representing a text phrase. The nearest neighbor, 'I like football', is returned as the top result.
Figure 3. Vector Query Example
Figure 4. Store Vectors into Database
Figure 5. Retrieve Vectors from Database
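As a rough sketch of this store-and-retrieve flow, the Pinecone Python client could be used as below; the index name, placeholder API key, and tiny 3-dimensional vectors are assumptions for illustration, since a real index must match the embedding model's full dimension:

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key
index = pc.Index("example-index")      # assumed index name

# Store: write a vector together with its original text as metadata.
index.upsert(vectors=[
    {
        "id": "doc-1",
        "values": [0.1, -0.2, 0.3],    # embedding of "I like football."
        "metadata": {"content": "I like football."},
    }
])

# Retrieve: find the stored vector closest to the query embedding.
results = index.query(
    vector=[0.1, -0.1, 0.25],          # embedding of "favorite sport"
    top_k=1,
    include_metadata=True,
)
print(results.matches[0].metadata["content"])  # -> "I like football."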
When working with vector databases, two key parameters come into play: Top K and similarity measure (or distance function).
When querying a vector database, the goal is often to retrieve the items most similar to a given query vector. Top K refers to retrieving the K most similar items based on a similarity metric.
For instance, if you're building a product recommendation system, you might want to find the top 10 products similar to the one a user is currently viewing. In this case, K would be 10. The vector database would return the 10 product vectors closest to the query product's vector.
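As a rough sketch, the Top K lookup can be illustrated with a brute-force search in NumPy (the product vectors and the choice of cosine similarity are assumptions for the example; a real vector database performs this ranking with optimized indexes):

import numpy as np

def top_k_similar(query_vector, stored_vectors, k=10):
    # Rank stored vectors by cosine similarity to the query and return the
    # indices of the K most similar items, highest similarity first.
    stored = np.asarray(stored_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    similarities = stored @ query / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))
    return np.argsort(similarities)[::-1][:k]

# Example: indices of the 10 stored product vectors closest to the product
# the user is currently viewing.
# top_indices = top_k_similar(viewed_product_vector, product_vectors, k=10)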
To determine the similarity between vectors, various distance metrics are employed, including:
Figure 6. Similarity Measures
There are many other similarity measures not listed here. The choice of distance metric depends on the specific application and the nature of the data, so it is recommended to experiment with several similarity metrics to see which one produces the best results.
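As a small illustration, here is how three widely used measures can be computed with NumPy (cosine similarity, Euclidean distance, and dot product are common choices; whether they match the measures shown in the figure is an assumption):

import numpy as np

def cosine_similarity(a, b):
    # Higher means more similar; compares direction only, ignoring vector length.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # Lower means more similar; straight-line distance between the two points.
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def dot_product(a, b):
    # Higher means more similar; sensitive to both direction and vector length.
    return float(np.dot(a, b))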
As of October 2024, SnapLogic supports embedders for the major embedding models and continues to expand this support.
Figure 7. Embed a File
Figure 8. Output of the Embedder Snap
Figure 9. Store a Vector into Database
Figure 10. A Vector in the Pinecone Database
Figure 11. Retrieve Vectors from a Database
[
    {
        "content": "favorite sport"
    }
]
Figure 12. Query Text
Figure 13. All Vectors in the Database
{
    "matches": [
        {
            "id": "db873b4d-81d9-421c-9718-5a2c2bd9e720",
            "score": 0.547461033,
            "values": [],
            "metadata": {
                "content": "I like football."
            }
        }
    ]
}
Figure 14. Pipeline Output: the Closest Neighbor to the Query
Embedders and vector databases are widely used in applications such as Retrieval Augmented Generation (RAG) and chat assistants.
While the focus thus far has been on text embeddings, the concept extends beyond words and sentences. Multimodal embeddings represent a powerful advancement, enabling the representation of various data types, such as images, audio, and video, within a unified vector space. By projecting different modalities into a shared semantic space, complex relationships and interactions between these data types can be explored.
For instance, an image of a cat and the word "cat" might be positioned closely together in a multimodal embedding space, reflecting their semantic similarity. This capability opens up a vast array of possibilities, including image search with text queries, video content understanding, and advanced recommendation systems that consider multiple data modalities.
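As an illustrative sketch, a multimodal model such as CLIP, available through the Hugging Face transformers library, can embed an image and a text label into the same space; the model name and image path below are assumptions for the example:

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed the word "cat" and an image of a cat into the same vector space.
text_inputs = processor(text=["cat"], return_tensors="pt", padding=True)
image_inputs = processor(images=Image.open("cat.jpg"), return_tensors="pt")

with torch.no_grad():
    text_embedding = model.get_text_features(**text_inputs)
    image_embedding = model.get_image_features(**image_inputs)

# Cosine similarity between the text and image embeddings; a high value
# indicates that the image matches the text.
similarity = torch.nn.functional.cosine_similarity(text_embedding, image_embedding)
print(similarity.item())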