How Do Vector Embeddings and Vector Databases Work?
Vector embeddings are a way to convert words, sentences, and other data into numbers that capture their meaning and relationships.
The term "vector" just refers to an array of numbers with a specific dimensionality.
In the case of vector embeddings, these vectors represent data points in a continuous space. "Embeddings," in turn, refers to the technique of representing data as vectors in a way that captures meaningful information, semantic relationships, or contextual characteristics.
Embeddings are designed to capture the underlying structure or properties of the data and are typically learned through training algorithms or models.
Types of vector embeddings
Word embeddings : Word to Vector : We use libraries or algorithms like Word2Vec, GloVe, BERT, and fastText
Sentence embeddings : Sentence to Vector : averaging word vectors (a bag-of-words approach), sequence models such as RNNs or LSTMs, or SentenceTransformer models
Document embeddings : Document to Vector : Libraries like Doc2Vec
Image embeddings : Techniques like convolutional neural networks (CNNs) and pre-trained models like ResNet and VGG generate image embeddings for tasks like image classification, object detection, and image similarity.
User embeddings : represent users in a system or platform as vectors. They capture user preferences, behaviors, and characteristics. User embeddings can be used in everything from recommendation systems to personalized marketing as well as user segmentation.
Product embeddings : represent products in ecommerce or recommendation systems as vectors. They capture a product’s attributes, features, and any other semantic information available. Algorithms can then use these embeddings to compare, recommend, and analyze products based on their vector representations.
How are vector embeddings created?
Vector embeddings are created through a machine learning process in which a model is trained to convert pieces of data into numerical vectors.
First, gather a large dataset that represents the type of data we want to create embeddings for, such as text or images.
Next, we preprocess the data. This involves cleaning and preparing it by removing noise, normalizing text, resizing images, or performing other tasks depending on the type of data we are working with.
We then select a neural network model that is a good fit for the data and our goals, and feed the preprocessed data into it.
The model learns patterns and relationships within the data by adjusting its internal parameters during training. For example, it learns to associate words that often appear together or to recognize visual features in images.
As the model learns, it generates numerical vectors (or embeddings) that represent the meaning or characteristics of the data. Each data point, such as a word or an image, is represented by a unique vector.
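The steps above can be sketched end to end on a toy corpus. Instead of a neural network, this sketch factorizes a word co-occurrence matrix with SVD, a classic count-based way to obtain dense word vectors; the corpus, window size, and dimensionality are all illustrative assumptions, not a production recipe.

```python
import numpy as np

# Toy corpus (preprocessing here is just lowercase + whitespace split).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Build a co-occurrence matrix with a +/-1 word window: words that
# appear together get higher counts, encoding their relationships.
cooc = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if i != j:
                cooc[index[w], index[sent[j]]] += 1

# Truncated SVD: keep the top-2 singular directions as 2-d embeddings.
U, S, _ = np.linalg.svd(cooc)
embeddings = U[:, :2] * S[:2]
print({w: embeddings[index[w]].round(2) for w in ["cat", "dog"]})
```

Real embedding models learn much higher-dimensional vectors from far larger corpora, but the principle is the same: each word ends up with a unique vector shaped by the contexts it appears in.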
What does vector embedding look like?
The length, or dimensionality, of the vector depends on the specific embedding technique we are using and on how we want the data to be represented. For example, word embeddings often have dimensions ranging from a few hundred to a few thousand, which is far too many for humans to visualize directly.
Sentence or document embeddings may use comparable or higher dimensions because they capture more complex semantic information.
The vector embedding itself is typically represented as a sequence of numbers, such as [0.2, 0.8, -0.4, 0.6, ...].
Each number in the sequence corresponds to a specific feature or dimension and contributes to the overall representation of the data point.
The actual numbers within the vector are not meaningful on their own. It is the relative values and relationships between the numbers that capture the semantic information and allow algorithms to process and analyze the data effectively.
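This claim can be checked directly: scaling an embedding changes every individual number in it but not its direction, so angle-based comparisons between vectors are unaffected. A minimal sketch with made-up vectors:

```python
import numpy as np

# Two made-up embeddings (illustrative values only).
a = np.array([0.2, 0.8, -0.4, 0.6])
b = np.array([0.1, 0.9, -0.2, 0.5])

def cos(u, v):
    # Cosine of the angle between u and v.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(a, b))      # similarity between a and b
print(cos(3 * a, b))  # same value: scaling a changes its numbers, not its direction
```

The two printed values are identical, showing that it is the relative geometry of the vectors, not the raw numbers, that carries the information.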
Now let's understand cosine similarity, a common way to compare two embeddings. It measures the cosine of the angle between two vectors and ranges from -1 (opposite directions) to 1 (same direction).
Let's try to understand this with an example:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Word embeddings for two words (made-up values for illustration)
word1_embedding = np.array([0.2, 0.5, 0.8, 0.3])  # Apple
word2_embedding = np.array([0.4, 0.1, 0.9, 0.5])  # Orange

# Reshape to the 2-D shape (n_samples, n_features) that
# cosine_similarity expects
word1_embedding = word1_embedding.reshape(1, -1)
word2_embedding = word2_embedding.reshape(1, -1)

# Calculate cosine similarity; the result is a 1x1 matrix
similarity = cosine_similarity(word1_embedding, word2_embedding)
print(similarity[0][0])  # ~0.89, i.e., the vectors point in similar directions
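Under the hood, cosine similarity is just the dot product of the two vectors divided by the product of their norms. A quick sketch with plain NumPy, using the same made-up embeddings:

```python
import numpy as np

# Same illustrative (made-up) embeddings as above.
a = np.array([0.2, 0.5, 0.8, 0.3])  # "Apple"
b = np.array([0.4, 0.1, 0.9, 0.5])  # "Orange"

# cos(theta) = (a . b) / (|a| * |b|)
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(similarity, 3))
```

This computes the same value as scikit-learn's cosine_similarity, roughly 0.89 here.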
A vector database is optimized for storing and querying vectors.
Vector Indexing for Approximate Nearest Neighbor Approach
Vector indexing is the process of organizing vector embeddings in a way that data can be retrieved efficiently.
Brute Force Approach :
When we want to find the closest items to our query vector, the brute force approach would be to use the k-Nearest Neighbors (kNN) algorithm.
But calculating the similarity between the query vector and every entry in the vector database becomes computationally expensive with millions or even billions of data points, because the number of required calculations grows linearly, O(n), with the number of data points (and also scales with their dimensionality).
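The brute-force scan described above can be sketched in a few lines of NumPy. The database, sizes, and query here are all made up for illustration; the key point is that every row must be compared against the query.

```python
import numpy as np

# A brute-force k-nearest-neighbor search: an O(n) scan over the
# whole "database" for every query (data is randomly generated).
rng = np.random.default_rng(0)
database = rng.normal(size=(10_000, 64))  # 10k vectors, 64-d
query = rng.normal(size=64)

# Cosine similarity of the query against every row.
sims = database @ query / (
    np.linalg.norm(database, axis=1) * np.linalg.norm(query)
)

# Indices of the k most similar vectors.
k = 5
top_k = np.argsort(-sims)[:k]
print(top_k, sims[top_k])
```

Doubling the database doubles the work per query, which is exactly why this approach stops scaling.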
What Is a More Efficient Approach?
A more efficient solution to find similar objects is to use an approximate nearest neighbor (ANN) approach. The underlying idea is to pre-calculate the distances between the vector embeddings and organize and store similar vectors close to each other (e.g., in clusters or a graph), so that you can later find similar objects faster. This process is called vector indexing.
These speed gains are traded for some accuracy, because an ANN approach returns only approximate results.
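The pre-clustering idea can be illustrated with a toy inverted-file (IVF-style) index: assign vectors to k-means centroids ahead of time, then at query time scan only the cluster(s) nearest the query. All data, cluster counts, and iteration counts below are illustrative assumptions, not tuned values.

```python
import numpy as np

# Toy clustering-based ANN index over randomly generated vectors.
rng = np.random.default_rng(1)
database = rng.normal(size=(5_000, 32))

# "Index build": a few crude k-means iterations to place centroids.
n_clusters = 16
centroids = database[rng.choice(len(database), n_clusters, replace=False)]
for _ in range(5):
    assign = np.argmin(
        np.linalg.norm(database[:, None] - centroids[None], axis=2), axis=1
    )
    for c in range(n_clusters):
        if (assign == c).any():
            centroids[c] = database[assign == c].mean(axis=0)

# "Query": scan only the nearest non-empty cluster, not the whole database.
query = rng.normal(size=32)
order = np.argsort(np.linalg.norm(centroids - query, axis=1))
for c in order:
    candidates = np.where(assign == c)[0]
    if len(candidates):
        break
dists = np.linalg.norm(database[candidates] - query, axis=1)
approx_nn = candidates[np.argmin(dists)]
print(approx_nn, len(candidates), "vectors scanned instead of", len(database))
```

The true nearest neighbor may sit just across a cluster boundary, which is the accuracy that gets traded for the smaller scan; real systems mitigate this by probing several clusters.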
ANN Algorithms such as Hierarchical Navigable Small World (HNSW)
There are several ANN algorithms to index the vectors, which can be categorized into the following groups:
Clustering-based index (e.g., IVF, as implemented in FAISS)
Proximity graph-based index (e.g., HNSW)
Tree-based index (e.g., ANNOY)
Hash-based index (e.g., LSH)
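Of the four families above, the hash-based one is perhaps the simplest to sketch: random-hyperplane locality-sensitive hashing (LSH) gives each vector a bit pattern from the sign of its projection onto random hyperplanes, and vectors sharing a bucket are likely to point in similar directions. Sizes and the number of hyperplanes below are illustrative assumptions.

```python
import numpy as np

# Toy hash-based (LSH) index over randomly generated vectors.
rng = np.random.default_rng(2)
database = rng.normal(size=(2_000, 16))
planes = rng.normal(size=(8, 16))  # 8 random hyperplanes -> 8-bit hash

def bucket(v):
    # Hash = sign pattern of the vector against each hyperplane.
    bits = (planes @ v > 0).astype(int)
    return int("".join(map(str, bits)), 2)

# "Index build": group every vector by its hash bucket.
buckets = {}
for i, v in enumerate(database):
    buckets.setdefault(bucket(v), []).append(i)

# "Query": only the query's own bucket is scanned.
query = rng.normal(size=16)
candidates = buckets.get(bucket(query), [])
print(len(candidates), "candidates out of", len(database))
```

Each index family makes this same trade in a different way: a much smaller candidate set per query in exchange for occasionally missing the true nearest neighbor.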
This is how querying over vector databases works.