Skip to main content
Version: User Guides (Cloud)

Similarity Metrics Explained
BETA

Similarity metrics are used to measure similarities among vectors. Choosing an appropriate distance metric helps improve classification and clustering performance significantly.

Currently, Zilliz Cloud supports these types of similarity Metrics: Euclidean distance (L2), Inner product (IP), Cosine similarity (COSINE), JACCARD (Beta), and HAMMING (Beta).

The table below summarizes the mapping between different field types and their corresponding metric types.

Field Type

Dimension Range

Supported Metric Types

Default Metric Type

FLOAT_VECTOR

2-32,768

Cosine, L2, IP

Cosine

FLOAT16_VECTOR (Beta)

2-32,768

Cosine, L2, IP

Cosine

BFLOAT16_VECTOR (Beta)

2-32,768

Cosine, L2, IP

Cosine

SPARSE_FLOAT_VECTOR (Beta)

No need to specify the dimension.

IP

IP

BINARY_VECTOR (Beta)

8-32,768*8

HAMMING (Beta), JACCARD (Beta)

HAMMING (Beta)

📘For vector fields of the `BINARY_VECTOR` type,
  • The dimension value (dim) must be a multiple of 8.

  • The available metric types are HAMMING and JACCARD.

Euclidean distance (L2)

Essentially, Euclidean distance measures the length of a segment that connects 2 points.

The formula for Euclidean distance is as follows:

HtxnbRRwhoXGX8xYTJPcUS0KnCh

where a = (a0, a1,..., an-1) and b = (b0, b1,..., bn-1) are two points in n-dimensional Euclidean space.

It's the most commonly used distance metric and is very useful when the data are continuous.

📘Notes

Zilliz Cloud only calculates the value before applying the square root when Euclidean distance is chosen as the distance metric.

Inner product (IP)

The IP distance between two embeddings is defined as follows:

X2NebUU5OogaZGxl7lwcrZhgnKg

IP is more useful if you need to compare non-normalized data or when you care about magnitude and angle.

📘Notes

If you use IP to calculate similarities between embeddings, you must normalize your embeddings. After normalization, the inner product equals cosine similarity.

Suppose X' is normalized from embedding X:

CQOTbuIS6oqcp8xmIK6csUJdnbh

The correlation between the two embeddings is as follows:

UgebbKhefovb1KxFJfmcg1w1nng

Cosine similarity

Cosine similarity uses the cosine of the angle between two sets of vectors to measure how similar they are. You can think of the two sets of vectors as line segments starting from the same point, such as [0,0,...], but pointing in different directions.

To calculate the cosine similarity between two sets of vectors A = (a0, a1,..., an-1) and B = (b0, b1,..., bn-1), use the following formula:

WkObbyuhJoiG1IxJY59co6M5nag

The cosine similarity is always in the interval [-1, 1]. For example, two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of -1. The larger the cosine, the smaller the angle between the two vectors, indicating that these two vectors are more similar to each other.

By subtracting their cosine similarity from 1, you can get the cosine distance between two vectors.

📘Notes

This is currently in beta. Upgrade your cluster to the beta version to utilize this new similarity metric.

JACCARD distance (Beta)

JACCARD similarity coefficient measures the similarity between two sample sets and is defined as the cardinality of the intersection of the defined sets divided by the cardinality of the union of them. It can only be applied to finite sample sets.

YwyibzK9poGlkJxaddocu3lynJb

JACCARD distance measures the dissimilarity between data sets and is obtained by subtracting the JACCARD similarity coefficient from 1. For binary variables, JACCARD distance is equivalent to the Tanimoto coefficient.

H1aSbG70JorMGXxtXfTcRIO3nng

📘Notes

Currently, this similarity metric is available exclusively for Dedicated clusters that have been upgraded to the Beta version.

HAMMING distance (Beta)

HAMMING distance measures binary data strings. The distance between two strings of equal length is the number of bit positions at which the bits are different.

For example, suppose there are two strings, 1101 1001 and 1001 1101.

11011001 ⊕ 10011101 = 01000100. Since, this contains two 1s, the HAMMING distance, d (11011001, 10011101) = 2.

📘Notes

Currently, this similarity metric is available exclusively for Dedicated clusters that have been upgraded to the Beta version.