In this article I present several metrics for calculating the similarity between sets of items. I have been analyzing these metrics as part of an investigation into improving the targeting of digital advertising campaigns. Set similarity metrics are very useful for addressing the problem of audience expansion.

The basic idea is: we start with a set of users that have engaged with the ad, for example clickers. Then we try to find other similar sets of users that we can target. We have the expectation that, because of their similarity, the targeted users will also engage positively with the ad. These similar users become our expanded audience.

# Jaccard Similarity

The Jaccard Similarity is defined as the size of the intersection divided by the size of the union of the sets.

Given two sets, A and B, the Jaccard Similarity is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard Similarity ranges between zero and one.
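As a minimal sketch of the definition above, the Jaccard Similarity can be computed with Python's built-in set operations. The function name `jaccard_similarity` and the sample user IDs are illustrative only, and returning 0.0 for two empty sets is an assumption (the ratio is undefined in that case).

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Assumption: both sets are empty, so the ratio is undefined; return 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical audiences: users who clicked the ad and a candidate segment.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

# Two shared users out of five distinct users overall.
print(jaccard_similarity(clickers, candidates))  # 0.4
```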

Also called: Jaccard index, Intersection over Union

For more information: Jaccard Similarity

# Sorensen Coefficient

The Sorensen Coefficient equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

Given two sets, X and Y, the Sorensen Coefficient is defined as:

$$S(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

The Sorensen Coefficient ranges between zero and one.
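The same idea can be sketched in Python for the Sorensen Coefficient. The function name `sorensen_coefficient` and the sample sets are illustrative, and returning 0.0 when both sets are empty is an assumption.

```python
def sorensen_coefficient(x: set, y: set) -> float:
    """Twice the number of common elements divided by the sum of the set sizes."""
    total = len(x) + len(y)
    if total == 0:
        # Assumption: both sets are empty, so the ratio is undefined; return 0.0.
        return 0.0
    return 2 * len(x & y) / total


x = {1, 2, 3, 4}
y = {3, 4, 5, 6}

# Two shared elements, four elements in each set: 2 * 2 / (4 + 4).
print(sorensen_coefficient(x, y))  # 0.5
```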

Also called: Sorensen–Dice index, Sorensen index, Dice’s coefficient

For more information: Sorensen Coefficient

# Tversky Index

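For sets X and Y, the Tversky Index is given by:

$$T_{\alpha,\beta}(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X \setminus Y| + \beta\,|Y \setminus X|}$$

Note that $\alpha, \beta \ge 0$ are parameters of the Tversky Index.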

The Tversky Index ranges between zero and one.

The Tversky Index can be seen as a generalization of the Jaccard Similarity and the Sorensen Coefficient:

- Setting $\alpha = \beta = 1$ produces the Jaccard Similarity.
- Setting $\alpha = \beta = 0.5$ produces the Sorensen Coefficient.

Tversky measures with $\alpha + \beta = 1$ are of special interest.
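The following sketch illustrates the Tversky Index and its relationship to the two previous metrics. It assumes the standard form of the index: the size of the intersection divided by that size plus the weighted sizes of the two set differences. The names `tversky_index`, `alpha` and `beta` are chosen here for illustration, and returning 0.0 for a zero denominator is an assumption.

```python
import math


def tversky_index(x: set, y: set, alpha: float, beta: float) -> float:
    """Intersection size over intersection plus weighted set-difference sizes."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    common = len(x & y)
    denominator = common + alpha * len(x - y) + beta * len(y - x)
    if denominator == 0:
        # Assumption: the ratio is undefined here; return 0.0.
        return 0.0
    return common / denominator


x = {1, 2, 3, 4}
y = {3, 4, 5}

# Reference values computed directly from the Jaccard and Sorensen definitions.
jaccard = len(x & y) / len(x | y)
sorensen = 2 * len(x & y) / (len(x) + len(y))

# alpha = beta = 1 matches Jaccard; alpha = beta = 0.5 matches Sorensen.
print(math.isclose(tversky_index(x, y, 1.0, 1.0), jaccard))   # True
print(math.isclose(tversky_index(x, y, 0.5, 0.5), sorensen))  # True
```

Unequal weights make the index asymmetric. With `alpha=1` and `beta=0`, for instance, the result is the fraction of `x` that is also in `y`.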

For more information: Tversky Index

# Overlap Coefficient

The Overlap Coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.

For sets X and Y, the Overlap Coefficient is given by:

$$\mathrm{overlap}(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}$$

If set X is a subset of Y or the converse then the Overlap Coefficient is equal to 1.
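A short Python sketch of the Overlap Coefficient, including the subset case described above. The function name `overlap_coefficient` and the sample sets are illustrative, and returning 0.0 when either set is empty is an assumption.

```python
def overlap_coefficient(x: set, y: set) -> float:
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(x), len(y))
    if smaller == 0:
        # Assumption: with an empty set the ratio is undefined; return 0.0.
        return 0.0
    return len(x & y) / smaller


# A subset always yields 1.0, regardless of how large the other set is.
print(overlap_coefficient({1, 2}, {1, 2, 3, 4}))  # 1.0

# Partial overlap: one shared element, smaller set has three elements.
print(overlap_coefficient({1, 2, 3}, {3, 4, 5, 6}))
```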

Also called: Szymkiewicz–Simpson Coefficient

For more information: Overlap Coefficient

Nice!

Also Hamming Distance between finite sets:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.224.4799

https://stackoverflow.com/questions/29425742/calculate-the-hamming-distance-between-the-two-same-datasets/29428071#29428071

Thanks Lior, very nice!

Not to be nitpicky, but for the overlap coefficient, when you write:

“The Overlap Coefficient is defined as the size of the intersection divided by the smaller of the size of the two sets.”

I think you mean:

“The Overlap Coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.”

Thanks! I fixed it now. You are invited to read and comment on my other blog posts as well. 😉