Data Science: Set Similarity Metrics

In this article I present several metrics for measuring the similarity between sets of items. I have been analyzing these metrics as part of an investigation into improving the targeting of digital advertising campaigns. Set similarity metrics are particularly useful for the problem of audience expansion.

The basic idea is this: we start with a set of users who have engaged with the ad, for example clickers. Then we look for other, similar sets of users that we can target. We expect that, because of their similarity, the targeted users will also engage positively with the ad. These similar users become our expanded audience.

Jaccard Similarity

The Jaccard Similarity is defined as the size of the intersection divided by the size of the union of the sets.

Given two sets, A and B, the Jaccard Similarity is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard Similarity ranges between zero and one.
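
As a minimal sketch, the definition above maps directly onto Python's built-in set operations. The function name `jaccard` and the user IDs are hypothetical, and returning 1.0 for two empty sets is an assumption of this sketch, since the ratio is undefined in that case.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (0/0 is undefined).
        return 1.0
    return len(a & b) / len(union)


# Hypothetical user IDs: a set of clickers and a candidate audience.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5", "u6"}

# 2 shared users out of 6 distinct users overall.
print(jaccard(clickers, candidates))
```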

Also called: Jaccard index, Intersection over Union

For more information: Jaccard Similarity

Sorensen Coefficient

The Sorensen Coefficient equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

Given two sets, X and Y, the Sorensen Coefficient is defined as:

$$S(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

The Sorensen Coefficient ranges between zero and one.
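
The same definition can be sketched in a few lines of Python. The function name `sorensen` and the sample sets are hypothetical, and the value returned for two empty sets is an assumption, since the ratio is undefined there.

```python
def sorensen(x: set, y: set) -> float:
    """Twice the number of common elements divided by the sum of the set sizes."""
    total = len(x) + len(y)
    if total == 0:
        # Assumption: two empty sets are treated as identical (0/0 is undefined).
        return 1.0
    return 2 * len(x & y) / total


# Hypothetical user IDs.
x = {"u1", "u2", "u3", "u4"}
y = {"u3", "u4", "u5"}

# 2 shared users, 4 + 3 = 7 elements in total.
print(sorensen(x, y))
```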

Also called: Sorensen–Dice index, Sorensen index, Dice’s coefficient

For more information: Sorensen Coefficient

Tversky Index

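For sets X and Y, the Tversky Index is given by:

$$T_{\alpha,\beta}(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X \setminus Y| + \beta\,|Y \setminus X|}$$

Here X \ Y denotes the elements of X that are not in Y. Note that α, β ≥ 0 are parameters of the Tversky Index.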

The Tversky Index ranges between zero and one.

The Tversky Index can be seen as a generalization of the Jaccard Similarity and the Sorensen Coefficient:

  • Setting α = β = 1 produces the Jaccard Similarity.
  • Setting α = β = 0.5 produces the Sorensen Coefficient.

Tversky measures with α + β = 1 are of special interest.
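
A short Python sketch can make the role of the two parameters concrete. It assumes the usual form of the index, in which α weights the elements found only in X and β weights the elements found only in Y. The function name `tversky`, the sample sets, and the handling of a zero denominator are assumptions of this sketch.

```python
def tversky(x: set, y: set, alpha: float, beta: float) -> float:
    """Tversky Index of sets x and y with weights alpha, beta >= 0."""
    common = len(x & y)
    # alpha weights the elements found only in x, beta those found only in y.
    denominator = common + alpha * len(x - y) + beta * len(y - x)
    if denominator == 0:
        # Assumption: the formula is undefined here (0/0). This sketch returns
        # 1.0 for two empty sets and 0.0 in every other zero-denominator case.
        return 1.0 if not x and not y else 0.0
    return common / denominator


# Hypothetical user IDs.
x = {"u1", "u2", "u3", "u4"}
y = {"u3", "u4", "u5"}

print(tversky(x, y, alpha=1.0, beta=1.0))  # same ratio as |X ∩ Y| / |X ∪ Y|
print(tversky(x, y, alpha=0.5, beta=0.5))  # same ratio as 2|X ∩ Y| / (|X| + |Y|)
print(tversky(x, y, alpha=1.0, beta=0.0))  # |X ∩ Y| / |X|: share of x also in y
```

With unequal weights the index is no longer symmetric, so swapping the two sets can change the result.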

For more information: Tversky Index

Overlap Coefficient

The Overlap Coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.

For sets X and Y, the Overlap Coefficient is given by:

$$\operatorname{overlap}(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}$$

If X is a subset of Y, or Y is a subset of X, the Overlap Coefficient equals 1.
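
The subset property is easy to check with a small Python sketch. The function name `overlap` and the sample sets are hypothetical. Returning 1.0 when one of the sets is empty is an assumption: the ratio is undefined in that case, and 1.0 is chosen here only because the empty set is a subset of every set.

```python
def overlap(x: set, y: set) -> float:
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(x), len(y))
    if smaller == 0:
        # Assumption: 0/0 is undefined; the empty set is a subset of any set,
        # so this sketch follows the subset rule and returns 1.0.
        return 1.0
    return len(x & y) / smaller


# Hypothetical user IDs.
small = {"u1", "u2"}
large = {"u1", "u2", "u3", "u4"}
other = {"u2", "u5", "u6"}

print(overlap(small, large))  # small is a subset of large
print(overlap(small, other))  # 1 shared user, smaller set has 2 users
```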

Also called: Szymkiewicz–Simpson Coefficient

For more information: Overlap Coefficient

About Hayim Makabee

Veteran software developer, enthusiastic programmer, author of a book on Object-Oriented Programming, co-founder and CEO at KashKlik, an innovative Influencer Marketing platform.

5 Responses to Data Science: Set Similarity Metrics

  1. vhen says:

    Not to be nitpicky, but for the overlap coefficient, when you write:
    “The Overlap Coefficient is defined as the size of the intersection divided by the smaller of the size of the two sets.”
    I think you mean:
    “The Overlap Coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.”
