## Data Science: Set Similarity Metrics

In this article I present several metrics for calculating the similarity between sets of items. I have been analyzing these metrics as part of an investigation into improving the targeting of digital advertising campaigns. Set similarity metrics are particularly useful for the problem of audience expansion.

The basic idea is: we start with a set of users that have engaged with the ad, for example clickers. Then we try to find other, similar sets of users that we can target. We expect that, because of their similarity, the targeted users will also engage positively with the ad. These similar users become our expanded audience.

# Jaccard Similarity

The Jaccard Similarity is defined as the size of the intersection divided by the size of the union of the sets.

Given two sets, A and B, the Jaccard Similarity is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard Similarity ranges between zero and one.

Also called: Jaccard index, Intersection over Union
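
As a minimal sketch, the definition above maps directly onto Python's built-in set operations. The function name `jaccard_similarity`, the user IDs, and the choice to return 0.0 when both sets are empty are my own assumptions for illustration; the definition itself leaves that case undefined.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Both sets are empty: the ratio is 0/0. Returning 0.0 is an
        # arbitrary convention chosen for this sketch.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical user IDs, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

# 2 users in common, 5 distinct users overall.
print(jaccard_similarity(clickers, candidates))  # 0.4
```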

# Sorensen Coefficient

The Sorensen Coefficient equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

Given two sets, X and Y, the Sorensen Coefficient is defined as:

$$S(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

The Sorensen Coefficient ranges between zero and one.

Also called: Sorensen–Dice index, Sorensen index, Dice’s coefficient
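
A short sketch of the same computation in Python follows. The function name `sorensen_coefficient`, the sample user IDs, and the 0.0 result for two empty sets are assumptions made for this example.

```python
def sorensen_coefficient(x: set, y: set) -> float:
    """Twice the number of common elements divided by the sum of the set sizes."""
    total = len(x) + len(y)
    if total == 0:
        # Both sets are empty: the ratio is 0/0. Returning 0.0 is an
        # arbitrary convention chosen for this sketch.
        return 0.0
    return 2 * len(x & y) / total


# Hypothetical user IDs, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

# 2 users in common, set sizes 4 and 3: 2 * 2 / 7.
print(sorensen_coefficient(clickers, candidates))  # 4 / 7, about 0.571
```

On the same pair of sets this gives a higher value than the intersection-over-union ratio (0.4), because the common elements are counted twice in the numerator.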

# Tversky Index

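For sets X and Y, the Tversky Index is given by:

$$T(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X \setminus Y| + \beta\,|Y \setminus X|}$$

Note that $\alpha, \beta \geq 0$ are parameters of the Tversky Index.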

The Tversky Index ranges between zero and one.

The Tversky Index can be seen as a generalization of the Jaccard Similarity and the Sorensen Coefficient:

• Setting $\alpha = \beta = 1$ produces the Jaccard Similarity.
• Setting $\alpha = \beta = 0.5$ produces the Sorensen Coefficient.

Tversky measures with $\alpha + \beta = 1$ are of special interest.
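
The sketch below assumes the standard form of the Tversky Index: the number of common elements divided by that number plus `alpha` times the elements found only in X plus `beta` times the elements found only in Y. The function name `tversky_index`, the sample data, and the 0.0 result for a zero denominator are my own choices for illustration.

```python
def tversky_index(x: set, y: set, alpha: float, beta: float) -> float:
    """|X & Y| / (|X & Y| + alpha * |X - Y| + beta * |Y - X|)."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    common = len(x & y)
    denominator = common + alpha * len(x - y) + beta * len(y - x)
    if denominator == 0:
        # The ratio is 0/0 here. Returning 0.0 is an arbitrary
        # convention chosen for this sketch.
        return 0.0
    return common / denominator


# Hypothetical user IDs, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

print(tversky_index(clickers, candidates, 1, 1))      # 0.4, same as Jaccard
print(tversky_index(clickers, candidates, 0.5, 0.5))  # 4 / 7, same as Sorensen
# Unequal weights make the index asymmetric in X and Y.
print(tversky_index(clickers, candidates, 1, 0))      # 0.5
print(tversky_index(clickers, candidates, 0, 1))      # 2 / 3, about 0.667
```

The last two calls show what the parameters control: with `alpha = 1, beta = 0` only the elements missing from the second set are penalized, and with `alpha = 0, beta = 1` only the elements missing from the first.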

# Overlap Coefficient

The Overlap Coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.

For sets X and Y, the Overlap Coefficient is given by:

$$\operatorname{overlap}(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}$$

If set X is a subset of Y or the converse then the Overlap Coefficient is equal to 1.
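
Here is a minimal Python sketch of this definition, including the subset case. The function name `overlap_coefficient`, the sample user IDs, and the 0.0 result when either set is empty are assumptions made for this example.

```python
def overlap_coefficient(x: set, y: set) -> float:
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(x), len(y))
    if smaller == 0:
        # At least one set is empty, so the denominator is zero.
        # Returning 0.0 is an arbitrary convention chosen for this sketch.
        return 0.0
    return len(x & y) / smaller


# Hypothetical user IDs, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}
subset = {"u3", "u4"}

print(overlap_coefficient(clickers, candidates))  # 2 / 3, about 0.667
print(overlap_coefficient(clickers, subset))      # 1.0, subset is contained in clickers
```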

Also called: Szymkiewicz–Simpson Coefficient

Veteran software developer, enthusiastic programmer, author of a book on Object-Oriented Programming, co-founder and CEO at KashKlik, an innovative Influencer Marketing platform.

### 4 Responses to Data Science: Set Similarity Metrics

1. Lior Kogan says:
• Hayim Makabee says:

Thanks Lior, very nice!

2. vhen says:

Not to be nitpicky, but for the overlap coefficient, when you write:
“The Overlap Coefficient is defined as the size of the intersection divided by the smaller of the size of the two sets.”
I think you mean:
“The Overlap Coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.”

• Hayim Makabee says:

Thanks! I fixed it now. You are invited to read and comment on my other blog posts as well. 😉