In this article I present several metrics for calculating the similarity between sets of items. I've been analyzing these metrics as part of an investigation into improving the targeting of digital advertising campaigns. Set similarity metrics are particularly useful for the problem of audience expansion.

The basic idea is this: we start with a set of users who have engaged with the ad, for example clickers. Then we look for other, similar sets of users that we can target. We expect that, because of their similarity, the targeted users will also engage positively with the ad. These similar users become our expanded audience.

Jaccard Similarity

The Jaccard Similarity is defined as the size of the intersection divided by the size of the union of the sets.

Given two sets, A and B, the Jaccard Similarity is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard Similarity ranges between zero and one.
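As a minimal sketch of the definition above, the Jaccard Similarity can be computed with Python's built-in set operations. The function name and the sample user IDs are made up for illustration, and returning 0.0 when both sets are empty is an assumption of this sketch (the ratio is 0/0 in that case, and conventions differ).

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is 0/0.
        # Returning 0.0 is an assumption of this sketch, not part of the definition.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical user IDs, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

# Intersection is {u3, u4} (2 users); union is {u1, ..., u5} (5 users).
print(jaccard_similarity(clickers, candidates))  # 0.4
```

Identical sets give 1.0 and disjoint sets give 0.0, which matches the stated range.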

Also called: Jaccard index, Intersection over Union

For more information: Jaccard Similarity

Sorensen Coefficient

The Sorensen Coefficient equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

Given two sets, X and Y, the Sorensen Coefficient is defined as:

$$S(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

The Sorensen Coefficient ranges between zero and one.
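The same idea can be sketched for the Sorensen Coefficient: twice the number of shared elements over the sum of the two set sizes. The function name and sample data are hypothetical, and the value returned for two empty sets is again an assumption of the sketch.

```python
def sorensen_coefficient(x: set, y: set) -> float:
    """Twice the number of common elements divided by the sum of the set sizes."""
    total = len(x) + len(y)
    if total == 0:
        # Both sets are empty (0/0); returning 0.0 is an assumption of this sketch.
        return 0.0
    return 2 * len(x & y) / total


# Hypothetical user IDs, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

# 2 shared users, set sizes 4 and 3: 2 * 2 / (4 + 3) = 4/7
print(sorensen_coefficient(clickers, candidates))
```

For the same pair of sets, this value is never lower than the Jaccard Similarity, because the shared elements are counted twice in the numerator.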

Also called: Sorensen–Dice index, Sorensen index, Dice’s coefficient

For more information: Sorensen Coefficient

Tversky Index

For sets X and Y, the Tversky Index is given by:

$$S(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X \setminus Y| + \beta\,|Y \setminus X|}$$

Note that $\alpha, \beta \geq 0$ are parameters of the Tversky Index, and $X \setminus Y$ denotes the elements of X that are not in Y.

The Tversky Index ranges between zero and one.

The Tversky Index can be seen as a generalization of the Jaccard Similarity and the Sorensen Coefficient:

Setting $\alpha = \beta = 1$ produces the Jaccard Similarity.
Setting $\alpha = \beta = 0.5$ produces the Sorensen Coefficient.

Tversky measures with $\alpha + \beta = 1$ are of special interest.
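To make the generalization concrete, here is a small sketch that assumes the usual form of the index: |X ∩ Y| / (|X ∩ Y| + α·|X − Y| + β·|Y − X|), with α, β ≥ 0. The function name, the sample data, and the value returned for a zero denominator are all assumptions of the sketch.

```python
def tversky_index(x: set, y: set, alpha: float, beta: float) -> float:
    """|X & Y| / (|X & Y| + alpha * |X - Y| + beta * |Y - X|)."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    common = len(x & y)
    denominator = common + alpha * len(x - y) + beta * len(y - x)
    if denominator == 0:
        # Degenerate case (0/0); returning 0.0 is an assumption of this sketch.
        return 0.0
    return common / denominator


# Hypothetical user IDs, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

# alpha = beta = 1 gives 2 / (2 + 2 + 1) = 0.4, the Jaccard Similarity of these sets.
print(tversky_index(clickers, candidates, alpha=1.0, beta=1.0))

# alpha = beta = 0.5 gives 2 / (2 + 1 + 0.5) = 4/7, their Sorensen Coefficient.
print(tversky_index(clickers, candidates, alpha=0.5, beta=0.5))

# Unequal weights make the measure asymmetric in X and Y.
print(tversky_index(clickers, candidates, alpha=1.0, beta=0.0))  # 0.5
```

With unequal weights, swapping the two sets generally changes the result, which is the main practical difference from the two symmetric metrics.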

For more information: Tversky Index

Overlap Coefficient

The Overlap Coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.

For sets X and Y, the Overlap Coefficient is given by:

$$\mathrm{overlap}(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}$$

If X is a subset of Y, or Y is a subset of X, then the Overlap Coefficient is equal to 1.
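A short sketch of the Overlap Coefficient, including the subset case, might look like this. The function name and sample data are hypothetical, and returning 0.0 when either set is empty is an assumption of the sketch, since the ratio is 0/0 there.

```python
def overlap_coefficient(x: set, y: set) -> float:
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(x), len(y))
    if smaller == 0:
        # At least one set is empty (0/0); returning 0.0 is an assumption of this sketch.
        return 0.0
    return len(x & y) / smaller


# Hypothetical user IDs, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}
repeat_clickers = {"u1", "u2"}  # a subset of clickers

# 2 shared users, smaller set has 3 users: 2 / 3
print(overlap_coefficient(clickers, candidates))

# A subset always scores 1.0, however large the other set is.
print(overlap_coefficient(clickers, repeat_clickers))  # 1.0
```

Because the denominator ignores the larger set, a small audience fully contained in a much larger one still scores 1.0, whereas its Jaccard Similarity would be low.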

Also called: Szymkiewicz–Simpson Coefficient

For more information: Overlap Coefficient

About Hayim Makabee

Veteran software developer, enthusiastic programmer, author of a book on Object-Oriented Programming, co-founder and CEO at KashKlik, an innovative Influencer Marketing platform.


5 Responses to Data Science: Set Similarity Metrics

Lior Kogan says:

March 3, 2019 at 11:34 am

Nice!

Also Hamming Distance between finite sets:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.224.4799

https://stackoverflow.com/questions/29425742/calculate-the-hamming-distance-between-the-two-same-datasets/29428071#29428071

- Hayim Makabee says:
  
  March 3, 2019 at 11:47 am
  
  Thanks Lior, very nice!
  
vhen says:

March 22, 2019 at 8:39 am

Not to be nitpicky, but for the overlap coefficient, when you write:
“The Overlap Coefficient is defined as the size of the intersection divided by the smaller of the size of the two sets.”
I think you mean:
“The Overlap Coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.”

- Hayim Makabee says:
  
  March 27, 2019 at 11:51 am
  
  Thanks! I fixed it now. You are invited to read and comment on my other blog posts as well. 😉
  
Pingback: String Similarity Metrics: Token Methods | Baeldung on Computer Science
