In this article I present several metrics for calculating the similarity between sets of items. I’ve been analyzing these metrics as part of an investigation into improving the targeting of digital advertising campaigns. Set similarity metrics are very useful for addressing the problem of audience expansion.
The basic idea is: we start with a set of users who have engaged with the ad, for example clickers. Then we try to find other, similar sets of users that we can target. We expect that, because of their similarity, the targeted users will also engage positively with the ad. These similar users become our expanded audience.
Jaccard Similarity
The Jaccard Similarity is defined as the size of the intersection divided by the size of the union of the sets.
Given two sets, A and B, the Jaccard Similarity is defined as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The Jaccard Similarity ranges between zero and one.
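To make the definition concrete, here is a minimal Python sketch using built-in sets. The function name, the sample user IDs, and the choice to return 1.0 when both sets are empty are my own assumptions for illustration, not part of the definition.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is 0/0.
        # Returning 1.0 here is an assumed convention.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical audiences: users who clicked vs. a candidate segment.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

# 2 shared users out of 5 distinct users overall.
print(jaccard_similarity(clickers, candidates))  # 0.4
```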
Also called: Jaccard index, Intersection over Union
For more information: Jaccard Similarity
Sorensen Coefficient
The Sorensen Coefficient equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.
Given two sets, X and Y, the Sorensen Coefficient is defined as:
$$SC(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$
The Sorensen Coefficient ranges between zero and one.
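A short Python sketch of the same computation follows. The function name and sample data are hypothetical, and treating two empty sets as fully similar is an assumed convention.

```python
def sorensen_coefficient(x: set, y: set) -> float:
    """Twice the number of shared elements over the sum of the set sizes."""
    total = len(x) + len(y)
    if total == 0:
        # Both sets are empty (0/0); returning 1.0 is an assumed convention.
        return 1.0
    return 2 * len(x & y) / total


# Hypothetical audiences, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

# 2 shared users; set sizes are 4 and 3, so the result is 4/7.
print(round(sorensen_coefficient(clickers, candidates), 4))  # 0.5714
```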
Also called: Sorensen–Dice index, Sorensen index, Dice’s coefficient
For more information: Sorensen Coefficient
Tversky Index
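For sets X and Y, the Tversky Index is given by:
$$S(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha\,|X \setminus Y| + \beta\,|Y \setminus X|}$$
Note that $\alpha, \beta \ge 0$ are parameters of the Tversky Index.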
The Tversky Index ranges between zero and one.
The Tversky Index can be seen as a generalization of the Jaccard Similarity and the Sorensen Coefficient:
- Setting $\alpha = \beta = 1$ produces the Jaccard Similarity.
- Setting $\alpha = \beta = 0.5$ produces the Sorensen Coefficient.
Tversky measures with $\alpha + \beta = 1$ are of special interest.
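The sketch below implements the Tversky Index in Python and checks the two special cases numerically: weights of 1 and 1 reproduce the Jaccard value, and weights of 0.5 and 0.5 reproduce the Sorensen value. The function name, parameter names, sample data, and the choice to return 0.0 when the denominator is zero are assumptions made for illustration.

```python
def tversky_index(x: set, y: set, alpha: float, beta: float) -> float:
    """Tversky Index with non-negative weights alpha and beta."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    common = len(x & y)
    only_x = len(x - y)  # elements in X but not in Y
    only_y = len(y - x)  # elements in Y but not in X
    denominator = common + alpha * only_x + beta * only_y
    if denominator == 0:
        # Degenerate case (for example, both sets empty);
        # returning 0.0 is an assumed convention.
        return 0.0
    return common / denominator


# Hypothetical audiences, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

# alpha = beta = 1 gives the Jaccard value: 2 / (2 + 2 + 1).
print(tversky_index(clickers, candidates, 1.0, 1.0))  # 0.4

# alpha = beta = 0.5 gives the Sorensen value: 2 / (2 + 1 + 0.5) = 4/7.
print(round(tversky_index(clickers, candidates, 0.5, 0.5), 4))  # 0.5714
```

Unequal weights make the measure asymmetric: with alpha = 1 and beta = 0, only the elements unique to the first set reduce the score.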
For more information: Tversky Index
Overlap Coefficient
The Overlap Coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.
For sets X and Y, the Overlap Coefficient is given by:
$$\operatorname{overlap}(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}$$
If X is a subset of Y, or Y is a subset of X, then the Overlap Coefficient is equal to 1.
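As a rough Python sketch, the Overlap Coefficient and the subset property can be checked as follows. The function name and sample data are hypothetical, and raising an error for an empty set is my own choice, since the denominator would be zero in that case.

```python
def overlap_coefficient(x: set, y: set) -> float:
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(x), len(y))
    if smaller == 0:
        # The ratio is undefined when either set is empty;
        # raising here is an assumed design choice.
        raise ValueError("overlap coefficient is undefined for an empty set")
    return len(x & y) / smaller


# Hypothetical audiences, for illustration only.
clickers = {"u1", "u2", "u3", "u4"}
candidates = {"u3", "u4", "u5"}

# 2 shared users; the smaller set has 3 users, so the result is 2/3.
print(round(overlap_coefficient(clickers, candidates), 4))  # 0.6667

# A subset of the clickers yields exactly 1.
print(overlap_coefficient({"u3", "u4"}, clickers))  # 1.0
```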
Also called: Szymkiewicz–Simpson Coefficient
For more information: Overlap Coefficient
Nice!
Also Hamming Distance between finite sets:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.224.4799
https://stackoverflow.com/questions/29425742/calculate-the-hamming-distance-between-the-two-same-datasets/29428071#29428071
Thanks Lior, very nice!
Not to be nitpicky, but for the overlap coefficient, when you write:
“The Overlap Coefficient is defined as the size of the intersection divided by the smaller of the size of the two sets.”
I think you mean:
“The Overlap Coefficient is defined as the size of the intersection divided by the size of the smaller of the two sets.”
Thanks! I fixed it now. You are invited to read and comment on my other blog posts as well. 😉