In computer science, it is often important to compute distances to assess the similarity or dissimilarity between two points. Conceptually speaking, we could ask how close two cities are. That is easy because we're used to computing physical distances in km. However, how would you assess the similarity between a man in his 30s with a diagnosis of diabetes and a woman in her 50s with a diagnosis of kidney disease? There are many different methods, which I will list over time.
Gower's distance is a metric for assessing the dissimilarity between two records. The beauty of this method is that it can combine different data types (binary, numerical, categorical) in a single score. The distance ranges from 0 (the minimum) to 1 (the maximum).
- Binary variables: Binary variables are scored with the Dice metric. That is, the per-feature similarity is 1 when both values are 1 and 0 in every other case.
- Categorical variables: Categorical variables are first one-hot encoded, which turns them into binary variables, and are then processed as above.
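- Quantitative variables: If the values are numeric, the absolute difference between A and B is computed and then divided by the range of the variable (its maximum minus its minimum over all records). This gives a per-feature distance between 0 and 1, and the per-feature similarity is 1 minus that value.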
Finally, we add up the per-feature similarity scores and divide the total by the number of features to obtain the overall similarity. If we want the distance, we take 1 minus the similarity.
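To make the procedure concrete, here is a minimal sketch in Python. It is an illustration under a few assumptions, not a reference implementation: the function name `gower_distance`, the `kinds` labels and the `ranges` argument are all made up for this example. Numeric features are normalized by the range of the variable, as in Gower's standard formulation. Categorical features are compared by simple equality and counted as one feature each, which is a shortcut that replaces the one-hot encoding step. Every feature is counted in the average, including binary pairs where both values are 0.

```python
def gower_distance(a, b, kinds, ranges):
    """Sketch of a Gower-style distance between two records.

    a, b   : sequences of feature values (same length, same order)
    kinds  : "binary", "categorical" or "numeric" for each feature
    ranges : max - min of each numeric feature over the whole dataset
             (ignored for non-numeric features, so None is fine there)
    """
    scores = []  # per-feature similarity scores, each between 0 and 1
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "binary":
            # Dice-style match: only a shared 1 counts as similar.
            scores.append(1.0 if (x == 1 and y == 1) else 0.0)
        elif kind == "categorical":
            # Assumption: equality check instead of one-hot encoding.
            scores.append(1.0 if x == y else 0.0)
        elif kind == "numeric":
            # A constant feature (range 0) cannot separate two records.
            scores.append(1.0 if rng == 0 else 1.0 - abs(x - y) / rng)
        else:
            raise ValueError(f"unknown feature kind: {kind}")
    similarity = sum(scores) / len(scores)
    return 1.0 - similarity


# Hypothetical records: (age, sex, has_diabetes)
kinds = ["numeric", "categorical", "binary"]
ranges = [40, None, None]  # assumes ages in the dataset span 40 years
man = (35, "M", 1)
woman = (55, "F", 1)

d = gower_distance(man, woman, kinds, ranges)
print(d)  # 0.5
```

In this example the age score is 1 - 20/40 = 0.5, the sex score is 0 and the diabetes score is 1, so the similarity is 1.5 / 3 = 0.5 and the distance is also 0.5.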