Here I explore an issue where sklearn's paired_cosine_distances
function returns erroneous values when one of the paired vectors has zero norm.
import numpy as np
from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances
from sklearn.feature_extraction.text import TfidfVectorizer

# Paired rows; the third pair puts the zero vector [0, 0] against [0, 1].
X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]])
Y = np.array([[1, 1], [0, 1], [0, 1], [0, 1]])

paired_cosine_distances(X, Y)
# Outputs: array([0. , 0. , 0.5, 1. ])

# Row-wise dot products between the paired vectors
(X * Y).sum(axis=-1)
# Outputs: array([2, 1, 0, 0])
The dot product between [0, 0] and [0, 1] is zero, so the cosine similarity should also be zero (or, strictly speaking, undefined, since [0, 0] has zero norm).
However, sklearn's paired_cosine_distances reports a distance of 0.5 for this pair, which implies a cosine similarity of 0.5 and is not realistic.
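To make the zero-norm problem concrete, here is a minimal sketch of the textbook cosine formula applied to that third pair (plain NumPy, not sklearn's code path); the variable names are just for illustration:

import numpy as np

a = np.array([0.0, 0.0])  # the zero-norm vector from the third pair
b = np.array([0.0, 1.0])

dot = float(a @ b)                                           # 0.0
norm_product = float(np.linalg.norm(a) * np.linalg.norm(b))  # 0.0 * 1.0 = 0.0

# Textbook cosine similarity is dot / (||a|| * ||b||), which here is 0/0: undefined.
cos_sim = dot / norm_product if norm_product > 0 else float("nan")
print(cos_sim)  # nan -> the similarity (and the distance 1 - sim) is undefined, not 0.5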
The r