This is a crosspost of this question on Stack Overflow. It received no answers there because "it is a math question", and I don't know how to migrate a question, so I am crossposting it here.
There are so many questions about distance distributions and the curse of dimensionality that I did not read them all, but those I did read did not answer my question.
I want to create the distance distribution of uniformly distributed points in different dimensions, to ultimately visualize the curse of dimensionality. My understanding is that the distribution goes from a flat curve in low dimensions to a sharply peaked curve in high dimensions. They look like this:
Mean increases, variance decreases. Don't mind the numbers; this is just a sketch (using Gaussian distributions).
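To make that expectation concrete, here is a small sketch of my own (not from any paper, assuming uniform points in the unit cube) that prints the mean and variance of the pairwise distances per dimension:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
stats = {}
for dim in [2, 10, 100, 1000]:
    points = rng.uniform(size=(1000, dim))   # uniform points in the unit cube
    dists = distance.pdist(points)           # all pairwise distances
    stats[dim] = (dists.mean(), dists.var())
    print('dim {0}: mean={1:.3f}, var={2:.4f}'.format(dim, *stats[dim]))
```

The mean clearly grows with the dimension, which is the behavior the sketch above predicts.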
I have seen corresponding figures in papers. (I would be able to show and cite them, but I am not sure about copyrights.) I want to recreate those figures. But when I create random points, calculate the distances, and put them into a histogram, it looks like this:
The first one is the plain histogram of the distances. In the second one, the bins are scaled by the maximum possible distance for the respective dimension. In the third one I overlaid them by subtracting the value of the lowest non-empty bin. The first one obviously shows the increasing means, but neither increasing height nor decreasing variance. In the second one you can see the concentration of the distances, but not the heights or means. This is not what I saw in those papers. The third one indicates that the curves all have the same shape. Again, not what I expected. This is the code I used to create the second figure.
    import math
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial import distance

    if __name__ == '__main__':
        records = 1000
        binwidth = 0.01
        precision = 10
        fig, (ax1, ax2, ax3) = plt.subplots(3, 1)
        fig.set_size_inches(6, 10)
        for dimensions in [2, 10, 100, 1000]:
            data = np.random.uniform(low=0, high=1, size=(records, dimensions))
            data = np.around(data, decimals=precision)
            distlist = distance.pdist(data)          # all pairwise distances
            variance = np.var(distlist)
            print('dim {dim}: var={var}'.format(dim=dimensions, var=variance))
            maxdist = math.sqrt(dimensions)          # diameter of the unit cube
            bins = [round(x * binwidth, 2) for x in range(round(maxdist / binwidth) + 2)]
            histo = np.histogram(distlist, bins)
            # pair each bin's upper edge with its count, then drop empty bins
            harr = np.stack((histo[1][1:], histo[0]), axis=-1)
            masked = harr[harr[:, 1] > 0]
            ax1.plot(masked[:, 0], masked[:, 1], label='dim {dim}'.format(dim=dimensions))
            ax2.plot(masked[:, 0] / maxdist, masked[:, 1], label='dim {dim}'.format(dim=dimensions))
            lowest = np.min(masked[:, 0])            # avoid shadowing the built-in min
            ax3.plot(masked[:, 0] - lowest, masked[:, 1], label='dim {dim}'.format(dim=dimensions))
        ax1.set_title('histogram')
        ax1.legend()
        ax2.set_title('histogram, bins divided by max distance')
        ax2.legend()
        ax3.set_title('histogram, bins shifted by the lowest non-empty bin')
        ax3.legend()
        plt.show()
I see two points where I may be wrong:
- my understanding of distance distributions
- the code does not do what it is intended to do
How do I draw a histogram with bins of width 0.01 showing the number of distances between pairs of uniformly distributed points in the unit cube?
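For reference, here is a minimal version of that histogram as I understand the question (a sketch with my own variable names, one dimension only), using explicit bin edges of width 0.01 up to the cube's diameter:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance

dim = 10
pts = np.random.default_rng(0).uniform(size=(1000, dim))
dists = distance.pdist(pts)                       # all pairwise distances
# bin edges of width 0.01 covering [0, sqrt(dim)], the maximum possible distance
edges = np.arange(0, dim ** 0.5 + 0.01, 0.01)
plt.hist(dists, bins=edges)
plt.xlabel('distance')
plt.ylabel('number of pairs')
plt.show()
```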
EDIT: OK, thanks to @Henry, it seems that what I did was correct, but not what I should have done to produce the results I wanted. What I want is to produce the following figures myself.
The image is taken from
Pestov, Vladimir (2007): Intrinsic dimension of a dataset: what properties does one expect? In: 2007 IEEE International Joint Conference on Neural Networks Proceedings. International Joint Conference on Neural Networks.
Fig. 14 of
Chávez, Edgar; Navarro, Gonzalo; Baeza-Yates, Ricardo; Marroquín, José Luis (2001): Searching in metric spaces. In: ACM Computing Surveys 33 (3), pp. 273–321. DOI: 10.1145/502807.502808.
is a similar one.
The papers do not describe how these figures were created. How can I do this?
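One way that seems to match such figures (a sketch of my own, not necessarily the authors' exact procedure): divide each distance by the diameter sqrt(d) of the unit cube, and draw a density histogram on that common [0, 1] axis, so the bin width is fixed relative to the maximum distance. The peaks then grow taller and narrower with dimension:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance

rng = np.random.default_rng(0)
edges = np.arange(0, 1.01, 0.01)           # fixed bins on the normalized axis
for dim in [2, 10, 100, 1000]:
    pts = rng.uniform(size=(1000, dim))
    d = distance.pdist(pts) / dim ** 0.5   # scale by the diameter sqrt(dim)
    # density=True turns counts into a probability density, so curves for
    # different dimensions are comparable; higher dim -> taller, narrower peak
    plt.hist(d, bins=edges, density=True, histtype='step',
             label='dim {0}'.format(dim))
plt.xlabel('distance / max distance')
plt.legend()
plt.show()
```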