This is a crosspost of this question on Stack Overflow. It received no answers there because "it is a math question", and I don't know how to migrate a question, so I am crossposting it here.
There are so many questions about distance distributions and the curse of dimensionality that I did not read them all, but those I did read did not answer my question.
I want to create the distance distribution of uniformly distributed points in different dimensions, to ultimately visualize the curse of dimensionality. My understanding is that the distribution goes from a flat curve in low dimensions to a sharply peaked curve in high dimensions. They look like this:
Mean increases, variance decreases. Don't mind the numbers; this is just a sketch (using Gaussian distributions).
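To make that expectation concrete, here is a small sketch of my own (not from any paper, assuming uniform points in the unit cube) that prints the mean and variance of the pairwise distances per dimension:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
stats = {}
for dim in [2, 10, 100, 1000]:
    points = rng.uniform(size=(1000, dim))   # uniform points in the unit cube
    dists = distance.pdist(points)           # all pairwise distances
    stats[dim] = (dists.mean(), dists.var())
    print('dim {0}: mean={1:.3f}, var={2:.4f}'.format(dim, *stats[dim]))
```

The mean clearly grows with the dimension, which is the behavior the sketch above predicts.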
I have seen corresponding figures in papers. (I would be able to show and cite them, but I am not sure about copyrights.) I want to recreate those figures. But when I create random points, calculate the distances, and put them into a histogram, it looks like this:
The first one is the plain histogram of the distances. In the second one, the bins are scaled by the maximum possible distance for the respective dimension. In the third one I overlaid them by subtracting the value of the lowest non-empty bin. The first one obviously shows the increasing means, but neither increasing height nor decreasing variance. In the second one you can see the concentration of the distances, but not the heights or means. This is not what I saw in those papers. The third one indicates that the curves all have the same shape. Again, not what I expected. This is the code I used to create the second figure.
    import math
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial import distance

    if __name__ == '__main__':
        records = 1000
        binwidth = 0.01
        precision = 10
        fig, (ax1, ax2, ax3) = plt.subplots(3, 1)
        fig.set_size_inches(6, 10)
        for dimensions in [2, 10, 100, 1000]:
            data = np.random.uniform(low=0, high=1, size=(records, dimensions))
            data = np.around(data, decimals=precision)
            distlist = distance.pdist(data)          # all pairwise distances
            variance = np.var(distlist)
            print('dim {dim}: var={var}'.format(dim=dimensions, var=variance))
            maxdist = math.sqrt(dimensions)          # diameter of the unit cube
            bins = [round(x * binwidth, 2) for x in range(round(maxdist / binwidth) + 2)]
            histo = np.histogram(distlist, bins)
            # pair each bin's upper edge with its count, then drop empty bins
            harr = np.stack((histo[1][1:], histo[0]), axis=-1)
            masked = harr[harr[:, 1] > 0]
            ax1.plot(masked[:, 0], masked[:, 1], label='dim {dim}'.format(dim=dimensions))
            ax2.plot(masked[:, 0] / maxdist, masked[:, 1], label='dim {dim}'.format(dim=dimensions))
            lowest = np.min(masked[:, 0])            # avoid shadowing the built-in min
            ax3.plot(masked[:, 0] - lowest, masked[:, 1], label='dim {dim}'.format(dim=dimensions))
        ax1.set_title('histogram')
        ax1.legend()
        ax2.set_title('histogram, bins divided by max distance')
        ax2.legend()
        ax3.set_title('histogram, bins shifted by the lowest non-empty bin')
        ax3.legend()
        plt.show()
I see two points where I may be wrong:
- my understanding of distance distributions
- the code does not do what it is intended to do
How do I draw a histogram with bins of width 0.01 showing the number of distances between pairs of uniformly distributed points in the unit cube?
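For reference, here is a minimal version of that histogram as I understand the question (a sketch with my own variable names, one dimension only), using explicit bin edges of width 0.01 up to the cube's diameter:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance

dim = 10
pts = np.random.default_rng(0).uniform(size=(1000, dim))
dists = distance.pdist(pts)                       # all pairwise distances
# bin edges of width 0.01 covering [0, sqrt(dim)], the maximum possible distance
edges = np.arange(0, dim ** 0.5 + 0.01, 0.01)
plt.hist(dists, bins=edges)
plt.xlabel('distance')
plt.ylabel('number of pairs')
plt.show()
```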
EDIT: OK, thanks to @Henry, it seems that what I did was correct, but not what I should have done to produce the results I wanted. What I want is to produce the following figures myself.
The image is taken from
Pestov, Vladimir (2007): Intrinsic dimension of a dataset: what properties does one expect? In: 2007 IEEE International Joint Conference on Neural Networks Proceedings. International Joint Conference on Neural Networks.
Fig. 14 of
Chávez, Edgar; Navarro, Gonzalo; Baeza-Yates, Ricardo; Marroquín, José Luis (2001): Searching in metric spaces. In: ACM Computing Surveys 33 (3), pp. 273–321. DOI: 10.1145/502807.502808.
is a similar one.
The papers do not describe how these figures were created. How can I do this?
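One way that seems to match such figures (a sketch of my own, not necessarily the authors' exact procedure): divide each distance by the diameter sqrt(d) of the unit cube, and draw a density histogram on that common [0, 1] axis, so the bin width is fixed relative to the maximum distance. The peaks then grow taller and narrower with dimension:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance

rng = np.random.default_rng(0)
edges = np.arange(0, 1.01, 0.01)           # fixed bins on the normalized axis
for dim in [2, 10, 100, 1000]:
    pts = rng.uniform(size=(1000, dim))
    d = distance.pdist(pts) / dim ** 0.5   # scale by the diameter sqrt(dim)
    # density=True turns counts into a probability density, so curves for
    # different dimensions are comparable; higher dim -> taller, narrower peak
    plt.hist(d, bins=edges, density=True, histtype='step',
             label='dim {0}'.format(dim))
plt.xlabel('distance / max distance')
plt.legend()
plt.show()
```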