$$\newcommand{\n}{\hat{n}}\newcommand{\w}{\hat{\omega}}\newcommand{\wi}{\w_\mathrm{i}}\newcommand{\wo}{\w_\mathrm{o}}\newcommand{\wh}{\w_\mathrm{h}}\newcommand{\Li}{L_\mathrm{i}}\newcommand{\Lo}{L_\mathrm{o}}\newcommand{\Le}{L_\mathrm{e}}\newcommand{\Lr}{L_\mathrm{r}}\newcommand{\Lt}{L_\mathrm{t}}\newcommand{\O}{\mathrm{O}}\newcommand{\degrees}{{^{\large\circ}}}\newcommand{\T}{\mathsf{T}}\newcommand{\mathset}[1]{\mathbb{#1}}\newcommand{\Real}{\mathset{R}}\newcommand{\Integer}{\mathset{Z}}\newcommand{\Boolean}{\mathset{B}}\newcommand{\Complex}{\mathset{C}}\newcommand{\un}[1]{\,\mathrm{#1}}$$

  

Theory and statistics

Can we analyse the layout further? In particular, there are several questions that remain open.

One way to address these issues is to take a stochastic approach. This can be done analytically, using probabilistic number theory to determine factorisation relations among random integers and then linking this to random graph theory and/or random matrix theory (e.g. what is the distribution of eigenvalues of the factorisation connectivity graph?).

There are also empirical approaches, where we can analyse the results statistically, as well as compare them against surrogate datasets simulated from analytical or statistical estimates (e.g. comparing against a draw from a random graph with the same edge weight distribution).

UMAP is fundamentally a topological approach, and our argument might proceed along those lines.

  

Notation

  

What is the transform doing?

We can see the transform as being analogous to a logarithmic transform, replacing a product of primes with a sum of vectors.

For any natural number \(i \in \mathbb{N}\), by the fundamental theorem of arithmetic:

$$i = \prod_{k=1}^{\pi(i)} p_k^{a_k},$$ where \(p_k\) is the \(k\)th prime, \(\pi(\cdot)\) is the prime-counting function, and \(a_k \geq 0\) are the corresponding exponents of each prime \(p_k\).

Replacing \(i\) with the corresponding squarefree \(i^\prime\), in which every exponent is reduced to 0 or 1:

$$i^\prime = \prod_{k=1}^{\pi(i^\prime)} p_k^{b_k},$$ where \(b_k\) is a binary indicator variable (0 or 1). By the properties of the logarithm,

$$\log(i^\prime) = \sum_{k=1}^{\pi(i^\prime)} b_k \log(p_k).$$

In the vector space representation, \(i^\prime\) is represented as a vector \(\mathbf{x_i}\): $$\mathbf{x_i} = \sum_{k=1}^{\pi(i^\prime)} b_k \mathbf{e_k},$$ where \(\mathbf{e_k}\) is the \(k\)th standard basis vector, all zeros except a 1 in position \(k\). These basis vectors are orthogonal, and so do not immediately interact in the sum in the way the log terms do. However, when we apply dimensionality reduction, we often form linear combinations of these vectors, which correspond to products in the original number domain.
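As a concrete sketch (assuming numpy and sympy are available; the function name `factor_vector` is illustrative, not from the original pipeline), the vector for an integer can be built directly from its factorisation:

```python
import numpy as np
from sympy import factorint, primepi

def factor_vector(i: int, n_primes: int) -> np.ndarray:
    """Binary indicator vector of the distinct prime factors of i,
    i.e. the vector of the squarefree part i'."""
    x = np.zeros(n_primes)
    for p in factorint(i):       # keys of factorint are the distinct primes dividing i
        x[primepi(p) - 1] = 1    # primepi(p) = k for the k-th prime p_k
    return x

# 12 = 2^2 * 3 has squarefree part 6 = 2 * 3, so only the first two slots are set:
print(factor_vector(12, 6))      # [1. 1. 0. 0. 0. 0.]
```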

  

The inner product and norms

\(\omega(i)\) is the count of distinct prime factors of \(i\), and is therefore equal to the number of non-zero entries of \(\mathbf{x_i}\): \(\omega(i)=\sum_k x_{ik}\). This means that we can derive the Minkowski \(p\)-norms in terms of \(\omega(i)\):

$$\|x_i\|_p = \left[ \sum_k x_{ik}^p \right]^{1/p}.$$ Using the fact that \(x_i\) has only binary elements for \(i \in \mathbb{N}^{[2]}\) (so \(x_{ik}^p = x_{ik}\)), $$\|x_i\|_p = \left[ \sum_k x_{ik} \right]^{1/p} = \omega(i)^{1/p}.$$
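A quick numerical check of this identity (a minimal sketch assuming numpy and sympy; the vector is built inline over a small prime basis):

```python
import numpy as np
from sympy import primefactors, primepi

i, p = 210, 3                    # 210 = 2*3*5*7 is squarefree, so omega(210) = 4
x = np.zeros(10)
x[[int(primepi(q)) - 1 for q in primefactors(i)]] = 1
assert np.isclose(np.linalg.norm(x, ord=p), len(primefactors(i)) ** (1 / p))
```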

There is a second important relation: the number of distinct prime factors of the greatest common divisor of two numbers \(i^\prime\) and \(j^\prime\), \(\omega(\gcd(i^\prime, j^\prime))\), is equal to the dot product of \(\mathbf{x_i}\) and \(\mathbf{x_j}\), \(\mathbf{x_i} \cdot \mathbf{x_j}\). This follows from the fact that the dot product counts the factors present in both \(\mathbf{x_i}\) and \(\mathbf{x_j}\).

$$\omega(\gcd(i,j)) = \mathbf{x_i} \cdot \mathbf{x_j}$$
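This too is easy to verify numerically (a sketch assuming numpy and sympy; `fvec` is a hypothetical helper that rebuilds the indicator vector):

```python
import numpy as np
from math import gcd
from sympy import primefactors, primepi

def fvec(i: int, n_primes: int = 10) -> np.ndarray:
    """Binary indicator vector of the distinct prime factors of i."""
    x = np.zeros(n_primes)
    x[[int(primepi(q)) - 1 for q in primefactors(i)]] = 1
    return x

i, j = 30, 105                   # 30 = 2*3*5 and 105 = 3*5*7 share the factors 3 and 5
assert np.dot(fvec(i), fvec(j)) == len(primefactors(gcd(i, j)))   # both equal 2
```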

This means that, assuming \(i, j \in \mathbb{N}^{[2]}\), we have our basic operations in the vector space: norms, via \(\|\mathbf{x_i}\|_p = \omega(i)^{1/p}\), and inner products, via \(\mathbf{x_i} \cdot \mathbf{x_j} = \omega(\gcd(i, j))\).

  

The cosine distance

The layouts plotted use the cosine distance:

$$d(x_i, x_j) = 1 - \frac{x_i \cdot x_j}{\|x_i\|_2\|x_j\|_2}.$$

By the results above,

$$d(x_i, x_j) = 1 - \frac{x_i \cdot x_j}{\sqrt{\omega(i)}\sqrt{\omega(j)}}\\ =1-\frac{\omega(\gcd(i, j))}{\sqrt{\omega(i)}\sqrt{\omega(j)}}. $$

Correspondingly, if we view the problem as a graph with one vertex per natural number \(i\) where edges are more heavily weighted the more closely numbers are related, then the edge weight \(w_{ij}\) between vertex \(i\) and \(j\) is: $$w_{ij}= \frac{\omega(\gcd(i, j))}{\sqrt{\omega(i)}\sqrt{\omega(j)}},$$ changing the distance to a similarity.

For general \(i,j \in \mathbb{N}\), we can define \(d(i,j)\) to be the equivalent, where the squarefreeing map \(i \rightarrow i^\prime, j \rightarrow j^\prime\) is performed before distances are computed:

$$d(i,j) = 1-\frac{\omega(\gcd(i^\prime, j^\prime))}{\sqrt{\omega(i^\prime)}\sqrt{\omega(j^\prime)}}.$$

The dot product is unchanged for general \(i,j\), since squarefreeing does not alter the set of common prime factors, but the \(p\)-norm would be different without the remapping to the squarefree numbers.
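Putting the pieces together, here is a sketch (assuming sympy; the helper names `squarefree` and `d` are illustrative) that computes the distance purely from \(\omega\) and the gcd, applying the squarefreeing map first:

```python
from math import gcd, sqrt
from sympy import factorint, primefactors

def squarefree(i: int) -> int:
    """Map i to its squarefree part i' by dropping repeated factors."""
    out = 1
    for p in factorint(i):
        out *= p
    return out

def d(i: int, j: int) -> float:
    """Cosine distance between x_i and x_j via the omega/gcd identity."""
    ip, jp = squarefree(i), squarefree(j)
    w = lambda m: len(primefactors(m))      # omega: count of distinct prime factors
    return 1 - w(gcd(ip, jp)) / (sqrt(w(ip)) * sqrt(w(jp)))

print(d(12, 18))   # both have squarefree part 6, so the distance is 0.0
print(d(4, 9))     # squarefree parts 2 and 3 are coprime, so the distance is 1.0
```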

  

Some statistics

We can put some simple bounds on the elements of vectors for a given limit \(n\).

For a matrix representing the numbers up to \(n\), there is one row per integer \(2 \leq i \leq n\) and one column per prime \(p_k \leq n\), giving \(\pi(n)\) columns.

Every \(i \geq 2\) has at least one prime factor, and the integer \(\leq n\) with the most distinct prime factors is the largest primorial \(p(k) \leq n\), where \(p(k)\) denotes the product of the first \(k\) primes. Therefore, $$1\leq \omega(i) \leq \max\{k : p(k) \leq n\}.$$ The total number of non-zero entries in the whole matrix is just $$\sum_{i=2}^n \omega(i).$$

\(p(k)\) is approximately $$ p(k) \approx e^{k\log(k)},$$ since the \(k\)th prime is approximately \(k \log k\) by the prime number theorem.

Primorial function versus \(e^{k \log k}\)
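The comparison is easy to reproduce (assuming sympy, whose `primorial(k)` returns the product of the first \(k\) primes); comparing logarithms keeps the numbers manageable:

```python
import numpy as np
from sympy import primorial

for k in [5, 10, 20, 50]:
    # log p(k) versus k log k, i.e. p(k) versus e^{k log k}
    print(k, np.log(float(primorial(k))), k * np.log(k))
```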

  

Erdős–Kac

There is a very elegant result in probabilistic number theory, the Erdős–Kac theorem. In summary, it says that the number of distinct prime factors of a (large) number \(i\) is $$\omega(i) \sim \mathcal{N}(\log \log i, \sqrt{\log \log i}).$$ That is, the number of distinct factors is a random variable, normally distributed with mean \(\log \log i\) and standard deviation \(\sqrt{\log \log i}\).

Erdős–Kac theorem. The empirical number of distinct factors for a given number, averaged over a running window of 500 consecutive numbers, against \(\log \log n\) with a constant of 0.31 added. There is a very close match.

It appears that the \(\log \log i\) result is close, but there is a constant offset of \(\approx 0.31\). This is consistent with the known average order of \(\omega\): \(\frac{1}{n}\sum_{i \leq n} \omega(i) = \log \log n + B_1 + o(1)\), where \(B_1 \approx 0.26\) is the Mertens constant.
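The running-window comparison can be sketched as follows (assuming numpy and sympy; the window is placed near \(n\) for simplicity):

```python
import numpy as np
from sympy import primefactors

n, width = 1_000_000, 500
window = range(n - width, n)                 # 500 consecutive integers near n
mean_omega = np.mean([len(primefactors(i)) for i in window])
print(mean_omega, np.log(np.log(n)) + 0.31)  # the two should agree closely
```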

  

n=1000000

We can examine these properties for the 1M vector example.

By Erdős–Kac we would expect an average of $$\frac{1}{n-1} \sum_{i=2}^{n} \mathbb{E}[\omega(i)] = \frac{1}{n-1} \sum_{i=2}^n (\log \log i + C) $$ entries per row. With \(C=0.31\) and \(n=10^6\), this works out to an expected 2.85 entries per row, or a matrix density of 0.0036398%; a pretty good approximation!
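Evaluating the estimate directly (a minimal sketch assuming numpy):

```python
import numpy as np

n, C = 1_000_000, 0.31
i = np.arange(2, n + 1)
mean_entries = np.mean(np.log(np.log(i)) + C)   # expected entries per row, ~2.85
print(mean_entries)
print(100 * mean_entries / 78498)               # density in %, with pi(10^6) = 78498 columns
```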

  

Sparsity

It should be apparent that this space is extremely sparsely populated. There are \(2^{78498}\) corners of the cube in which the data lie, of which at most \(10^6 \approx 2^{20}\) are occupied (fewer still, since integers sharing a squarefree part map to the same corner), so all but a fraction of about \(2^{-78478}\) of the corners are empty. This is ignoring the interior volume of the cube, which is entirely empty, though in this dimension the interior is a vanishingly small proportion of the volume compared to the volume near the corners.

  

Distribution of distances

In the layout plotted originally, distances are computed with the cosine distance. A natural question is: what is the distribution of distances between pairs of these vectors under this metric? Since the layout generated by UMAP depends only on the inter-vector distances, this gives us some insight into the layout we might expect. As discussed above, every vector corresponding to a prime is a distinct orthogonal basis vector, and so has a cosine distance of 1.0 to every other prime (90° separation).

The distribution can be examined empirically. The plot below shows a histogram of the distribution of the cosine distances between pairs of vectors. It is based on a random sample of 20M pairs of (non-equal) rows from the 1M point dataset. There is a clear spike at 1.0, but an interesting distribution in between.

Distribution of cosine distances between vectors representing random integers from 2 to 1M, computed from 20M samples.
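A sketch of the sampling experiment (assuming numpy and sympy), using the \(\omega\)/gcd identity so the full \(10^6 \times 78498\) matrix never has to be built; the sample is smaller than the 20M pairs used for the plot:

```python
import numpy as np
from math import gcd, sqrt
from sympy import primefactors

def d(i: int, j: int) -> float:
    """Cosine distance via the omega/gcd identity (squarefreeing leaves it unchanged)."""
    w = lambda m: len(primefactors(m))
    return 1 - w(gcd(i, j)) / (sqrt(w(i)) * sqrt(w(j)))

rng = np.random.default_rng(0)
n, n_pairs = 1_000_000, 100_000
pairs = rng.integers(2, n + 1, size=(n_pairs, 2))
dists = [d(int(i), int(j)) for i, j in pairs if i != j]
hist, _ = np.histogram(dists, bins=100, range=(0, 1))
print(hist[-1] / len(dists))     # mass in the top bin: the spike at 1.0 (coprime pairs)
```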

  

Distribution of GCD

By the definition above, we can approach the distribution of cosine distances from probabilistic number theory.

The distribution of \(\gcd(i,j)\) for random integers \(i, j\) is known (Fernández 2013; Diaconis and Erdős 2004). It has PMF:

$$ P(\gcd(i,j)=k) = \frac{1}{\zeta(2)k^2} + O\left(\frac{1+\log(n+k)}{nk}\right), \quad 1 \leq i,j,k \leq n $$

and expectation:

$$ \mathbb{E}(\gcd(i,j)) = \frac{1}{\zeta(2)} \log(n) + C + O\left(\frac{1+\log n}{\sqrt{n}}\right), \quad 1 \leq i,j \leq n$$ $$ = \frac{6}{\pi^2} \log(n) + C + O\left(\frac{1+\log n}{\sqrt{n}}\right)$$ $$ \approx \frac{6}{\pi^2} \log(n). $$

Predicted GCD of random integers \(i,j \sim U(2,n)\) and sample mean of the GCD of 2000 random integer pairs \(\leq n\) for various \(n\). The approximation is a reasonable fit.
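The experiment behind the plot can be reproduced in outline (assuming numpy; 2000 pairs per \(n\), as in the caption):

```python
import numpy as np
from math import gcd, log, pi

rng = np.random.default_rng(0)
for n in [10**3, 10**4, 10**5, 10**6]:
    pairs = rng.integers(2, n + 1, size=(2000, 2))
    sample_mean = np.mean([gcd(int(i), int(j)) for i, j in pairs])
    print(n, sample_mean, 6 / pi**2 * log(n))   # sample mean vs. the approximation
```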

Assuming that \(i,j \sim U(2,n)\), with \(\mathbb{E}[\omega(i)] = \log \log(i)\), and approximating the expected \(\omega\) of the gcd by \(\omega\) evaluated at the expected gcd, we can derive the expected value of the distances to be:

$$\mathbb{E}(d(i,j)) \approx 1-\frac{\log\log \left( \frac{6\log(n)}{\pi^2} \right)}{\sqrt{\log \log i}\sqrt{\log \log j}}, \quad 1 \leq i,j \leq n$$ $$\mathbb{E}(d(n)) \approx 1-\frac{\log\log \left( \frac{6\log(n)}{\pi^2} \right)} {\sqrt{\log \log (n/2)}\sqrt{\log \log (n/2)}}$$ $$ \approx 1-\frac{\log\log \left( \frac{6\log(n)}{\pi^2} \right)} {\log \log (n/2)}$$

where \(\mathbb{E}[d(n)]\) means the expected cosine distance of two numbers \(i,j \sim U(2,n)\).
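For reference, a one-line evaluation of this estimate (the function name `expected_d` is illustrative):

```python
from math import log, pi

def expected_d(n: float) -> float:
    """Closed-form estimate of the mean cosine distance for i, j ~ U(2, n)."""
    return 1 - log(log(6 * log(n) / pi**2)) / log(log(n / 2))

print(expected_d(1_000_000))   # ~0.71; compare with the sample mean of the histogram above
```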

  

Coprimality and random graphs

The probability that two numbers \(i,j \leq n\) are coprime (i.e. \(\gcd(i,j)=1\)) approaches \(\frac{1}{\zeta(2)}=\frac{6}{\pi^2}\) as \(n \rightarrow \infty\) (MathWorld). If we ignore the weighting of the graph edges and treat the edges as binary (an edge connects \(i\) and \(j\) whenever they share a prime factor), then the relations among the natural numbers \(\leq n\) described above form, approximately, an Erdős–Rényi random graph \(G(n, 1-\frac{1}{\zeta(2)}) \approx G(n, p=0.39)\). The graph will be almost surely connected if \(1-\frac{1}{\zeta(2)} > 5/n\), which holds for \(n \geq 13\).
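A quick simulation of the edge probability (a sketch assuming numpy):

```python
import numpy as np
from math import gcd, pi

rng = np.random.default_rng(0)
n, trials = 1_000_000, 200_000
pairs = rng.integers(2, n + 1, size=(trials, 2))
p_coprime = np.mean([gcd(int(i), int(j)) == 1 for i, j in pairs])
print(p_coprime, 6 / pi**2)    # ~0.608 coprime, so the edge probability is ~0.392
```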
