.
# Theory and statistics
Can we analyse the layout further? In particular there are several questions that remain:
* Can we derive general results about the layout for any N? For example, what distribution of clusters might we expect see.
* Can we identify which parts of this layout are specific to prime factorisation, and which are "common" properties?
One way to address these issues is to take a stochastic approach. This can either be approached analytically, looking at probabilistic number theory to determine factorisation relations among random integers and then linking this to random graph theory and/or random matrix theory (e.g. what is the distribution of eigenvalues of the factorisation connectivity graph).
There are also empirical approaches, where we can analyse results obtained statistically as well as compare against surrogate datasets which have been simulated based on analytical or statistical estimates (e.g. comparing with a draw from a sample of a random graph with the same edge weight distribution).
UMAP is fundamentally a topological approach. Our argument might proceed:
* probabilistic number theory -> distributions over properties of random integers -> distributions over edge weights of a graph.
* random graph theory -> graph theoretic properties of this random graph -> properties of the simplicial complex UMAP identifies -> properties of the layout as a function of the complex.
* [possibly] random matrix theory -> properties of the adjacency matrix of the graph
## Notation
* $i,j,n$ natural numbers, such that $i,j \leq n$
* $i^\prime$, the [squarefree](https://en.wikipedia.org/wiki/Square-free_integer) "version" of $i$ with all repeated prime factors collapsed to an exponent of 1. For example if $i=12=2^2 \times 3$, then $i^\prime=2^1 \times 3^1=6$.
* $s(i)$ is a function that maps integers to their squarefree forms ($i\rightarrow i^\prime$).
* $\mathbb{N}^{[2]}$ the set of all squarefree numbers, i.e. with prime factor exponent at most 1. $\{2,3,5,6,7,11,13,14,17,19,21,...\}$
* $\omega(i)$ the count of unique prime factors of $i$
* $p(k)$ the [primorial](https://en.wikipedia.org/wiki/Primorial) function for index $k$
* $p_k$ the $k$ th prime number.
* $\pi(n)$ the number of prime numbers $\leq n$.
* $\zeta(z)$ the [Riemann zeta function](https://en.wikipedia.org/wiki/Riemann_zeta_function).
## What is the transform doing?
We can see the transform as being analogous to a logarithmic transform, replacing a product of integers with a sum of vectors.
For any natural number $i \in \mathbb{N}$, by the fundamental theorem of arithmetic:
$$i = \prod_{k=1}^{\pi(i)} p_k^{a_k},$$ where $a_k$ are the corresponding exponents of each prime $p_k$.
Replacing with $i$ with the corresponding squarefree $i^\prime$ that has no repeated exponents:
$$i^\prime = \prod_{k=1}^{\pi(i^\prime)} b_k p_k,$$ where $b_k$ is a binary indicator variable (0 or 1). By the properties of the logarithm,
$$\log(i^\prime) = \log\left( \sum_{k=1}^{\pi(i^\prime)} b_k p_k \right)$$
In the vector space representation, $i^\prime$ is represented as a vector $\mathbf{x_i}$:
$$\mathbf{x_i} = \sum_{k=1}^{\pi(i^\prime)} b_k \mathbf{x_k},$$ where $x_k$ is the $k$ th standard basis vector, all zeros except a 1 in position $k$ . These vectors are orthogonal and so do not immediately interact in the sum in the way the log sum does. However, when we apply dimensional reduction, we often form linear combinations of these vectors, which correspond to products in the original number domain.
### The inner product and norms
$\omega(i)$ is the count of unique prime factors of $i$, and is therefore equal to the number of non-zero entries of $\mathbf{x_i}$, $\omega(i)=\sum_k x_{ik}$. This means that we can derive Minkowski $p$ norms in terms of $\omega(i)$
$$\|x_i\|_p = \left[ \sum_k x_{ik}^p \right]^{1/p},$$
using the fact that $x_i$ has only binary elements for $i \in \mathbb{N}^{[2]}$,
$$\|x_i\|_p = \left[ \sum_k x_{ik} \right]^{1/p}, $$
$$= \omega(i)^{1/p}. $$
There is a second important relation in that the unique factors of the greatest common divisor of two numbers $i^\prime$ and $j^\prime$ $\omega(\gcd(i^\prime, j^\prime))$ is equivalent to the dot product of $\mathbf{x_i}$ and $\mathbf{x_j}$, $\mathbf{x_i} \cdot \mathbf{x_j}$, which can be seen from the fact that the dot product will count the intersection of factors in both $x_i$ and $x_j$.
$$\omega(\gcd(i,j)) = \mathbf{x_i} \cdot \mathbf{x_j}$$
This means that in the vector space, we have our basic operations (assuming that $i, j \in \mathbb{N}^{[2]}$ ):
* $\mathbf{x_i}+\mathbf{x_j}$ equivalent to $ij$
* $a\mathbf{x_i}$ equivalent to $i^{a}$
* $\mathbf{0}$ (the zero vector, additive identity) equivalent to $1$.
* $\mathbf{x_i} \cdot \mathbf{x_j}$ = $\omega(\gcd(i,j))$
* The $p$ norm $\|\mathbf{x_i}\|_p = \omega(\mathbf{x_i})^{1/p}$
### The cosine distance
The layouts plotted use the cosine distance:
$$d(x_i, x_j) = 1 - \frac{x_i \cdot x_j}{\|x_i\|_2\|x_j\|_2}.$$
By the results above,
$$d(x_i, x_j) = 1 - \frac{x_i \cdot x_j}{\sqrt{\omega(i)}\sqrt{\omega(j)}}\\
=1-\frac{\omega(\gcd(i, j))}{\sqrt{\omega(i)}\sqrt{\omega(j)}}. $$
Correspondingly, if we view the problem as a graph with one vertex per natural number $i$ where edges are more heavily weighted the more closely numbers are related, then the edge weight $w_{ij}$ between vertex $i$ and $j$ is:
$$w_{ij}= \frac{\omega(\gcd(i, j))}{\sqrt{\omega(i)}\sqrt{\omega(j)}},$$
changing the distance to a similarity.
For general $i,j \in \mathbb{N}$, then we can define $d(i,j)$ to be the equivalent, where the sqaurefreeing map $i \rightarrow i^\prime, j \rightarrow j^\prime$ is performed before distances are computed.
$$d(i,j) = 1-\frac{\omega(\gcd(i^\prime, j^\prime)}{\sqrt{\omega(i^\prime)}\sqrt{\omega(j^\prime)}}.$$
The dot product is not altered for general $i,j$, but the $p$ norm would be different without remapping to the squarefree numbers.
## Some statistics
We can put some simple bounds on the elements of vectors for a given limit $n$.
For a matrix representing $n$ numbers:
* There will be $\pi(n)$ columns and $n$ rows.
* We can characterise the number of non-zero elements $\omega(i)$ in each row $i$.
* Every row has at least one non-zero value (exactly one where prime) $\omega(i)\geq 1$
* Any row can have at most as many non-zero values as the largest $k$ for which the [primorial](https://en.wikipedia.org/wiki/Primorial) $p(k)$ < $i$
Therefore,
$$1\leq \omega(i) \leq p(k)$$
The total number of non-zero vectors in the whole matrix is just $$\sum_{i=2}^n \omega(i).$$
$p(k)$ [is approximately](https://math.stackexchange.com/questions/106260/interpolating-the-primorial-p-n)
$$ p(k) \approx e^{k\log(k)}$$
![Primorial function versus $e^{k \log k}$](imgs/primorial_approximation.svg)
#### Erdős–Kac
There is a very elegant result in probabilistic number theory, the [Erdős–Kac theorem](https://en.wikipedia.org/wiki/Erd%C5%91s%E2%80%93Kac_theorem). This, in summary, says that the number of unique prime factors of (large) $i$ is $$\omega(i) \sim \mathcal{N}(\log \log i, \sqrt{\log \log i}).$$ That is, the number of unique factors is a random variable normally distributed with mean $\log \log n$ and standard deviation $\sqrt{\log \log n}.$
![Erdos-Kac theorem. The empirical number of factors for a given number, averaged over a running window of 500 consecutive numbers, against $\log \log n$, with a constant of 0.31 added. There is a very close match.](imgs/erdos_kac.png)
It appears that the $\log \log i$ result is close, but there is a constant offset of $\approx 0.31$.
### n=1000000
We can examine these properties for the 1M vector example.
* 100000 rows, 78498 columns, as $\pi(1000000)= 78498$.
* Min 1 non-zero element per row, max 7 elements per row.
* For $n=1000000$, the largest index is $k=7$ as $p_7=510510$ and $p_8=9699690$.
* Thus every row has between 1 and 7 non-zero elements.
* 0.003653% of all entries (2853710 of 78498000000) are non-zero
* Each row has mean 2.85 entries, std. dev. 0.99
By Erdős-Kac we would expect there to be $$\frac{1}{n-2} \sum_{i=2}^{n} \mathbb{E}[\omega(i)] = \frac{1}{n-2} \sum_{i=2}^n \log \log i + C
$$ entries. With C=0.31, For 1M integers, this works out to an expected 2.85 entries per row or a matrix density of 0.0036398%; a pretty good approximation!
#### Sparsity
It should be apparent that this space is extremely sparsely populated. There are $2^{78498}$ corners of the cube in which the data lies, of which only 2853710 (about $2^{21}$) are occupied, so $2^{78477}$ corners are empty. This is ignoring the interior volume of the cube, which is entirely empty, though a vanishingly small proportion of the volume compared to the volume near the corners.
### Distribution of distances
In the layout plotted originally, distances are computed with the cosine distance. One question that occurs is: what is the distribution of distances of pairs of these vectors under this metric? Since the layout generated by UMAP depends only on the inter-vector distances, this gives us some insight into the vectors we might expect. As discussed above, every vector corresponding to a prime is an orthogonal basis vector and has a cosine distance of 1.0 to every other prime (90 degrees separation).
The distribution can be examined empirically. The plot below shows a histogram of the distribution of the cosine distances between pairs of vectors. It is based on a random sample of 20M pairs of (non-equal) rows from the 1M point dataset. There is a clear spike at 1.0, but an interesting distribution in between.
![Distribution of cosine distances between vectors representing random integers from 2 to 1M, computed from 20M samples.](imgs/cosine_histogram.svg)
#### Distribution of GCD
By the definition above, we can approach the distribution of cosine distances from probabilistic number theory.
The distribution of $\gcd(i,j)$ for random integers $i, j$ is known [Fernandez 2013](https://arxiv.org/pdf/1305.0536.pdf) and [Diaconis and Erdos 2004](https://projecteuclid.org/download/pdf_1/euclid.lnms/1196285379). It has PMF:
$$
P(\gcd(i,j)=k) = \frac{1}{\zeta(2)k^2} + O\left(\frac{1+\log(n+k)}{nk}\right), \quad 1 \leq i,j,k \leq n
$$
and expectation:
$$ \mathbb{E}(\gcd(i,j)) = \frac{1}{\zeta(2)} \log(n) + C + O\left(\frac{1+\log(n/k)}{\sqrt{n}}\right), \quad 1 \leq i,j \leq n$$
$$ = \frac{6}{\pi^2} \log(n) + C + O\left(\frac{1+\log(n/k)}{\sqrt{n}}\right),$$
$$ \approx \frac{6}{\pi^2} \log(n), $$
![Predicted GCD of random integers $i,j \sim U(2,n)$ and sample mean of the GCD of 2000 random integer pairs $\leq n$ for various $n$. The approximation is a reasonable fit.](imgs/predicted_gcd.svg)
Assuming that $i,j \sim U(2,n)$, and with $\mathbb{E}[\omega(i)] = \log \log(i)$ we can
therefore derive the expected value of the distances to be:
$$\mathbb{E}(d(i,j)) \approx 1-\frac{\log\log \left( \frac{6\log(n)}{\pi^2} \right)}{\sqrt{\log \log i}\sqrt{\log \log j}}, \quad 1 \leq i,j \leq n$$
$$\mathbb{E}(d(n)) \approx 1-\frac{\log\log \left( \frac{6\log(n)}{\pi^2} \right)} {\sqrt{\log \log (n/2)}\sqrt{\log \log (n/2)}}$$
$$ \approx 1-\frac{\log\log \left( \frac{6\log(n)}{\pi^2} \right)} {\log \log (n/2)}$$
where $\mathbb{E[d(n)]}$ means the expected cosine distance of two numbers $i,j \sim U(2,n)$.
### Coprimality and random graphs.
The probability that two numbers $i,j \leq n$ are coprime (i.e. $\gcd(i,j)=1$) approaches $\frac{1}{\zeta(2)}=\frac{6}{\pi^2}$ [MathWorld](http://mathworld.wolfram.com/RelativelyPrime.html). If we ignore the weighting of the graph edges and treat the edges as binary, then the relations between the set of natural numbers $\leq n$ as described above is an [Erdős-Rényi](https://jeremykun.com/2013/08/22/the-erdos-renyi-random-graph/) random graph $G(n, 1-\frac{1}{\zeta(2)}) \approx G(n, p=0.39)$. The graph will be almost surely connected if $1-\frac{1}{\zeta(2)} > 5/n$, which happens for $n=13$.