On the equivalence of NMF and spectral clustering

Proof of $J_{\mathbf{K}}=\sum_{k=1}^{K} \sum_{i \in C_{k}}\left|\mathbf{x}{i}-\mathbf{m}{k}\right|^{2}=c_{2}-\sum_{k} \frac{1}{n_{k}} \sum_{i, j \in C_{k}} \mathbf{x}{i}^{T} \mathbf{x}{j}$. (Equation (4).)

\[\begin{aligned} J_{\mathbf{K}}&=\sum_{k=1}^{K} \sum_{i \in C_{k}}\left\|\mathbf{x}_{i}-\mathbf{m}_{k}\right\|^{2}\\&=\sum_{k=1}^{K} \sum_{i \in C_{k}}\left(\mathbf{x}_{i}^{T}\mathbf{x}_{i}-2\mathbf{x}_{i}^{T}\mathbf{m}_{k}+\mathbf{m}_{k}^{T}\mathbf{m}_{k}\right)\\&=\sum_{k=1}^{K} \sum_{i \in C_{k}}\mathbf{x}_{i}^{T}\mathbf{x}_{i}-2\sum_{k=1}^{K} \sum_{i \in C_{k}}\mathbf{x}_{i}^{T}\left(\sum_{j \in C_{k}}\frac{\mathbf{x}_{j}}{n_{k}}\right)+\sum_{k=1}^{K} \sum_{i \in C_{k}}\sum_{j \in C_{k}}\frac{\mathbf{x}_{j}^{T}}{n_{k}}\sum_{q \in C_{k}}\frac{\mathbf{x}_{q}}{n_{k}}\\&=\sum_{i \in \mathcal{N}}\mathbf{x}_{i}^{T}\mathbf{x}_{i}-2\sum_{k=1}^{K} \frac{1}{n_{k}}\sum_{i \in C_{k}}\sum_{j \in C_{k}}\mathbf{x}_{i}^{T}\mathbf{x}_{j}+\sum_{k=1}^{K} \frac{1}{n_{k}^{2}}\sum_{i \in C_{k}}\sum_{j \in C_{k}}\sum_{q \in C_{k}}\mathbf{x}_{j}^{T}\mathbf{x}_{q}\\&=\sum_{i \in \mathcal{N}}\mathbf{x}_{i}^{T}\mathbf{x}_{i}-2\sum_{k=1}^{K} \frac{1}{n_{k}}\sum_{i \in C_{k}}\sum_{j \in C_{k}}\mathbf{x}_{i}^{T}\mathbf{x}_{j}+\sum_{k=1}^{K} \frac{1}{n_{k}^{2}}\left(n_{k}\sum_{j \in C_{k}}\sum_{q \in C_{k}}\mathbf{x}_{j}^{T}\mathbf{x}_{q}\right)\\&=\sum_{i \in \mathcal{N}}\mathbf{x}_{i}^{T}\mathbf{x}_{i}-2\sum_{k=1}^{K} \frac{1}{n_{k}}\sum_{i \in C_{k}}\sum_{j \in C_{k}}\mathbf{x}_{i}^{T}\mathbf{x}_{j}+\sum_{k=1}^{K} \frac{1}{n_{k}}\sum_{i \in C_{k}}\sum_{j \in C_{k}}\mathbf{x}_{i}^{T}\mathbf{x}_{j}\\&=c_{2}-\sum_{k} \frac{1}{n_{k}} \sum_{i, j \in C_{k}} \mathbf{x}_{i}^{T} \mathbf{x}_{j}. \end{aligned}\]

Proof of $J_{\mathbf{K}}=\operatorname{Tr}\left(X^{T} X\right)-\operatorname{Tr}\left(H^{T} X^{T} X H\right)$.

1st part.

The first part is relatively easier. Since $\operatorname{Tr}\left(X^{T} X\right)=|X|F^2=\sum{i \in \mathcal{N}}|\mathbf{x}_{i}|^2$.

2nd part.

The second part can be derived as follows. It is easy to see that $\begin{aligned} \operatorname{Tr}\left(H^{T} X^{T} X H\right)=\operatorname{Tr}\left(\left(X H\right)^{T} \left(X H\right)\right)=\|XH\|_F^2=\sum_{k=1}^{K}\|\left(XH\right)_k\|^2, \end{aligned}$ and $\begin{aligned} XH=\begin{bmatrix} (\mathbf{x}^{1})^T \\ (\mathbf{x}^{2})^T \\ \vdots \\ (\mathbf{x}^{p})^T \end{bmatrix} \begin{bmatrix} \mathbf{h}_{1} & \mathbf{h}_{2} & \cdots & \mathbf{h}_{k} \end{bmatrix}=\begin{bmatrix} \langle\mathbf{x}^{1},\mathbf{h}_{1}\rangle & \langle\mathbf{x}^{1},\mathbf{h}_{2}\rangle & \cdots & \langle\mathbf{x}^{1},\mathbf{h}_{K}\rangle \\ \langle\mathbf{x}^{2},\mathbf{h}_{1}\rangle & \langle\mathbf{x}^{2},\mathbf{h}_{2}\rangle & \cdots & \langle\mathbf{x}^{2},\mathbf{h}_{K}\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle\mathbf{x}^{p},\mathbf{h}_{1}\rangle & \langle\mathbf{x}^{p},\mathbf{h}_{2}\rangle & \cdots & \langle\mathbf{x}^{p},\mathbf{h}_{K}\rangle \end{bmatrix}, \end{aligned}$ which indicates $(XH)k=\begin{bmatrix}\langle\mathbf{x}^{1},\mathbf{h}{k}\rangle \ \langle\mathbf{x}^{2},\mathbf{h}{k}\rangle \ \vdots \ \langle\mathbf{x}^{p},\mathbf{h}{k}\rangle\end{bmatrix}$.

From now on, we could obtain that: $\begin{aligned} \sum_{k=1}^{K}\|\left(XH\right)_k\|^2 &=\sum_{k=1}^{K}\left\|\begin{bmatrix}\langle\mathbf{x}^{1},\mathbf{h}_{k}\rangle \\ \langle\mathbf{x}^{2},\mathbf{h}_{k}\rangle \\ \vdots \\ \langle\mathbf{x}^{p},\mathbf{h}_{k}\rangle\end{bmatrix}\right\|^2\\&=\sum_{k=1}^{K}\frac{1}{n_k}\left\|\begin{bmatrix}\langle\mathbf{x}^{1},\mathbf{h}_{k}^{\prime}\rangle \\ \langle\mathbf{x}^{2},\mathbf{h}_{k}^{\prime}\rangle \\ \vdots \\ \langle\mathbf{x}^{p},\mathbf{h}_{k}^{\prime}\rangle\end{bmatrix}\right\|^2\\&=\sum_{k=1}^{K}\frac{1}{n_k}\begin{bmatrix}\sum_{i \in C_k}x_{i1} & \sum_{i \in C_k}x_{i2} & \cdots &\sum_{i \in C_k}x_{ip}\end{bmatrix}\begin{bmatrix}\sum_{j \in C_k}x_{j1}\\ \sum_{j \in C_k}x_{j2} \\ \vdots \\ \sum_{j \in C_k}x_{jp}\end{bmatrix}\\&=\sum_{k=1}^{K}\frac{1}{n_k}\sum_{i \in C_k}\sum_{j \in C_k}\sum_{q=1}^{p}x_{iq}x_{jq}\\&=\sum_{k=1}^{K}\frac{1}{n_k}\sum_{i,j \in C_k}\mathbf{x}_{i}^{T}\mathbf{x}_{j}. \end{aligned}$

Details in Proof for Theorem1.

\[\begin{aligned} &\hspace{1.16em}\left\|W\right\|^{2}-2 \operatorname{Tr}\left(H^{T} W H\right)+\left\|H^{T} H\right\|^{2} \\&=\operatorname{Tr}\left(W^{T}W\right)-2\operatorname{Tr}\left(H^{T} W H\right)+\operatorname{Tr}\left(H^{T} H H^{T} H\right)\\&=\operatorname{Tr}\left(W^{T}W-2WHH^{T}+H H^{T} H H^{T}\right)\\&=\operatorname{Tr}\left(\left(W-HH^{T}\right)^{T}\left(W-HH^{T}\right)\right)\\&=\left\|W-HH^{T}\right\|_F^2. \end{aligned}\]

A more thorough proof of Theorem 2.

From above, we acquire that

$\min _{H \geq 0} J_{1}=\left\|W-H H^{T}\right\|^{2}$ is equivalent to

$\min _{H \geq 0} -2 \operatorname{Tr}\left(H^{T} W H\right)+\left\|H^{T} H\right\|^{2}.$ In the above optimization scheme, we need to optimize 2 objectives simultaneously, i.e.,

$\max _{H \geq 0} \operatorname{Tr}\left(H^{T} W H\right)$, and
$\min _{H \geq 0}\left|H^{T} H\right|^{2}$.

The only distinction between 1. and $k$-means $\max {H^{T} H=I, H \geq 0} J{\mathrm{W}}(H)=\operatorname{Tr}\left(H^{T} W H\right)$ is the orthogonal constraint exerted on $H$. We are curious whether the 2. could bring orthogonality to $H$. Note that: $\left\|H^{T} H\right\|^{2}=\sum_{\ell k}\left(H^{T} H\right)_{\ell k}^{2}=\sum_{\ell \neq k}\left(\mathbf{h}_{\ell}^{T} \mathbf{h}_{k}\right)^{2}+\sum_{k}\left(\mathbf{h}_{k}^{T} \mathbf{h}_{k}\right)^{2},$ hence, minimize 2. is equivalent to minimize $\sum_{\ell \neq k}\left(\mathbf{h}{\ell}^{T} \mathbf{h}{k}\right)^{2}$ and $\sum_{k}\left(\mathbf{h}{k}^{T} \mathbf{h}{k}\right)^{2}$. Since $H$ is non-negative, minimize $\sum_{\ell \neq k}\left(\mathbf{h}{\ell}^{T} \mathbf{h}{k}\right)^{2}$ will force $\mathbf{h}{\ell}^{T} \mathbf{h}{k}$ becomes near $0$ for all $l \neq k$. On the other hand, minimize $\sum_{k}\left(\mathbf{h}{k}^{T} \mathbf{h}{k}\right)^{2}$ is as same as minimizing: $\min \left\|\mathbf{h}_{1}\right\|^{4}+\cdots+\left\|\mathbf{h}_{K}\right\|^{4}.$ Since we are optimizing 1. and 2. simultaneously, we can not obtain $\left|\mathbf{h}{i}\right|=0$ for all $i$ although that best minimizes the above objective. In other words, that is not the optimal solution of NMF. Since we are optimizing NMF, so $W$ and $HH^T$ are close to each other. Consider another constraint: $C=\sum_{i j} w_{i j} \approx \sum_{i j}\left(H H^{T}\right)_{i j}=\sum_{k i j} h_{i k} h_{j k}=\sum_{k}\left|\mathbf{h}_{k}\right|^{2},$ we obtain a constrained optimization problem as follow: $\min \left\|\mathbf{h}_{1}\right\|^{4}+\cdots+\left\|\mathbf{h}_{K}\right\|^{4} \\ \hspace{1.2em} \text{s.t.} \left|\mathbf{h}_{1}\right|^{2}+\cdots+\left|\mathbf{h}_{K}\right|^{2}=C.$ Write the Lagrange function of the above optimization problem: $\mathcal{L}=\left(\mathbf{h}_{1}^{T}\mathbf{h}_{1}\right)^{2}+\left(\mathbf{h}_{2}^{T}\mathbf{h}_{2}\right)^{2}+\cdots+\left(\mathbf{h}_{K}^{T}\mathbf{h}_{K}\right)^{2}+\lambda\left(\left(\mathbf{h}_{1}^{T}\mathbf{1}\right)^{2}+\left(\mathbf{h}_{2}^{T}\mathbf{1}\right)^{2}+\cdots+\left(\mathbf{h}_{K}^{T}\mathbf{1}\right)^{2}-C\right).$ Compute derivatives as follows: $\frac{\partial \mathcal{L}}{\partial \mathbf{h}_{1}}=2\mathbf{h}_{1}^{T}\mathbf{h}_{1}\cdot2\mathbf{h}_{1}+\lambda\cdot2\mathbf{h}_{1}^{T}\mathbf{1}\cdot\mathbf{1}=0, \\ \frac{\partial \mathcal{L}}{\partial \mathbf{h}_{2}}=2\mathbf{h}_{2}^{T}\mathbf{h}_{2}\cdot2\mathbf{h}_{2}+\lambda\cdot2\mathbf{h}_{2}^{T}\mathbf{1}\cdot\mathbf{1}=0, \\ \vdots \\ \frac{\partial \mathcal{L}}{\partial \mathbf{h}_{K}}=2\mathbf{h}_{K}^{T}\mathbf{h}_{K}\cdot2\mathbf{h}_{K}+\lambda\cdot2\mathbf{h}_{K}^{T}\mathbf{1}\cdot\mathbf{1}=0.$ So, $\lambda=\frac{-4\mathbf{h}_{1}^{T}\mathbf{h}_{1}\cdot\mathbf{h}_{1}}{2\mathbf{h}_{1}^{T}\mathbf{1}\cdot\mathbf{1}}=\frac{-4\mathbf{h}_{2}^{T}\mathbf{h}_{2}\cdot\mathbf{h}_{2}}{2\mathbf{h}_{2}^{T}\mathbf{1}\cdot\mathbf{1}}=\cdots=\frac{-4\mathbf{h}_{K}^{T}\mathbf{h}_{K}\cdot\mathbf{h}_{K}}{2\mathbf{h}_{K}^{T}\mathbf{1}\cdot\mathbf{1}}.$ By removing constants, we obtain $\frac{\mathbf{h}_{1}^{T}\mathbf{h}_{1}\cdot\mathbf{h}_{1}}{\mathbf{h}_{1}^{T}\mathbf{1}\cdot\mathbf{1}}=\frac{\mathbf{h}_{2}^{T}\mathbf{h}_{2}\cdot\mathbf{h}_{2}}{\mathbf{h}_{2}^{T}\mathbf{1}\cdot\mathbf{1}}=\cdots=\frac{\mathbf{h}_{K}^{T}\mathbf{h}_{K}\cdot\mathbf{h}_{K}}{\mathbf{h}_{K}^{T}\mathbf{1}\cdot\mathbf{1}},$ and we multiply numerator and denominator by $\mathbf{1}^{T}$, we could get $\frac{\mathbf{1}^{T}\mathbf{h}_{1}^{T}\mathbf{h}_{1}\cdot\mathbf{h}_{1}}{\mathbf{1}^{T}\mathbf{h}_{1}^{T}\mathbf{1}\cdot\mathbf{1}}=\frac{\mathbf{1}^{T}\mathbf{h}_{2}^{T}\mathbf{h}_{2}\cdot\mathbf{h}_{2}}{\mathbf{1}^{T}\mathbf{h}_{2}^{T}\mathbf{1}\cdot\mathbf{1}}=\cdots=\frac{\mathbf{1}^{T}\mathbf{h}_{K}^{T}\mathbf{h}_{K}\cdot\mathbf{h}_{K}}{\mathbf{1}^{T}\mathbf{h}_{K}^{T}\mathbf{1}\cdot\mathbf{1}},$ which means $\frac{\mathbf{h}_{1}^{T}\mathbf{h}_{1}\cdot(\mathbf{1}^{T}\mathbf{h}_{1})}{\mathbf{1}^{T}(\mathbf{h}_{1}^{T}\mathbf{1})\mathbf{1}}=\frac{\mathbf{h}_{2}^{T}\mathbf{h}_{2}\cdot(\mathbf{1}^{T}\mathbf{h}_{2})}{\mathbf{1}^{T}(\mathbf{h}_{2}^{T}\mathbf{1})\mathbf{1}}=\cdots=\frac{\mathbf{h}_{K}^{T}\mathbf{h}_{K}\cdot(\mathbf{1}^{T}\mathbf{h}_{K})}{\mathbf{1}^{T}(\mathbf{h}_{K}^{T}\mathbf{1})\mathbf{1}}.$ Canceling duplicates, we get: $\frac{\mathbf{h}_{1}^{T}\mathbf{h}_{1}}{\mathbf{1}^{T}\mathbf{1}}=\frac{\mathbf{h}_{2}^{T}\mathbf{h}_{2}}{\mathbf{1}^{T}\mathbf{1}}=\cdots=\frac{\mathbf{h}_{K}^{T}\mathbf{h}_{K}}{\mathbf{1}^{T}\mathbf{1}},$ which indicates that, $\left|\mathbf{h}{1}\right|=\left|\mathbf{h}{2}\right|=\cdots=\left|\mathbf{h}{K}\right|$.

We next prove $\left|\mathbf{h}_{i}\right| \neq 0$ for any $i$ by contradictory.

Assume $\left|\mathbf{h}{i}\right| = 0$, then $\left|\mathbf{h}{1}\right|=\left|\mathbf{h}{2}\right|=\cdots=\left|\mathbf{h}{K}\right|=0$, then $\left|\mathbf{h}_{i}\right| = 0$ for any $i$, which means all elements in $\mathbf{h}_i$ are zeroes. Hence, $

\mathbf{h}_{i}

=0$ for all $i$, and this further indicates $\left

\mathbf{h}_{1}\right

^{2}+\cdots+\left

\mathbf{h}_{K}\right

^{2}=0$, which breaks the constraint.

Since $\left|\mathbf{h}{i}\right| \neq 0$ and the norm is non-negative, we have $\left|\mathbf{h}{i}\right| > 0$.

Eventually, we could draw a conclusion that, by optimizing 2., we could obtain: $\mathbf{h}_{\ell}^{T} \mathbf{h}_{k} \approx\left\{\begin{array}{ll} 0 & \text { if } l \neq k, \\ \left\|\mathbf{h}_{k}\right\|^{2}>0 & \text { if } l=k. \end{array}\right.$