$\ell_{2,1}$-Norm Regularized Discriminative Feature Selection for Unsupervised Learning

Derivation of some formulas

Posted by Kwan on March 24, 2020

1. Notations

According to the paper, the notations used are summarized as follows: denote $\mathcal{X}=\left \lbrace x_{1}, x_{2}, \ldots, x_{n}\right\rbrace$ as the training set, where $x_{i} \in \Bbb{R}^{d}\,(1 \leq i \leq n)$ is the $i$-th datum and $n$ is the total number of training data; $I$ is the identity matrix; for a constant $m$, ${\bf{1}}_ {m}\in\Bbb{R}^{m}$ is a column vector with all of its elements equal to 1, and $H_{m}=I-\frac{1}{m} {\bf{1}}_ {m} {\bf{1}}_ {m}^{T} \in \Bbb{R}^{m \times m}$ is the corresponding centering matrix. Suppose the $n$ training data $x_{1}, x_{2}, \ldots, x_{n}$ are sampled from $c$ classes and there are $n_i$ samples in the $i$-th class. Then $y_{i} \in\lbrace0,1\rbrace^{c \times 1}\,(1 \leq i \leq n)$ is the label vector of $x_i$: its $j$-th element is 1 if $x_i$ belongs to the $j$-th class, and 0 otherwise. $Y=\begin{bmatrix}y_{1}& y_{2}& \ldots& y_{n}\end{bmatrix}^{T} \in\lbrace0,1\rbrace^{n \times c}$ is the label matrix, where $c$ is also the total number of clusters.
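To make the notation concrete, here is a minimal NumPy sketch that builds these objects for a toy problem; the dimensions $d$, $n$, $c$ and the cluster assignments are assumed purely for illustration (the later checks in this post reuse these variables):

```python
import numpy as np

# Toy dimensions, assumed only for illustration: d features, n samples, c clusters.
d, n, c = 4, 10, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(d, n))                        # data matrix; column i is x_i
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])  # cluster of each sample (all clusters non-empty)

ones_n = np.ones((n, 1))                           # the all-ones vector 1_n
H_n = np.eye(n) - (ones_n @ ones_n.T) / n          # centering matrix H_n = I - (1/n) 1_n 1_n^T

Y = np.zeros((n, c))                               # label matrix Y in {0,1}^{n x c}
Y[np.arange(n), labels] = 1                        # row i has a single 1 in column labels[i]
```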

2. Total scatter matrix

Property 1.

Total scatter matrix $S_{t}=\sum_{i=1}^{n}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{T}=\tilde{X} \tilde{X}^{T}$, where $\mu$ is the mean of all samples, and $\tilde{X}=X H_{n}$ is the data matrix after being centered.

Proof:

Since $H_{n}=I-\frac{1}{n} {\bf{1}}_ {n} {\bf{1}}_ {n}^{T}=I-\frac{1}{n} \begin{bmatrix}1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\\vdots & \vdots & \ddots & \vdots \\1 & 1 & \cdots & 1 \end{bmatrix}$, then $XH_ n=X-\frac{1}{n}X\begin{bmatrix}1 & 1 & \cdots & 1 \\1 & 1 & \cdots & 1 \\\vdots & \vdots & \ddots & \vdots \\1 & 1 & \cdots & 1 \end{bmatrix}=X-\frac{1}{n}\begin{bmatrix}x_1^1 & x_2^1 & \cdots & x_n^1 \\x_1^2 & x_2^2 & \cdots & x_n^2 \\\vdots & \vdots & \ddots & \vdots \\x_1^d & x_2^d & \cdots & x_n^d \end{bmatrix}\begin{bmatrix}1 & 1 & \cdots & 1\\1 & 1 & \cdots & 1\\\vdots & \vdots & \ddots & \vdots\\1 & 1 & \cdots & 1\end{bmatrix}\\ =X-\begin{bmatrix} \frac{1}{n}\sum_{i=1}^n x_i^1 & \frac{1}{n}\sum_{i=1}^n x_i^1 & \ldots & \frac{1}{n}\sum_{i=1}^n x_i^1 \\ \frac{1}{n}\sum_{i=1}^n x_i^2 & \frac{1}{n}\sum_{i=1}^n x_i^2 & \ldots & \frac{1}{n}\sum_{i=1}^n x_i^2 \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{n}\sum_{i=1}^n x_i^d & \frac{1}{n}\sum_{i=1}^n x_i^d & \ldots & \frac{1}{n}\sum_{i=1}^n x_i^d\end{bmatrix}=X-\it{M}$

Here, $x_i^j$ stands for the $j$-th element of the $i$-th datum. Note that all the columns of the matrix $\it{M}$ are identical, and each of them equals the mean $\mu$. We can therefore rewrite $\tilde{X}=\begin{bmatrix} (x_1-\mu) & (x_2-\mu) & \ldots & (x_n-\mu) \end{bmatrix}$, and hence we have:

\[S_{t}=\sum_{i=1}^{n}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{T}=\begin{bmatrix} (x_1-\mu) & (x_2-\mu) & \ldots & (x_n-\mu) \end{bmatrix}\begin{bmatrix} (x_1-\mu)^T \\\\ (x_2-\mu)^T \\\\ \vdots \\\\ (x_n-\mu)^T \end{bmatrix}=\tilde{X} \tilde{X}^{T},\]

which completes the proof.
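As a sanity check, the identity can also be verified numerically; this continues the toy NumPy sketch from the Notations section:

```python
# Numerical check of Property 1, reusing X, H_n and n from the sketch above.
mu = X.mean(axis=1, keepdims=True)     # global mean of all samples, shape (d, 1)
X_tilde = X @ H_n                      # centered data matrix

# X H_n subtracts the mean from every column ...
assert np.allclose(X_tilde, X - mu)

# ... and the sum of outer products equals X_tilde @ X_tilde.T.
S_t = sum(np.outer(X[:, i] - mu[:, 0], X[:, i] - mu[:, 0]) for i in range(n))
assert np.allclose(S_t, X_tilde @ X_tilde.T)
```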

3. Between class scatter matrix

Property 2.

Between class scatter matrix $S_{b}=\sum_{i=1}^{c} n_{i}\left(\mu_{i}-\mu\right)\left(\mu_{i}-\mu\right)^{T}=\tilde{X} G G^{T} \tilde{X}^{T}$, where $\mu_i$ is the mean of samples in the $i$-th class, and $G = \begin{bmatrix}G_{1} & \ldots & G_{n}\end{bmatrix}^{T}=Y\left(Y^{T} Y\right)^{-1 / 2}$ is the scaled label matrix.

Proof:

We first substitute $G =Y\left(Y^{T} Y\right)^{-1 / 2}$ into the formula, and rewrite $S_{b}=\tilde{X} Y\left(Y^{T} Y\right)^{-1 / 2}\left(\left(Y^{T} Y\right)^{-1 / 2}\right)^{T}Y^{T} \tilde{X}^{T}=\tilde{X} Y\left(Y^{T} Y\right)^{-1}Y^{T} \tilde{X}^{T}=\tilde{X} Y\left(Y^{T} Y\right)^{-1}(\tilde{X}Y)^{T}$, where the second equality holds because $Y^{T} Y$ is symmetric (in fact diagonal, as shown in part 2 below). We now divide the formula into two parts and analyze them respectively.
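Before splitting the formula, the identity $GG^{T}=Y(Y^{T}Y)^{-1}Y^{T}$ itself can be checked numerically, continuing the toy sketch above (since $Y^{T}Y$ is diagonal, its inverse square root is just the element-wise reciprocal square root of its diagonal):

```python
# Numerical check that G G^T = Y (Y^T Y)^{-1} Y^T, reusing Y from the sketch above.
YtY = Y.T @ Y                                   # c x c, diagonal (proved in part 2 below)
G = Y @ np.diag(1.0 / np.sqrt(np.diag(YtY)))    # G = Y (Y^T Y)^{-1/2}
assert np.allclose(G @ G.T, Y @ np.linalg.inv(YtY) @ Y.T)
```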

1. $\tilde{X} Y$ part.

According to Section 2, we have $\tilde{X}=\begin{bmatrix} (x_1-\mu) & (x_2-\mu) & \ldots & (x_n-\mu) \end{bmatrix}$; hence, $\tilde{X} Y=\begin{bmatrix} (x_1-\mu) & (x_2-\mu) & \ldots & (x_n-\mu) \end{bmatrix}\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{bmatrix}$. Since the column vector $y_i$ encodes the assigned cluster of the $i$-th datum, $y_i^T$ is a row vector whose elements are all 0s except a single 1 at the index of the cluster to which $x_i$ belongs. So, consider two rows $y_i^T,~y_j^T$ in the matrix $\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{bmatrix}$: if $x_i$ and $x_j$ both belong to the $o$-th cluster $c_o$, the two rows are identical, with their 1s at the $o$-th index, i.e., in the $o$-th column of the matrix. Furthermore, considering all the samples that belong to the $o$-th cluster $c_o$, they all contribute their 1s to the $o$-th column of the matrix, while the samples that do not belong to $c_o$ contribute 0s to that column. Now let's switch perspective to the columns of the matrix, $\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \\ \end{bmatrix}=\begin{bmatrix} y^1 & y^2 & \ldots & y^c \end{bmatrix}$: the $j$-th column $y^j$ is a vector of length $n$, and it is the indicator vector of cluster $j$, marking all the samples that belong to the $j$-th cluster. That is, if three samples $x_d,~x_e,~x_f$ belong to the $j$-th cluster, then the $d$-th, $e$-th and $f$-th elements of $y^j$ are 1, and all the other elements are 0. From this point of view, the matrix $\tilde{X} Y$ has a special meaning. Writing $\mu^i$ for the $i$-th element of the global mean $\mu$ (to avoid a clash with the class mean $\mu_i$), consider the element $(\tilde{X} Y)_ {ij}=\begin{bmatrix} (x_ 1^i-\mu^ i) & (x_ 2^i-\mu^ i) & \ldots & (x_ n^i-\mu^ i) \end{bmatrix}y^ j$; recalling that $y^ j$ indicates the samples which belong to the $j$-th cluster $c_j$, we get $(\tilde{X} Y)_ {ij}=\sum_ {x_ b \in c_j}(x_ b^i-\mu^ i)$. Extending this to the whole matrix, we obtain $\tilde{X} Y=\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b^1-\mu^ 1) & \sum_ {x_ b \in c_2} (x_ b^1-\mu^ 1) & \ldots & \sum_ {x_ b \in c_c} (x_ b^1-\mu^ 1) \\ \sum_ {x_ b \in c_1} (x_ b^2-\mu^ 2) & \sum_ {x_ b \in c_2} (x_ b^2-\mu^ 2) & \ldots & \sum_ {x_ b \in c_c} (x_ b^2-\mu^ 2) \\ \vdots & \vdots & \ddots & \vdots \\ \sum_ {x_ b \in c_1} (x_ b^d-\mu^ d) & \sum_ {x_ b \in c_2} (x_ b^d-\mu^ d) & \ldots & \sum_ {x_ b \in c_c} (x_ b^d-\mu^ d)\end{bmatrix}\\=\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b-\mu) & \sum_ {x_ b \in c_2} (x_ b-\mu) & \ldots & \sum_ {x_ b \in c_c} (x_ b-\mu) \end{bmatrix} \in \Bbb{R}^{d \times c}.$
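Numerically (continuing the sketch above), each column of $\tilde{X}Y$ indeed equals the sum of the centered samples of the corresponding cluster:

```python
# Numerical check: column j of X_tilde @ Y sums the centered samples of cluster j.
XtY = X_tilde @ Y
for j in range(c):
    cluster_centered = X[:, labels == j] - mu   # centered samples of the j-th cluster
    assert np.allclose(XtY[:, j], cluster_centered.sum(axis=1))
```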

2. $\left(Y^{T} Y\right)^{-1}$ part.

We first consider the matrix $Y^{T} Y$. Since $Y=\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{bmatrix}$, we have $Y^{T} Y=\begin{bmatrix} y_ 1 & y_ 2 & \ldots & y_ n \end{bmatrix}\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{bmatrix}=\sum_ {i=1}^n y_ i y_i^T$. Consider a term $y_ i y_i^T$ in the summation. Since $y_i$ encodes the index of the cluster that sample $x_ i$ belongs to, i.e., if $x_ i$ belongs to the $j$-th cluster then $y_ i^j=1$ and $y_ i^k=0$ for all $k \neq j$, $y_i$ is the vector $\begin{bmatrix} 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}^T$ with its single 1 at the $j$-th position. The outer product of $y_ i$ with itself, $y_ i y_i^T$, is therefore a matrix with only one non-zero element, of value 1, at position $(j,j)$ (necessarily on the diagonal), and all 0s elsewhere. Now consider two outer products $y_ a y_a^T$ and $y_ b y_b^T$: if the samples $x_ a$ and $x_ b$ belong to the same cluster, say the $l$-th cluster, then $y_ a y_a^T+y_ b y_b^T$ is a matrix with only one non-zero element, of value 2, at position $(l,l)$, and all 0s elsewhere. Furthermore, taking all the samples of the $l$-th cluster together, the sum of the outer products of their label vectors, $\sum_ {x_b \in c_ l}y_ b y_ b ^T$, is a matrix whose only non-zero element, of value $\sum_ {x_b \in c_ l}1$ (i.e., the size of the $l$-th cluster), sits at position $(l,l)$. From here we can easily extend the argument to the total sum of the outer products of the label vectors of all the samples: $Y^{T} Y=\sum_ {i=1}^n y_ i y_i^T =\sum_ {x_b \in c_ 1}y_ b y_ b ^T+\sum_ {x_b \in c_ 2}y_ b y_ b ^T+\dots+\sum_ {x_b \in c_ c}y_ b y_ b ^T\\=\begin{bmatrix} \sum_ {x_ b \in c_1} 1 & 0 & \ldots & 0 \\ 0 & \sum_ {x_ b \in c_2} 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sum_ {x_ b \in c_c} 1\end{bmatrix}=\begin{bmatrix} n_ 1 & 0 & \ldots & 0 \\ 0 & n_ 2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & n_ c\end{bmatrix}.$

Here, $n_i=\sum_ {x_ b \in c_i} 1$ denotes the size of the $i$-th cluster. Since $Y^{T} Y$ is a diagonal matrix (and invertible as long as every cluster is non-empty, so that each $n_i>0$), its inverse $(Y^{T} Y)^ {-1}$ is also a diagonal matrix, and $(Y^{T} Y)^ {-1}=\begin{bmatrix} \frac{1}{n_ 1} & 0 & \ldots & 0 \\ 0 & \frac{1}{n_ 2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots &\frac{1}{n_ c}\end{bmatrix}$.
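A quick numerical confirmation, again continuing the sketch above:

```python
# Numerical check: Y^T Y = diag(n_1, ..., n_c) and its inverse is diag(1/n_i).
sizes = np.bincount(labels, minlength=c)        # cluster sizes n_i
assert np.allclose(Y.T @ Y, np.diag(sizes))
assert np.allclose(np.linalg.inv(Y.T @ Y), np.diag(1.0 / sizes))
```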

3. Aggregation

Since $\tilde{X} Y=\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b-\mu) & \sum_ {x_ b \in c_2} (x_ b-\mu) & \ldots & \sum_ {x_ b \in c_c} (x_ b-\mu) \end{bmatrix} $ and $(Y^{T} Y)^ {-1}=\begin{bmatrix} \frac{1}{n_ 1} & 0 & \ldots & 0 \\ 0 & \frac{1}{n_ 2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots &\frac{1}{n_ c}\end{bmatrix}$, then,

$\tilde{X} Y\left(Y^{T} Y\right)^{-1}(\tilde{X}Y)^{T}\\=\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b-\mu) & \sum_ {x_ b \in c_2} (x_ b-\mu) & \ldots & \sum_ {x_ b \in c_c} (x_ b-\mu) \end{bmatrix}\begin{bmatrix} \frac{1}{n_ 1} & 0 & \ldots & 0 \\ 0 & \frac{1}{n_ 2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots &\frac{1}{n_ c}\end{bmatrix}\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b-\mu)^ T \\ \sum_ {x_ b \in c_2} (x_ b-\mu)^ T \\ \vdots \\ \sum_ {x_ b \in c_c} (x_ b-\mu)^ T \end{bmatrix}\\=\begin{bmatrix} \frac{1}{n_ 1}\sum_ {x_ b \in c_1} (x_ b-\mu) & \frac{1}{n_ 2}\sum_ {x_ b \in c_2} (x_ b-\mu) & \ldots & \frac{1}{n_ c}\sum_ {x_ b \in c_c} (x_ b-\mu) \end{bmatrix}\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b-\mu)^ T \\ \sum_ {x_ b \in c_2} (x_ b-\mu)^ T \\ \vdots \\ \sum_ {x_ b \in c_c} (x_ b-\mu)^ T \end{bmatrix}\\=\frac{1}{n_ 1}\sum_ {x_ b \in c_1} (x_ b-\mu)\sum_ {x_ b \in c_1} (x_ b-\mu) ^ T+\frac{1}{n_ 2}\sum_ {x_ b \in c_2} (x_ b-\mu)\sum_ {x_ b \in c_2} (x_ b-\mu) ^ T\\+\ldots+\frac{1}{n_ c}\sum_ {x_ b \in c_c} (x_ b-\mu)\sum_ {x_ b \in c_c} (x_ b-\mu) ^ T\\=\frac{n_ 1 ^ 2}{n_ 1}\sum_ {x_ b \in c_1} \frac{(x_ b-\mu)}{n_ 1}\sum_ {x_ b \in c_1} \frac{(x_ b-\mu)^ T}{n_ 1} +\frac{n_ 2 ^ 2}{n_ 2}\sum_ {x_ b \in c_2} \frac{(x_ b-\mu)}{n_ 2}\sum_ {x_ b \in c_2} \frac{(x_ b-\mu)^ T}{n_ 2}\\+\ldots+\frac{n_ c ^ 2}{n_ c}\sum_ {x_ b \in c_c} \frac{(x_ b-\mu)}{n_ c}\sum_ {x_ b \in c_c} \frac{(x_ b-\mu)^ T}{n_ c}\\=n_ 1(\mu_1 - \mu)(\mu_1 - \mu)^ T+n_ 2(\mu_2 - \mu)(\mu_2 - \mu)^ T+\ldots+n_ c(\mu_c - \mu)(\mu_c - \mu)^ T\\=\sum_ {i=1} ^ c n_ i(\mu_ i - \mu)(\mu_ i - \mu)^ T\\=S_{b}$

which completes the proof.
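Finally, Property 2 as a whole can be verified numerically, reusing the variables built up in the sketches above (`X`, `X_tilde`, `G`, `mu`, `labels`, `d` and `c`):

```python
# Numerical check of Property 2: S_b computed from class means
# equals X_tilde @ G @ G.T @ X_tilde.T.
S_b = np.zeros((d, d))
for i in range(c):
    n_i = int((labels == i).sum())                        # size of the i-th cluster
    mu_i = X[:, labels == i].mean(axis=1, keepdims=True)  # mean of the i-th cluster
    S_b += n_i * (mu_i - mu) @ (mu_i - mu).T

assert np.allclose(S_b, X_tilde @ G @ G.T @ X_tilde.T)
```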
