$\ell_{2,1}$-Norm Regularized Discriminative Feature Selection for Unsupervised Learning

Derivation of some formulas

Posted by Kwan on March 24, 2020

1. Notations

According to the paper, the notations used are summarized as follows: denote $\mathcal{X}=\left \lbrace x_{1}, x_{2}, \ldots, x_{n}\right\rbrace$ as the training set, where $x_{i} \in \Bbb{R}^{d}\,(1 \leq i \leq n)$ is the $i$-th datum and $n$ is the total number of training data; $I$ is the identity matrix; for a constant $m$, ${\bf{1}}_ {m}\in\Bbb{R}^{m}$ is a column vector with all of its elements equal to 1, and $H_{m}=I-\frac{1}{m} {\bf{1}}_ {m} {\bf{1}}_ {m}^{T} \in \Bbb{R}^{m \times m}$ is the corresponding centering matrix. Suppose the $n$ training data $x_{1}, x_{2}, \ldots, x_{n}$ are sampled from $c$ classes and there are $n_i$ samples in the $i$-th class. Then $y_{i} \in\lbrace0,1\rbrace^{c \times 1}\,(1 \leq i \leq n)$ is the label vector of $x_i$: its $j$-th element is 1 if $x_i$ belongs to the $j$-th class, and 0 otherwise. $Y=\begin{bmatrix}y_{1}& y_{2}& \ldots& y_{n}\end{bmatrix}^{T} \in\lbrace0,1\rbrace^{n \times c}$ is the label matrix, where $c$ is also the total number of clusters.
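To make the notation concrete, here is a minimal NumPy sketch that builds these objects for a toy problem; the dimensions $d$, $n$, $c$ and the cluster assignments are assumed purely for illustration (the later checks in this post reuse these variables):

```python
import numpy as np

# Toy dimensions, assumed only for illustration: d features, n samples, c clusters.
d, n, c = 4, 10, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(d, n))                        # data matrix; column i is x_i
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])  # cluster of each sample (all clusters non-empty)

ones_n = np.ones((n, 1))                           # the all-ones vector 1_n
H_n = np.eye(n) - (ones_n @ ones_n.T) / n          # centering matrix H_n = I - (1/n) 1_n 1_n^T

Y = np.zeros((n, c))                               # label matrix Y in {0,1}^{n x c}
Y[np.arange(n), labels] = 1                        # row i has a single 1 in column labels[i]
```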

2. Total scatter matrix

Property 1.

Total scatter matrix $S_{t}=\sum_{i=1}^{n}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{T}=\tilde{X} \tilde{X}^{T}$, where $\mu$ is the mean of all samples, and $\tilde{X}=X H_{n}$ is the data matrix after being centered.

Proof:

Since $H_{n}=I-\frac{1}{n} {\bf{1}}_ {n} {\bf{1}}_ {n}^{T}=I-\frac{1}{n} \begin{bmatrix}1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\\vdots & \vdots & \ddots & \vdots \\1 & 1 & \cdots & 1 \end{bmatrix}$, then $XH_ n=X-\frac{1}{n}X\begin{bmatrix}1 & 1 & \cdots & 1 \\1 & 1 & \cdots & 1 \\\vdots & \vdots & \ddots & \vdots \\1 & 1 & \cdots & 1 \end{bmatrix}=X-\frac{1}{n}\begin{bmatrix}x_1^1 & x_2^1 & \cdots & x_n^1 \\x_1^2 & x_2^2 & \cdots & x_n^2 \\\vdots & \vdots & \ddots & \vdots \\x_1^d & x_2^d & \cdots & x_n^d \end{bmatrix}\begin{bmatrix}1 & 1 & \cdots & 1\\1 & 1 & \cdots & 1\\\vdots & \vdots & \ddots & \vdots\\1 & 1 & \cdots & 1\end{bmatrix}\\ =X-\begin{bmatrix} \frac{1}{n}\sum_{i=1}^n x_i^1 & \frac{1}{n}\sum_{i=1}^n x_i^1 & \ldots & \frac{1}{n}\sum_{i=1}^n x_i^1 \\ \frac{1}{n}\sum_{i=1}^n x_i^2 & \frac{1}{n}\sum_{i=1}^n x_i^2 & \ldots & \frac{1}{n}\sum_{i=1}^n x_i^2 \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{n}\sum_{i=1}^n x_i^d & \frac{1}{n}\sum_{i=1}^n x_i^d & \ldots & \frac{1}{n}\sum_{i=1}^n x_i^d\end{bmatrix}=X-\it{M}$

Here, $x_i^j$ stands for the $j$-th element of the $i$-th datum. Note that all the columns of the matrix $\it{M}$ are identical, and each of them equals the mean $\mu$. We can therefore rewrite $\tilde{X}=\begin{bmatrix} (x_1-\mu) & (x_2-\mu) & \ldots & (x_n-\mu) \end{bmatrix}$, and hence we have:

\[S_{t}=\sum_{i=1}^{n}\left(x_{i}-\mu\right)\left(x_{i}-\mu\right)^{T}=\begin{bmatrix} (x_1-\mu) & (x_2-\mu) & \ldots & (x_n-\mu) \end{bmatrix}\begin{bmatrix} (x_1-\mu)^T \\\\ (x_2-\mu)^T \\\\ \vdots \\\\ (x_n-\mu)^T \end{bmatrix}=\tilde{X} \tilde{X}^{T},\]

which completes the proof.
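As a sanity check, the identity can also be verified numerically; this continues the toy NumPy sketch from the Notations section:

```python
# Numerical check of Property 1, reusing X, H_n and n from the sketch above.
mu = X.mean(axis=1, keepdims=True)     # global mean of all samples, shape (d, 1)
X_tilde = X @ H_n                      # centered data matrix

# X H_n subtracts the mean from every column ...
assert np.allclose(X_tilde, X - mu)

# ... and the sum of outer products equals X_tilde @ X_tilde.T.
S_t = sum(np.outer(X[:, i] - mu[:, 0], X[:, i] - mu[:, 0]) for i in range(n))
assert np.allclose(S_t, X_tilde @ X_tilde.T)
```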

3. Between class scatter matrix

Property 2.

Between class scatter matrix $S_{b}=\sum_{i=1}^{c} n_{i}\left(\mu_{i}-\mu\right)\left(\mu_{i}-\mu\right)^{T}=\tilde{X} G G^{T} \tilde{X}^{T}$, where $\mu_i$ is the mean of samples in the $i$-th class, and $G = \begin{bmatrix}G_{1} & \ldots & G_{n}\end{bmatrix}^{T}=Y\left(Y^{T} Y\right)^{-1 / 2}$ is the scaled label matrix.

Proof:

We first substitute $G =Y\left(Y^{T} Y\right)^{-1 / 2}$ into the formula, and rewrite $S_{b}=\tilde{X} Y\left(Y^{T} Y\right)^{-1 / 2}\left(\left(Y^{T} Y\right)^{-1 / 2}\right)^{T}Y^{T} \tilde{X}^{T}=\tilde{X} Y\left(Y^{T} Y\right)^{-1}Y^{T} \tilde{X}^{T}=\tilde{X} Y\left(Y^{T} Y\right)^{-1}(\tilde{X}Y)^{T}$, where the second equality holds because $Y^{T} Y$ is symmetric (in fact diagonal, as shown in part 2 below). We now divide the formula into two parts and analyze them respectively.
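Before splitting the formula, the identity $GG^{T}=Y(Y^{T}Y)^{-1}Y^{T}$ itself can be checked numerically, continuing the toy sketch above (since $Y^{T}Y$ is diagonal, its inverse square root is just the element-wise reciprocal square root of its diagonal):

```python
# Numerical check that G G^T = Y (Y^T Y)^{-1} Y^T, reusing Y from the sketch above.
YtY = Y.T @ Y                                   # c x c, diagonal (proved in part 2 below)
G = Y @ np.diag(1.0 / np.sqrt(np.diag(YtY)))    # G = Y (Y^T Y)^{-1/2}
assert np.allclose(G @ G.T, Y @ np.linalg.inv(YtY) @ Y.T)
```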

1. $\tilde{X} Y$ part.

According to Section 2, we have $\tilde{X}=\begin{bmatrix} (x_1-\mu) & (x_2-\mu) & \ldots & (x_n-\mu) \end{bmatrix}$; hence, $\tilde{X} Y=\begin{bmatrix} (x_1-\mu) & (x_2-\mu) & \ldots & (x_n-\mu) \end{bmatrix}\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{bmatrix}$. Since the column vector $y_i$ encodes the assigned cluster of the $i$-th datum, $y_i^T$ is a row vector whose elements are all 0s except a single 1 at the index of the cluster to which $x_i$ belongs. So, consider two rows $y_i^T,~y_j^T$ in the matrix $\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{bmatrix}$: if $x_i$ and $x_j$ both belong to the $o$-th cluster $c_o$, the two rows are identical, with their 1s at the $o$-th index, i.e., in the $o$-th column of the matrix. Furthermore, considering all the samples that belong to the $o$-th cluster $c_o$, they all contribute their 1s to the $o$-th column of the matrix, while the samples that do not belong to $c_o$ contribute 0s to that column. Now let's switch perspective to the columns of the matrix, $\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \\ \end{bmatrix}=\begin{bmatrix} y^1 & y^2 & \ldots & y^c \end{bmatrix}$: the $j$-th column $y^j$ is a vector of length $n$, and it is the indicator vector of cluster $j$, marking all the samples that belong to the $j$-th cluster. That is, if three samples $x_d,~x_e,~x_f$ belong to the $j$-th cluster, then the $d$-th, $e$-th and $f$-th elements of $y^j$ are 1, and all the other elements are 0. From this point of view, the matrix $\tilde{X} Y$ has a special meaning. Writing $\mu^i$ for the $i$-th element of the global mean $\mu$ (to avoid a clash with the class mean $\mu_i$), consider the element $(\tilde{X} Y)_ {ij}=\begin{bmatrix} (x_ 1^i-\mu^ i) & (x_ 2^i-\mu^ i) & \ldots & (x_ n^i-\mu^ i) \end{bmatrix}y^ j$; recalling that $y^ j$ indicates the samples which belong to the $j$-th cluster $c_j$, we get $(\tilde{X} Y)_ {ij}=\sum_ {x_ b \in c_j}(x_ b^i-\mu^ i)$. Extending this to the whole matrix, we obtain $\tilde{X} Y=\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b^1-\mu^ 1) & \sum_ {x_ b \in c_2} (x_ b^1-\mu^ 1) & \ldots & \sum_ {x_ b \in c_c} (x_ b^1-\mu^ 1) \\ \sum_ {x_ b \in c_1} (x_ b^2-\mu^ 2) & \sum_ {x_ b \in c_2} (x_ b^2-\mu^ 2) & \ldots & \sum_ {x_ b \in c_c} (x_ b^2-\mu^ 2) \\ \vdots & \vdots & \ddots & \vdots \\ \sum_ {x_ b \in c_1} (x_ b^d-\mu^ d) & \sum_ {x_ b \in c_2} (x_ b^d-\mu^ d) & \ldots & \sum_ {x_ b \in c_c} (x_ b^d-\mu^ d)\end{bmatrix}\\=\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b-\mu) & \sum_ {x_ b \in c_2} (x_ b-\mu) & \ldots & \sum_ {x_ b \in c_c} (x_ b-\mu) \end{bmatrix} \in \Bbb{R}^{d \times c}.$
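Numerically (continuing the sketch above), each column of $\tilde{X}Y$ indeed equals the sum of the centered samples of the corresponding cluster:

```python
# Numerical check: column j of X_tilde @ Y sums the centered samples of cluster j.
XtY = X_tilde @ Y
for j in range(c):
    cluster_centered = X[:, labels == j] - mu   # centered samples of the j-th cluster
    assert np.allclose(XtY[:, j], cluster_centered.sum(axis=1))
```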

2. $\left(Y^{T} Y\right)^{-1}$ part.

We first consider the matrix $Y^{T} Y$. Since $Y=\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{bmatrix}$, we have $Y^{T} Y=\begin{bmatrix} y_ 1 & y_ 2 & \ldots & y_ n \end{bmatrix}\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_n^T \end{bmatrix}=\sum_ {i=1}^n y_ i y_i^T$. Consider a term $y_ i y_i^T$ in the summation. Since $y_i$ encodes the index of the cluster that sample $x_ i$ belongs to, i.e., if $x_ i$ belongs to the $j$-th cluster then $y_ i^j=1$ and $y_ i^k=0$ for all $k \neq j$, $y_i$ is the vector $\begin{bmatrix} 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}^T$ with its single 1 at the $j$-th position. The outer product of $y_ i$ with itself, $y_ i y_i^T$, is therefore a matrix with only one non-zero element, of value 1, at position $(j,j)$ (necessarily on the diagonal), and all 0s elsewhere. Now consider two outer products $y_ a y_a^T$ and $y_ b y_b^T$: if the samples $x_ a$ and $x_ b$ belong to the same cluster, say the $l$-th cluster, then $y_ a y_a^T+y_ b y_b^T$ is a matrix with only one non-zero element, of value 2, at position $(l,l)$, and all 0s elsewhere. Furthermore, taking all the samples of the $l$-th cluster together, the sum of the outer products of their label vectors, $\sum_ {x_b \in c_ l}y_ b y_ b ^T$, is a matrix whose only non-zero element, of value $\sum_ {x_b \in c_ l}1$ (i.e., the size of the $l$-th cluster), sits at position $(l,l)$. From here we can easily extend the argument to the total sum of the outer products of the label vectors of all the samples: $Y^{T} Y=\sum_ {i=1}^n y_ i y_i^T =\sum_ {x_b \in c_ 1}y_ b y_ b ^T+\sum_ {x_b \in c_ 2}y_ b y_ b ^T+\dots+\sum_ {x_b \in c_ c}y_ b y_ b ^T\\=\begin{bmatrix} \sum_ {x_ b \in c_1} 1 & 0 & \ldots & 0 \\ 0 & \sum_ {x_ b \in c_2} 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sum_ {x_ b \in c_c} 1\end{bmatrix}=\begin{bmatrix} n_ 1 & 0 & \ldots & 0 \\ 0 & n_ 2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & n_ c\end{bmatrix}.$

Here, $n_i=\sum_ {x_ b \in c_i} 1$ denotes the size of the $i$-th cluster. Since $Y^{T} Y$ is a diagonal matrix (and invertible as long as every cluster is non-empty, so that each $n_i>0$), its inverse $(Y^{T} Y)^ {-1}$ is also a diagonal matrix, and $(Y^{T} Y)^ {-1}=\begin{bmatrix} \frac{1}{n_ 1} & 0 & \ldots & 0 \\ 0 & \frac{1}{n_ 2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots &\frac{1}{n_ c}\end{bmatrix}$.
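A quick numerical confirmation, again continuing the sketch above:

```python
# Numerical check: Y^T Y = diag(n_1, ..., n_c) and its inverse is diag(1/n_i).
sizes = np.bincount(labels, minlength=c)        # cluster sizes n_i
assert np.allclose(Y.T @ Y, np.diag(sizes))
assert np.allclose(np.linalg.inv(Y.T @ Y), np.diag(1.0 / sizes))
```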

3. Aggregation

Since $\tilde{X} Y=\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b-\mu) & \sum_ {x_ b \in c_2} (x_ b-\mu) & \ldots & \sum_ {x_ b \in c_c} (x_ b-\mu) \end{bmatrix} $ and $(Y^{T} Y)^ {-1}=\begin{bmatrix} \frac{1}{n_ 1} & 0 & \ldots & 0 \\ 0 & \frac{1}{n_ 2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots &\frac{1}{n_ c}\end{bmatrix}$, then,

$\tilde{X} Y\left(Y^{T} Y\right)^{-1}(\tilde{X}Y)^{T}\\=\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b-\mu) & \sum_ {x_ b \in c_2} (x_ b-\mu) & \ldots & \sum_ {x_ b \in c_c} (x_ b-\mu) \end{bmatrix}\begin{bmatrix} \frac{1}{n_ 1} & 0 & \ldots & 0 \\ 0 & \frac{1}{n_ 2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots &\frac{1}{n_ c}\end{bmatrix}\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b-\mu)^ T \\ \sum_ {x_ b \in c_2} (x_ b-\mu)^ T \\ \vdots \\ \sum_ {x_ b \in c_c} (x_ b-\mu)^ T \end{bmatrix}\\=\begin{bmatrix} \frac{1}{n_ 1}\sum_ {x_ b \in c_1} (x_ b-\mu) & \frac{1}{n_ 2}\sum_ {x_ b \in c_2} (x_ b-\mu) & \ldots & \frac{1}{n_ c}\sum_ {x_ b \in c_c} (x_ b-\mu) \end{bmatrix}\begin{bmatrix} \sum_ {x_ b \in c_1} (x_ b-\mu)^ T \\ \sum_ {x_ b \in c_2} (x_ b-\mu)^ T \\ \vdots \\ \sum_ {x_ b \in c_c} (x_ b-\mu)^ T \end{bmatrix}\\=\frac{1}{n_ 1}\sum_ {x_ b \in c_1} (x_ b-\mu)\sum_ {x_ b \in c_1} (x_ b-\mu) ^ T+\frac{1}{n_ 2}\sum_ {x_ b \in c_2} (x_ b-\mu)\sum_ {x_ b \in c_2} (x_ b-\mu) ^ T\\+\ldots+\frac{1}{n_ c}\sum_ {x_ b \in c_c} (x_ b-\mu)\sum_ {x_ b \in c_c} (x_ b-\mu) ^ T\\=\frac{n_ 1 ^ 2}{n_ 1}\sum_ {x_ b \in c_1} \frac{(x_ b-\mu)}{n_ 1}\sum_ {x_ b \in c_1} \frac{(x_ b-\mu)^ T}{n_ 1} +\frac{n_ 2 ^ 2}{n_ 2}\sum_ {x_ b \in c_2} \frac{(x_ b-\mu)}{n_ 2}\sum_ {x_ b \in c_2} \frac{(x_ b-\mu)^ T}{n_ 2}\\+\ldots+\frac{n_ c ^ 2}{n_ c}\sum_ {x_ b \in c_c} \frac{(x_ b-\mu)}{n_ c}\sum_ {x_ b \in c_c} \frac{(x_ b-\mu)^ T}{n_ c}\\=n_ 1(\mu_1 - \mu)(\mu_1 - \mu)^ T+n_ 2(\mu_2 - \mu)(\mu_2 - \mu)^ T+\ldots+n_ c(\mu_c - \mu)(\mu_c - \mu)^ T\\=\sum_ {i=1} ^ c n_ i(\mu_ i - \mu)(\mu_ i - \mu)^ T\\=S_{b}$

which completes the proof.
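Finally, Property 2 as a whole can be verified numerically, reusing the variables built up in the sketches above (`X`, `X_tilde`, `G`, `mu`, `labels`, `d` and `c`):

```python
# Numerical check of Property 2: S_b computed from class means
# equals X_tilde @ G @ G.T @ X_tilde.T.
S_b = np.zeros((d, d))
for i in range(c):
    n_i = int((labels == i).sum())                        # size of the i-th cluster
    mu_i = X[:, labels == i].mean(axis=1, keepdims=True)  # mean of the i-th cluster
    S_b += n_i * (mu_i - mu) @ (mu_i - mu).T

assert np.allclose(S_b, X_tilde @ G @ G.T @ X_tilde.T)
```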
