class: center, middle, inverse, title-slide

# STAT3622 Data Visualization (Lecture 12)

## Big Data Visualization
### Dr. Aijun Zhang
### The University of Hong Kong
### 28 April 2020

---
# Big Data Visualization

<img style="float: right; width: 360px; padding:0 0 0 50px;" src="SketchBigData.png">

Big data often come with two distinct features:

- Big n (number of observations): large-scale
- Large p (number of features): high-dimensional

<br>

Challenging tasks arise when exploring big data in the context of unsupervised learning:

- Large-scale clustering
- Dimension reduction (when p is large)

---
# Two BigDataViz Approaches

<img style="float: right; width: 350px; padding:0 10px 0 0;" src="tSNE_SubDSF.png">

In this lecture, we discuss two different approaches to big data visualization:

- Dimension Reduction
  - PCA-based K-means Clustering
  - t-SNE Visualization
- Subsampled Data Exploration

---
class: center, middle

# 1: Dimension Reduction

### From PCA to t-SNE

---
# Principal Component Analysis

PCA projects the data onto a new coordinate system such that the greatest variance lies along the first coordinate (i.e. the first principal component), the second greatest variance along the second principal component, and so on.

<img src="PCA_wiki.png" width="600px" style="display: block; margin: auto;" />
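---
# Principal Component Analysis

Formally, let `\(S\)` denote the sample covariance matrix of the centered data. The first principal direction solves

`$$w_1 = \arg\max_{\|w\|=1} w^\top S w,$$`

whose solution is the leading eigenvector of `\(S\)`. Each subsequent component maximizes the same objective subject to orthogonality with the earlier directions, and its eigenvalue equals the variance it explains.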
---
# Principal Component Analysis

.pull-left[
```r
DataX = data.frame(x1 = iris$Sepal.Length,
                   x2 = iris$Petal.Width)
(tmp = prcomp(DataX))
```

```
## Standard deviations (1, .., p=2):
## [1] 1.0734371 0.3382787
## 
## Rotation (n x k) = (2 x 2):
##          PC1        PC2
## x1 0.7419133 -0.6704958
## x2 0.6704958  0.7419133
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---
# PCA-based K-means Clustering

```r
pr = prcomp(iris[,1:4])  # Use all four variables
par(mfrow=c(1,2), mar=rep(4,4))
barplot(pr$sdev^2/sum(pr$sdev^2), xlab="PC", ylab="Percentage",
        main="Variance Explained")
PCx = pr$x[,1:2]; kk = 3; set.seed(1)
fit = kmeans(PCx, kk)  # K-means on the first two PC scores
plot(PCx[,1], PCx[,2], pch=19, cex=1, col=seq(2,1+kk)[fit$cluster],
     xlab="PC1", ylab="PC2", main="K-means on PC scores")
points(fit$centers, pch=10, cex=2, col=seq(2,1+kk))  # mark cluster centers
```

<img src="index_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
# t-Distributed Stochastic Neighbor Embedding

- Consider a Gaussian distribution around `\(x_i\)` with a given variance `\(\sigma_i^2\)`, where `\(\sigma_i\)` is smaller for points in dense areas than for points in sparse areas.

- Similarity index: `\(p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}\)`, based on the conditional similarity between data points:

<img src="Formula_tSNE.png" width="300px" style="display: block; margin: auto;" />

<!-- `$$p_{j|i} = \frac{\exp\left(-\left| x_i - x_j\right|^2 \big/ 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\left| x_i - x_k\right|^2 \big/ 2\sigma_i^2\right)}$$` -->

- In the low-dimensional map, t-SNE uses the heavy-tailed Student's t distribution with one degree of freedom (i.e. the Cauchy distribution) instead of a Gaussian distribution.

- The t-SNE algorithm determines the locations of the map points in 2D by minimizing the Kullback-Leibler divergence via gradient descent, as written out on the next slide:

<img src="Formula_KLdiv.png" width="300px" style="display: block; margin: auto;" />
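---
# t-Distributed Stochastic Neighbor Embedding

For reference, with `\(y_i\)` denoting the low-dimensional map point of `\(x_i\)`, the map similarities use the Cauchy kernel, and gradient descent minimizes the Kullback-Leibler divergence between the two similarity distributions (the standard objective of van der Maaten and Hinton, 2008):

`$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \|y_k - y_l\|^2\right)^{-1}}$$`

`$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log\frac{p_{ij}}{q_{ij}}$$`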
<img src="Formula_KLdiv.png" width="300px" style="display: block; margin: auto;" /> --- # t-SNE illustrated <img src="tSNE_animation.gif" width="420px" style="display: block; margin: auto;" /> Source: [An illustrated introduction to the t-SNE algorithm](https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm) See also: [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/) --- ```r library(Rtsne) iris_unique <- unique(iris) # Remove duplicates iris_matrix <- as.matrix(iris_unique[,1:4]) par(mar=c(2,2,2,2)); set.seed(1) tsne_out <- Rtsne(iris_matrix,pca=FALSE,perplexity=30,theta=0.0) # Run TSNE plot(tsne_out$Y, col=as.numeric(iris_unique$Species)+1) ``` <img src="index_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- # t-SNE on a large MNIST data with 60K sample - Refer to STAT3612-Lecture 12 [MNIST Case Study](http://www.statsoft.org/wp-content/uploads/2018Stat3612/Lecture12_CaseMNIST/Lecture12_CaseMNIST.html) about the data background. <img src="TSNE_MNIST.png" width="600px" style="display: block; margin: auto;" /> --- class: center, middle # 2: Subsampled Data Exploration <br> Click for a [recent presentation](20181110DSD_Nankai.pdf) --- class: center, middle # Thank you! Q&A or Email ajzhang@umich.edu。