class: center, middle, inverse, title-slide

# STAT3622 Data Visualization (Lecture 12)

## Big Data Visualization
### Dr. Aijun Zhang
### The University of Hong Kong
### 28 April 2020

---
# Big Data Visualization

<img style="float: right; width: 360px; padding:0 0 0 50px;" src="SketchBigData.png">

Big data often come with two distinct features:

- Big n (number of observations): large-scale
- Large p (number of features): high-dimensional

<br>

Challenging tasks arise when exploring big data in the context of unsupervised learning:

- Large-scale clustering
- Dimension reduction (when p is large)

---
# Two BigDataViz Approaches

<img style="float: right; width: 350px; padding:0 10px 0 0;" src="tSNE_SubDSF.png">

In this lecture, we discuss two different approaches to big data visualization:

- Dimension Reduction
  - PCA-based K-means Clustering
  - t-SNE Visualization
- Subsampled Data Exploration

---
class: center, middle

# 1: Dimension Reduction

### From PCA to t-SNE

---
# Principal Component Analysis

PCA projects the data onto a new coordinate system such that the greatest variance lies along the first coordinate (i.e. the first principal component), the second greatest variance along the second principal component, and so on.

<img src="PCA_wiki.png" width="600px" style="display: block; margin: auto;" />
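---
# Principal Component Analysis

Formally, let `\(S\)` denote the sample covariance matrix of the centered data. The first principal direction solves

`$$w_1 = \arg\max_{\|w\|=1} w^\top S w,$$`

whose solution is the leading eigenvector of `\(S\)`. Each subsequent component maximizes the same objective subject to orthogonality with the earlier directions, and its eigenvalue equals the variance it explains.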
---
# Principal Component Analysis

.pull-left[
```r
DataX = data.frame(x1 = iris$Sepal.Length,
                   x2 = iris$Petal.Width)
(tmp = prcomp(DataX))
```

```
## Standard deviations (1, .., p=2):
## [1] 1.0734371 0.3382787
## 
## Rotation (n x k) = (2 x 2):
##          PC1        PC2
## x1 0.7419133 -0.6704958
## x2 0.6704958  0.7419133
```
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---
# PCA-based K-means Clustering

```r
pr = prcomp(iris[,1:4])  # Use all four variables
par(mfrow=c(1,2), mar=rep(4,4))
barplot(pr$sdev^2/sum(pr$sdev^2), xlab="PC", ylab="Percentage",
        main="Variance Explained")
PCx = pr$x[,1:2]; kk = 3; set.seed(1)
fit = kmeans(PCx, kk)  # K-means on the first two PC scores
plot(PCx[,1], PCx[,2], pch=19, cex=1, col=seq(2,1+kk)[fit$cluster],
     xlab="PC1", ylab="PC2", main="K-means on PC scores")
points(fit$centers, pch=10, cex=2, col=seq(2,1+kk))  # mark cluster centers
```

<img src="index_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
# t-Distributed Stochastic Neighbor Embedding

- Consider a Gaussian distribution around `\(x_i\)` with a given variance `\(\sigma_i^2\)`, where `\(\sigma_i\)` is smaller for points in dense areas than for points in sparse areas.

- Similarity index: `\(p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}\)`, based on the conditional similarity between data points:

<img src="Formula_tSNE.png" width="300px" style="display: block; margin: auto;" />

<!-- `$$p_{j|i} = \frac{\exp\left(-\left| x_i - x_j\right|^2 \big/ 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\left| x_i - x_k\right|^2 \big/ 2\sigma_i^2\right)}$$` -->

- In the low-dimensional map, t-SNE uses the heavy-tailed Student's t distribution with one degree of freedom (i.e. the Cauchy distribution) instead of a Gaussian distribution.

- The t-SNE algorithm determines the locations of the map points in 2D by minimizing the Kullback-Leibler divergence via gradient descent, as written out on the next slide:

<img src="Formula_KLdiv.png" width="300px" style="display: block; margin: auto;" />
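---
# t-Distributed Stochastic Neighbor Embedding

For reference, with `\(y_i\)` denoting the low-dimensional map point of `\(x_i\)`, the map similarities use the Cauchy kernel, and gradient descent minimizes the Kullback-Leibler divergence between the two similarity distributions (the standard objective of van der Maaten and Hinton, 2008):

`$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \|y_k - y_l\|^2\right)^{-1}}$$`

`$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log\frac{p_{ij}}{q_{ij}}$$`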
<img src="Formula_KLdiv.png" width="300px" style="display: block; margin: auto;" /> --- # t-SNE illustrated <img src="tSNE_animation.gif" width="420px" style="display: block; margin: auto;" /> Source: [An illustrated introduction to the t-SNE algorithm](https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm) See also: [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/) --- ```r library(Rtsne) iris_unique <- unique(iris) # Remove duplicates iris_matrix <- as.matrix(iris_unique[,1:4]) par(mar=c(2,2,2,2)); set.seed(1) tsne_out <- Rtsne(iris_matrix,pca=FALSE,perplexity=30,theta=0.0) # Run TSNE plot(tsne_out$Y, col=as.numeric(iris_unique$Species)+1) ``` <img src="index_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- # t-SNE on a large MNIST data with 60K sample - Refer to STAT3612-Lecture 12 [MNIST Case Study](http://www.statsoft.org/wp-content/uploads/2018Stat3612/Lecture12_CaseMNIST/Lecture12_CaseMNIST.html) about the data background. <img src="TSNE_MNIST.png" width="600px" style="display: block; margin: auto;" /> --- class: center, middle # 2: Subsampled Data Exploration <br> Click for a [recent presentation](20181110DSD_Nankai.pdf) --- class: center, middle # Thank you! Q&A or Email ajzhang@umich.edu。