class: center, middle, inverse, title-slide

# STAT3622 Data Visualization (Lecture 2)
## Exploratory Data Analysis
### Dr. Aijun Zhang
The University of Hong Kong
### 4 February 2020

---
# What's covered in this lecture?

<img style="float: right; width: 300px; padding:10px 100px 0 0;" src="LogoEDA.jpg">

I. Exploratory Data Analysis

  - John Tukey
  - Exploratory Data Analysis

II. Simple Base Graphics

  - Iris Dataset
  - Basic R Plots

III. Using R:Lattice Package

  - Conditioning and Grouping
  - Cloud and Level Plots

---
class: center, middle

# I. Exploratory Data Analysis

---
# Review: Data Science Workflow

<img src="DSWorkflow.jpg" width="66%" style="display: block; margin: auto;" />

---
# Roles of Data Visualization

- Role 1: Exploratory data analysis (pre-analysis stage);
- Role 2: Visual presentation of results (post-analysis stage).
- John W. Tukey (1977; Exploratory Data Analysis): “The greatest value of a picture is when it forces us to notice what we never expected to see.”

.pull-left[
<img src="JohnTukey.png" width="47%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="JohnTukeyEDA.jpg" width="40%" style="display: block; margin: auto;" />
]

---
# John Tukey (1915-2000)

<img style="float: left; width: 300px; margin-top: 20px; margin-right:80px; " src="JohnTukey.png">

- Proposed “Exploratory Data Analysis”
- Coined terms: Boxplot, Stem-and-Leaf plot, ANOVA (Analysis of Variance)
- Coined terms “Bit” and “Software”
- Co-developed Fast Fourier Transform algorithm, Projection Pursuit, Jackknife estimation
- Famous quote: “The best thing about being a statistician is that you get to play in everyone's backyard.”
- https://en.wikipedia.org/wiki/John_Tukey

---
# John Tukey: The Future of Data Analysis (1962)

<img src="JohnTukey1962.png" width="800px" style="display: block; margin: auto;" />

- Reference: [Donoho, David (2017). "50 Years of Data Science" in *JCGS*, **26**(4), 745-766.](https://www.tandfonline.com/doi/abs/10.1080/10618600.2017.1384734)

---
# John Tukey: Exploratory Data Analysis (1977)

<img style="float: left; width: 300px; margin-right:50px; " src="JohnTukeyEDA.jpg">
<img style="float: right; width: 380px; margin-left:10px; " src="JohnTukeyEDAplots.jpg">

- Five-number summary
- Stem-and-Leaf plot
- Scatter plot
- Boxplot, Outliers
- Residual plot
- Smoother
- Bag plot

---
# Example: Anscombe Dataset

```
## Anscombe Dataset:
```

<table>
<thead>
<tr> <th style="text-align:right;"> x1 </th> <th style="text-align:right;"> y1 </th> <th style="text-align:right;"> x2 </th> <th style="text-align:right;"> y2 </th> <th style="text-align:right;"> x3 </th> <th style="text-align:right;"> y3 </th> <th style="text-align:right;"> x4 </th> <th style="text-align:right;"> y4 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 8.04 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 9.14 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 7.46 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6.58 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6.95 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.14 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6.77 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 5.76 </td> </tr>
<tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 7.58 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 8.74 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 12.74 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 7.71 </td> </tr>
<tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8.81 </td> <td
style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8.77 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 7.11 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.84 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 8.33 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 9.26 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 7.81 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8.47 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 9.96 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 8.10 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 8.84 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 7.04 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7.24 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6.13 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6.08 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 5.25 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4.26 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 3.10 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 5.39 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 12.50 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 10.84 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 9.13 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 8.15 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 5.56 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td 
style="text-align:right;"> 4.82 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 7.26 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 6.42 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 7.91 </td> </tr>
<tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5.68 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 4.74 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 5.73 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6.89 </td> </tr>
</tbody>
</table>

Source: Anscombe, F. J. (1973). Graphs in statistical analysis. *The American Statistician*, **27**, 17-21.

---
# Example: Anscombe Dataset (Descriptive)

```
## Mean and standard deviation:
```

<table>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> x1 </th> <th style="text-align:right;"> y1 </th> <th style="text-align:right;"> x2 </th> <th style="text-align:right;"> y2 </th> <th style="text-align:right;"> x3 </th> <th style="text-align:right;"> y3 </th> <th style="text-align:right;"> x4 </th> <th style="text-align:right;"> y4 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> mean </td> <td style="text-align:right;"> 9.00 </td> <td style="text-align:right;"> 7.50 </td> <td style="text-align:right;"> 9.00 </td> <td style="text-align:right;"> 7.50 </td> <td style="text-align:right;"> 9.00 </td> <td style="text-align:right;"> 7.50 </td> <td style="text-align:right;"> 9.00 </td> <td style="text-align:right;"> 7.50 </td> </tr>
<tr> <td style="text-align:left;"> sd </td> <td style="text-align:right;"> 3.32 </td> <td style="text-align:right;"> 2.03 </td> <td style="text-align:right;"> 3.32 </td> <td style="text-align:right;"> 2.03 </td> <td style="text-align:right;"> 3.32 </td> <td style="text-align:right;"> 2.03 </td> <td style="text-align:right;"> 3.32 </td> <td style="text-align:right;"> 2.03 </td>
</tr>
</tbody>
</table>

```
## x-y correlation:
```

<table>
<thead>
<tr> <th style="text-align:right;"> rho1 </th> <th style="text-align:right;"> rho2 </th> <th style="text-align:right;"> rho3 </th> <th style="text-align:right;"> rho4 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 0.82 </td> <td style="text-align:right;"> 0.82 </td> <td style="text-align:right;"> 0.82 </td> <td style="text-align:right;"> 0.82 </td> </tr>
</tbody>
</table>

---
# Example: Anscombe Dataset (Graphic)

<img src="index_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
# Exploratory Data Analysis

EDA is a statistical approach to making sense of data using a variety of techniques (mostly graphical). It can help us

- Assess assumptions about the distributions of variables
- Identify relationships between variables
- Extract important variables
- Suggest appropriate models
- Detect problems in the collected data (e.g. outliers, missing data, measurement errors)

---
# Statistical Graphics

- **Univariate**
    * Histogram, Stem-and-Leaf, Dot, Q-Q, Density plots
    * Boxplot, Box-and-whisker
    * Bar, Pie, Polar, Waterfall charts

- **Bivariate**
    * XYplot, Line, Area, Scatter, Bubble charts

- **Trivariate**
    * 3D Scatter, Contour, Level/Heatmap, Surface plots

---
# Which Chart to Use?

<img src="EDA_ChartSuggestion.jpg" width="700px" style="display: block; margin: auto;" />

---
class: center, middle

# II. Simple Base Graphics

---
# Iris Dataset

<img src="IrisFlower.png" width="600px" style="display: block; margin: auto;" />

```r
DataX = iris  # see ?iris for the dataset description
str(DataX)
```

```
## 'data.frame': 150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

---

```r
dim(DataX)
```

```
## [1] 150   5
```

```r
head(DataX)  # use tail(DataX) for the last rows
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
```

```r
summary(DataX)
```

```
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
```

---
# Basic R Plots: Histogram and Density Plot

```r
x = DataX$Sepal.Length  # a continuous variable
par(mfrow=c(1,3))
hist(x, main='Histogram (Default)')
hist(x, breaks=20, col=5, main='More Bins and Coloring')
hist(x, breaks=20, freq=FALSE, main='Histogram plus Density Plot')  # freq=FALSE plots densities instead of counts
lines(density(x), col=2, lty=1, lwd=2)  # add the density curve
```

<img src="index_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---
# Basic R Plots: Boxplot

```r
par(mfrow=c(1,3))
boxplot(DataX$Sepal.Width, main='Boxplot of Sepal.Width')  # outliers are drawn as individual points
boxplot(DataX[,1:4], col=c(2,3,4,5), main='Side-by-side Boxplot')
boxplot(Sepal.Width~Species, DataX, col=c(6,7,8), main="Boxplot with Grouping")
```

<img src="index_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---
# Basic R Plots: Pie and Bar Charts

```r
DataX$Flag = DataX$Sepal.Length>5  # create a binary flag
par(mfrow=c(1,3))
pie(table(DataX$Species[DataX$Flag]), col=c(2,3,4))
barplot(table(DataX$Species[DataX$Flag]), col=c(5,6,7))
barplot(table(DataX$Species, DataX$Flag), col=c(5,6,7), beside=TRUE)
```

<img src="index_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;"
/>

---
# Relationship Between Variables

```r
x = DataX$Petal.Length; y = DataX$Petal.Width; z = DataX$Species
par(mfrow=c(1,2)); par(mar=c(4,4,1,4))
plot(x, y, xlab="Petal.Length", ylab="Petal.Width")
abline(coef(lm(y~x)), col=1, lty=2)  # least-squares line
plot(x, y, col=c(2,3,4)[z], pch=20, cex=2.0, xlab="Petal.Length", ylab="Petal.Width")
abline(lm(y~x), col=1, lty=2)
legend("topleft", levels(z), pch=20, col=c(2,3,4))
```

<img src="index_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---
# Pairwise Scatter Plot

.pull-left[

```r
plot(DataX, col=DataX$Species, main="Pairwise Scatter Plot")
```

<img src="index_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

.pull-right[

```r
pairs(DataX[,1:4], panel = panel.smooth, col = c(4,5,6)[DataX$Species], main="More Sophisticated")
```

<img src="index_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]

---
class: center, middle

# III. Using R:Lattice Package

---
# R:Lattice

<img style="float: left; width: 240px; margin-top: 20px; margin-right:80px; " src="BookLattice.png">

- Uses trellis graphs for multivariate data
- Multipanel conditioning and grouping
- Elegant high-level data visualization
- Covers most statistical charts
- Figures and code can be found at http://lmdvr.r-forge.r-project.org/
- However, plot customization is not so straightforward

---
# Univariate Distributions

```r
library(lattice); library(gridExtra)
p1 = histogram(DataX$Sepal.Length)
p2 = bwplot(DataX$Sepal.Length)
grid.arrange(p1, p2, ncol=2)  # place the two lattice plots side by side
```

<img src="index_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---
# Histogram with Conditioning

```r
histogram(data=DataX, ~Sepal.Length|Species, breaks=12, layout = c(3, 1))
```

<img src="index_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---
# Density plot with Grouping

```r
densityplot(data=DataX, ~Sepal.Length, groups=Species,
            plot.points=FALSE, auto.key=list(space="top", columns=3))
```

<img src="index_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---
# Boxplot with Grouping

```r
bwplot(data=DataX, Sepal.Width~Species)
```

<img src="index_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---
# Bivariate plot with Grouping

```r
xyplot(data=DataX, Sepal.Length ~ Petal.Length, groups = Species,
       type = c("p", "smooth", "g"),
       auto.key = list(space="top", columns=3))  # grouping
```

<img src="index_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---
# Bivariate plot with Conditioning

```r
xyplot(data=DataX, Sepal.Length ~ Petal.Length|Species,
       type=c("p", "smooth", "g"), layout=c(1,3))  # conditioning
```

<img src="index_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---
# Trivariate 3D Plot

```r
# z ~ x * y: three distinct variables, one per axis
cloud(data=DataX, Sepal.Length ~ Petal.Length * Petal.Width,
      groups = Species, auto.key = list(space="top", columns=3),
      panel.aspect = 0.8)
```

<img src="index_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---
# Trivariate Heatmap

```r
# Pairwise Euclidean distances on the two petal measurements
distMat = as.matrix(dist(DataX[,3:4]))
levelplot(distMat, colorkey = TRUE, col.regions = terrain.colors,
          scales = list(at=c(0,0), tck = c(0,0)),
          xlab="", ylab="", main="Levelplot of Pairwise Distance Matrix")
```

<img src="index_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---
class: center, middle

# Thank you!

Q&A or Email ajzhang@umich.edu
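
---
# Appendix: Reproducing the Anscombe Summaries

The Anscombe slides show the summary tables and the four-panel figure without code. The sketch below is one way to reproduce them, assuming the built-in `anscombe` data frame from R's `datasets` package (columns `x1`-`x4`, `y1`-`y4`). The names `pairs_xy`, `stats` and `rho` are illustrative, and this is not necessarily how the original slides were produced.

```r
# Assumption: the built-in 'anscombe' data frame (datasets package)
data(anscombe)

# Interleave columns as x1, y1, ..., x4, y4 to match the slide layout
pairs_xy <- c(rbind(paste0("x", 1:4), paste0("y", 1:4)))

# Mean and standard deviation of every column
stats <- rbind(mean = sapply(anscombe[, pairs_xy], mean),
               sd   = sapply(anscombe[, pairs_xy], sd))
print(round(stats, 2))

# Pearson correlation between x_i and y_i for each of the four sets
rho <- sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                   anscombe[[paste0("y", i)]]))
names(rho) <- paste0("rho", 1:4)
print(round(rho, 2))

# One scatter plot per set, each with its least-squares line
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, xlim = c(3, 20), ylim = c(2, 14), pch = 19,
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(lm(y ~ x), col = 2, lty = 2)
}
par(op)  # restore the previous graphics settings
```

The four sets agree on these summaries to two decimal places, yet the scatter plots look very different, which is the point of the example.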