Course in
Data Analysis and Exploration
Diary of the course 2012/13

  • 17/9 Introduction to the course. Prerequisites in statistics. Introduction to the use of R. Script used in class.
  • 19/9 Vectors, matrices, strings, data frames. Script used in class. Data files to be used in class: pressione and cathedral.
  • 24/9 Using data frames. Simple graphics. Probability distributions. Script used in class.
  • 26/9 Graphical approaches to univariate and bivariate data. An introduction to estimation and confidence intervals. Script used in class.
  • 1/10 Mathematical ideas and methods in least squares linear regression. See short notes on the mathematics of linear models. The content is basically that of Chapter 2 of the book by Faraway.
  • 3/10 Linear models implemented in R. Script used in class with examples of simple regression, analysis of variance, and multiple regression.
  • 8/10 Quick summary of theory and practice of hypothesis testing.
  • 10/10 Confidence intervals and power of tests on simulated data. Analysis of covariance. Script used in class.
  • 17/10 2-way analysis of variance with and without interaction. Polynomial regression. Introduction to model selection. Script used in class.
  • 22/10 Aims of regression models: understanding and/or predicting. When hypothesis testing is appropriate. Prediction error: the idea of cross-validation. Model selection by using a criterion: adjusted R-squared; Akaike criterion (AIC): a quick motivation, use in linear models, and possible modifications (BIC and AICc). Selection methods applied to the state dataset. Some slides (thanks to Samantha Riccadonna, FBK) on model selection. See an introductory article on AIC and BIC theory.
  • 24/10 More diagnostics on the output of linear models: leverage, studentized residuals. Weighted least squares. Testing for correlation in residuals. How to save graphical output. Script used in class.
  • 29/10 Principal component analysis. Mathematical basis of the method. Script used in class.
  • 31/10 Other examples of principal component analysis. Biplot, regression on principal component scores. Script used in class. Data files used: pollution in US cities.
  • Part of the course used for Statistical methods ends here
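
As an illustration of the criterion-based model selection discussed on 22/10, here is a minimal sketch in R. It is not the script used in class. It assumes that the "state dataset" is the built-in state.x77 matrix from the datasets package, with life expectancy as the response; the names states, full, small, crit and sel are made up for this example.

```r
# Assumption: the "state dataset" is the built-in state.x77 matrix.
# data.frame() turns the column names "Life Exp" and "HS Grad"
# into the syntactic names Life.Exp and HS.Grad.
states <- data.frame(state.x77)

# Two candidate linear models for life expectancy.
full  <- lm(Life.Exp ~ ., data = states)
small <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = states)

n <- nrow(states)

# Collect the selection criteria for one fitted model.
crit <- function(m) {
  k <- length(coef(m)) + 1          # regression coefficients plus error variance
  c(adjR2 = summary(m)$adj.r.squared,
    AIC   = AIC(m),
    BIC   = BIC(m),
    AICc  = AIC(m) + 2 * k * (k + 1) / (n - k - 1))  # small-sample correction
}

# One row per model; lower AIC, BIC and AICc are better, higher adjusted R-squared is better.
print(round(rbind(full = crit(full), small = crit(small)), 3))

# Backward stepwise selection driven by AIC, starting from the full model.
sel <- step(full, direction = "backward", trace = 0)
print(formula(sel))
```

The criteria do not have to agree with each other: BIC penalizes each extra parameter by log(n) rather than 2, so with n = 50 it tends to favour smaller models than AIC.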

  • 5/11 Theory of generalized linear models (a very concise treatment, including implementation in R, is in Chapter 7 of Modern Applied Statistics with S by Venables and Ripley, one of the reserved books for the course; the standard reference is McCullagh and Nelder "Generalized linear models", Chapman and Hall, available at the central university library). See a short summary of generalized linear models.
  • 7/11 Logistic regression in R using the command glm. Script used in class.
  • 14/11 Diagnostics on logistic regression. Generalized linear models using Poisson and negative binomial distributions. Script used in class. A short note on the negative binomial distribution.
  • 19/11 Multidimensional scaling: theory and examples. Use of minimum spanning tree. Script used in class. Data files that can be useful: air distances between US airports, coordinates of US airports, morphological distances between populations of voles (Arvicola spp.).
  • 21/11 Exercises on multidimensional scaling and simulation of bivariate random numbers. The last part has been improved relative to what was shown in class.
  • 26/11 Correspondence analysis; basic theory: see notes (in Italian) to be updated. Script used in class.
  • 28/11 Exercises on correspondence and multiple correspondence analysis. Something still to be improved.
  • 3/12 Basic theory on linear discriminant analysis: see notes (in Italian) to be updated.
  • 5/12 Examples of use of linear discriminant analysis.
  • 10/12 Another example of use of linear discriminant analysis.
  • 12/12 Some examples of commands for producing graphs, including some from the "lattice" package.
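
For the sessions of 7/11 and 14/11, the following is a minimal sketch of logistic regression with the command glm, followed by a basic diagnostic. It is not the script used in class. The built-in mtcars data set and the names fit, p_hat and dres are assumptions chosen only to keep the example self-contained.

```r
# Assumption: mtcars (built into R) stands in for the course data.
# Response am: 0 = automatic, 1 = manual transmission; predictor wt: weight.
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# Coefficients are on the log-odds scale.
print(summary(fit)$coefficients)

# Predicted probabilities of a manual transmission at two weights.
p_hat <- predict(fit, newdata = data.frame(wt = c(2.5, 3.5)), type = "response")
print(p_hat)

# Deviance residuals: their squares add up to the residual deviance.
dres <- residuals(fit, type = "deviance")
print(c(sum_sq_dres = sum(dres^2), deviance = deviance(fit)))

# Likelihood ratio test of the model against the intercept-only model.
print(anova(fit, test = "Chisq"))
```

Changing the family argument to poisson gives a Poisson regression with the same call structure; for the negative binomial case one common option is glm.nb from the MASS package.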