550.400 Mathematical Modeling and Consulting
Spring 2010
Naiman
Lecture notes
Lecture 1 - 1/25/10
Overview of the course, syllabus. Getting acquainted.
R installation
Getting started with R:
writing scripts and passing commands to the console
creating a vector of numbers, strings or TRUE/FALSE values
the objects() command
the history() command
copying the R icon to a folder
setting up the "Start in" folder
leaving R and returning
organizing projects in folders
Lecture 2 - 1/27/10
Reading assignment: Read Chapters 1 & 2 in Introduction to R by Friday, and Chapters 3 & 4 by Monday
Some statistical concepts discussed:
Properties of data vs. properties of idealized model that describes how data are generated
empirical cdf for data
quantile function for data
Statistical model: data come from an exponential distribution:
cdf: F(x) = 1-exp(-lambda*x) for x>0
pdf: f(x) = lambda*exp(-lambda*x) for x>0
fitting based on data by estimating lambda: lambda-hat=1/mean of data
If we sample a distribution F to get X1,X2,.... then F(X1),F(X2),... should look like a Uniform(0,1) sample
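A minimal R sketch of the exponential example above, assuming simulated data (the rate 2 and the sample size 1000 are arbitrary choices, not values from the lecture):

```r
# Simulate data from an exponential distribution (rate chosen arbitrarily)
set.seed(1)
x <- rexp(1000, rate = 2)

# Fit the model: lambda-hat = 1 / mean of data
lambda_hat <- 1 / mean(x)

# Fitted cdf F(x) = 1 - exp(-lambda*x), evaluated at the data
u <- 1 - exp(-lambda_hat * x)

# Largest gap between the empirical cdf and the fitted cdf at the data points
gap <- max(abs(ecdf(x)(x) - u))

# If the model fits, u should look like a Uniform(0,1) sample
summary(u)
```

Calling hist(u) afterwards should give a roughly flat histogram when the exponential model is adequate.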
Working with R
- working directory
- getwd()
- setwd()
- dir()
- source()
- getting help
- ?
- help.search() & apropos()
- http://rseek.org/
- using the R editor to write scripts and run commands
- objects
- listing
- creating
- removing
- typeof
- saving workspace
- recalling previous commands
- up/down arrow keys
- Editing the Rconsole file: on my Windows 7 system to be found at C:\Program Files (x86)\R\R-2.10.1\etc
- arithmetic operations on vectors
- transformations of vectors
- exp()
- sqrt()
- log()
- combining vectors
- random number generation
- rnorm()
- runif()
- internal datasets
- sunspots
- elementary data summaries
- sum()
- mean()
- var(x)
- sd(x)
- max(x)
- quantile(x)
- hist(x)
- plot(x)
Lecture 3 1/29/10
The sample() command
The (non-parametric) bootstrap method for determining sampling variability for an estimator (illustrated with an exponential rate parameter estimate).
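One way the bootstrap for the rate estimate might look in R. The data here are simulated and the number of resamples B is an arbitrary choice; this is a sketch, not the script used in class:

```r
set.seed(2)
x <- rexp(50, rate = 1)      # hypothetical data set
B <- 2000                    # number of bootstrap resamples (arbitrary)

lambda_star <- numeric(B)
for (b in 1:B) {
  xb <- sample(x, replace = TRUE)   # resample the data with replacement
  lambda_star[b] <- 1 / mean(xb)    # re-estimate the rate on the resample
}

boot_se <- sd(lambda_star)                       # bootstrap standard error
boot_ci <- quantile(lambda_star, c(0.025, 0.975)) # percentile interval
```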
Working with R
- lists
- regular sequences
- seq() and :
- rep()
- parameters in functions - e.g. seq(), hist(), rep()
- logical
- logic for individual values
- logic for vectors
- missing values
- coding missing values
- checking for missing with is.na()
- the na.rm option
- names in vectors
- selecting elements of a vector using x[v], v a vector & assignment
- v a vector of positive integers
- v a vector of negative integers
- v a logical vector
- names of vector elements
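A short illustration of missing values and the subscripting styles listed above, using a made-up named vector:

```r
x <- c(a = 5, b = NA, c = 12, d = 7)   # NA codes a missing value

is.na(x)                     # logical vector flagging the missing entry
m <- mean(x, na.rm = TRUE)   # na.rm = TRUE drops missing values first

x[c(1, 3)]                   # positive integers: keep elements 1 and 3
x[-2]                        # negative integers: drop element 2
big <- x[!is.na(x) & x > 6]  # logical vector: keep non-missing values above 6
x["d"]                       # selection by name

x[is.na(x)] <- 0             # assignment through a subscript
```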
Lecture 4 2/1/10
Reading assignment: Read Chapters 6 & 7 by Friday 2/5/10
Working with R
- is
- is.vector()
- is.character()
- is.logical()
- is.numeric()
- is.list()
- more about lists
- [[ ]] vs [ ] for lists
- abbreviating names in lists
- adding an element to a list
- method 1: using [[ ]]
- method 2: using [ ]
- concatenating lists using c()
- matrices
- creating
- subscripting
- dimension of
- picking out rows & columns
- submatrices
- transposing
- data frames
- creating from a list of columns
- the read.table() command
- from a file
- from the internet
- read.csv()
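The list, matrix and data frame operations above can be sketched as follows. All object names and values are invented for illustration, and the file is a temporary one written only so that read.csv() has something to read:

```r
# Lists: [[ ]] returns the element, [ ] returns a sub-list
p <- list(name = "Ann", scores = c(90, 85))
p[["scores"]]
p["scores"]
p$sc                        # $ allows unambiguous abbreviation of names
p[["age"]] <- 30            # adding an element, method 1
p["city"] <- "Baltimore"    # adding an element, method 2
both <- c(p, list(year = 2010))   # concatenating lists

# Matrices
m <- matrix(1:6, nrow = 2)  # filled column by column
dim(m)
m[2, ]                      # second row
m[, 3]                      # third column
m[, 2:3]                    # submatrix
tm <- t(m)                  # transpose

# Data frames: from a list of columns, then a round trip through a file
d <- data.frame(height = c(60, 65), weight = c(110, 140))
tmp <- tempfile(fileext = ".csv")
write.csv(d, tmp, row.names = FALSE)
d2 <- read.csv(tmp)
```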
Lecture 5 2/3/10
Working with R
- search path
- search()
- attach() detach()
- categorical data
- factors
- the table() command & cross-tabulation
- reading the bible
- more on the scan() command
- strsplit()
- unlist()
- more on subscripting - the Monte-Carlo method
- a conditional probability - the truncated normal distribution
- a conditional expectation
- normal distribution
- rnorm()
- qnorm()
- dnorm()
- pnorm()
- QQplots & qqnorm()
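A sketch of the Monte-Carlo idea for the truncated normal, assuming the conditioning event Z > 1 (the thresholds 1 and 2 are chosen here only for illustration). Subscripting with a logical vector keeps the draws that satisfy the condition:

```r
set.seed(3)
z <- rnorm(100000)
tail_z <- z[z > 1]                 # draws from the normal truncated to (1, Inf)

# Conditional probability P(Z > 2 | Z > 1): Monte-Carlo vs exact
p_mc    <- mean(tail_z > 2)
p_exact <- (1 - pnorm(2)) / (1 - pnorm(1))

# Conditional expectation E[Z | Z > 1]: Monte-Carlo vs exact
e_mc    <- mean(tail_z)
e_exact <- dnorm(1) / (1 - pnorm(1))
```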
Lecture 6 2/5/10
Lecture 7 2/15/10
Reading assignment - Chapters 7 & 8 by Wednesday
Quiz Friday - sample questions will be posted on Wednesday
pop data for states (link to Wikipedia)
.csv file - after importing into Excel, copy & pasting table, exporting to .csv file
tax data for states (link to Wikipedia)
.csv file - after importing into Excel, copy & pasting table, exporting to .csv file
- merging data frames - merge() with applications
- simple example height/weight data
- state population vs taxes
- NHANES - merging lab & demographic data
- substrings using the substr() command
- converting strings to numeric values with as.numeric()
- plotting all pairs using the pairs() command
- boxplot()
- two sample tests
- t.test()
- wilcox.test()
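A small sketch of merge(), substr(), as.numeric() and the two-sample tests. The data frames, codes and samples below are invented; they are not the state or NHANES data used in class:

```r
# Merging two data frames on a common key (only ids present in both are kept)
heights <- data.frame(id = c(1, 2, 3), height = c(60, 65, 70))
weights <- data.frame(id = c(2, 3, 4), weight = c(140, 160, 180))
both <- merge(heights, weights, by = "id")

# Substrings and conversion to numeric
code  <- c("MD2010", "VA2009")
state <- substr(code, 1, 2)
year  <- as.numeric(substr(code, 3, 6))

# Two-sample tests on simulated samples
set.seed(4)
a <- rnorm(30)
b <- rnorm(30, mean = 1)
p_t <- t.test(a, b)$p.value
p_w <- wilcox.test(a, b)$p.value
```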
Lecture 8 2/17/10
Lecture 9 2/19/10
Linear models continued
- the no-intercept specification
- multiple predictor variables
- including quadratic terms
- what is an additive model?
- what are interactions?
- categorical variables as predictors
- converting to factor type using as.factor()
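The model specifications listed above map onto R formulas roughly as follows. The data are simulated and the variable names are placeholders:

```r
set.seed(5)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
g  <- as.factor(sample(c("a", "b", "c"), n, replace = TRUE))  # categorical predictor
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.1)

fit_noint <- lm(y ~ x1 - 1)         # no-intercept specification
fit_add   <- lm(y ~ x1 + x2)        # additive model, two predictors
fit_quad  <- lm(y ~ x1 + I(x1^2))   # quadratic term, protected by I()
fit_int   <- lm(y ~ x1 * x2)        # main effects plus the x1:x2 interaction
fit_fac   <- lm(y ~ x1 + g)         # factor expands into indicator columns
```

In an additive model the effect of x1 does not depend on x2; the x1:x2 coefficient in fit_int measures the departure from that.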
Lecture 10 2/22/10
Notes on normal distributions & the linear model - pdf
Univariate normal distribution
- mean and variance
- if X has a normal distribution, what is the distribution of aX+b?
- how the chi-square distribution arises
Bivariate normal distribution
- mean vector & covariance matrix
- linear combinations have univariate normal distributions
- how to sample the bivariate normal distribution
- how the chi-square distribution arises
Multivariate normal distribution
The linear model in matrix form
- mean and variance
- linear combinations have univariate normal distributions
- how the chi-square distribution arises
Are the Pearson Father and Son data sampled from a bivariate normal distribution?
Example of a heavy-tailed distribution (ratio of normals with mean 0) to show that not all distributions can be assumed to have means
Testing whether any variables help to predict workers' wages - leading up to the F-test
Lecture 11 2/24/10
Notes on normal distributions & the linear model (stuff added) - pdf
Multivariate normal distribution properties
F distribution
Linear model in matrix form
The F-test
Doing the matrix calculations performed by lm() in R
F-test in R
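A sketch of the matrix calculations behind lm() and the overall F-test, on simulated data. Solving the normal equations directly is used here for transparency; lm() itself works through a QR decomposition:

```r
set.seed(6)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)

X <- cbind(1, x1, x2)                          # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # least squares via normal equations

rss1 <- sum((y - X %*% beta_hat)^2)            # residual sum of squares, full model
rss0 <- sum((y - mean(y))^2)                   # residual sum of squares, intercept only
p <- ncol(X)

# F statistic for "no predictor helps" and its p-value
F_stat <- ((rss0 - rss1) / (p - 1)) / (rss1 / (n - p))
p_val  <- pf(F_stat, p - 1, n - p, lower.tail = FALSE)

# The same quantities from lm()
fit <- lm(y ~ x1 + x2)
summary(fit)$fstatistic
```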
Lecture 12 2/26/10
Paul Maiste (Lityx) - Monday Lecture
Notes on R-squared & Interactions
What is R squared?
Writing your own R functions
What are interactions for pairs of categorical variables?
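As an illustration of writing your own function, here is one possible R-squared function, checked against lm() on R's built-in cars data. The function name r_squared is made up for this sketch:

```r
# R squared = 1 - (residual sum of squares) / (total sum of squares)
r_squared <- function(y, fitted) {
  1 - sum((y - fitted)^2) / sum((y - mean(y))^2)
}

fit <- lm(dist ~ speed, data = cars)
r2_hand <- r_squared(cars$dist, fitted(fit))
r2_lm   <- summary(fit)$r.squared
```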
Lecture 13 3/1/10
Paul Maiste
Lecture 14 3/3/10
Handwritten notes (pdf) on Akaike's information criterion & maximum likelihood
Applications of looping in R
Stepwise regression
Akaike's information criterion
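A brief sketch of AIC-based stepwise regression using R's step() function on the built-in mtcars data. The choice of data set and predictors is an assumption made for illustration:

```r
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward stepwise selection by AIC (trace = 0 suppresses the printed steps)
chosen <- step(full, trace = 0)

AIC(full)
AIC(chosen)
formula(chosen)
```

step() stops when dropping any remaining term would no longer lower the criterion.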
Lecture 15 3/5/10
Projects
Implementing leave-one-out cross-validation
The which() command
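Leave-one-out cross-validation can be implemented with a loop along these lines. The built-in cars data and the simple linear model are stand-ins, not the project data:

```r
n <- nrow(cars)
err <- numeric(n)

for (i in 1:n) {
  fit_i  <- lm(dist ~ speed, data = cars[-i, ])             # fit without observation i
  err[i] <- cars$dist[i] - predict(fit_i, newdata = cars[i, ])  # predict the held-out case
}

cv_mse <- mean(err^2)                          # cross-validated mean squared error
worst  <- which(abs(err) == max(abs(err)))     # which() returns the matching indices
```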
Lecture 16 3/8/10
Quiz on Wednesday
Notes on logistic regression, likelihood for logistic regression, parametric bootstrap
Analysis of students' snail ring predictions
The order() command
Modeling binary response data
Logit model
Maximum likelihood
Structure of log-likelihood for logit model
R's glm function
Parametric bootstrap for sampling variability
Simulating binary response data
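A sketch tying together simulation of binary responses, the logit log-likelihood, glm() and the parametric bootstrap. The coefficients -0.5 and 1.5, the sample size and the number of resamples are arbitrary:

```r
set.seed(7)
n <- 200
x <- rnorm(n)
p <- 1 / (1 + exp(-(-0.5 + 1.5 * x)))   # logit model: log(p/(1-p)) = -0.5 + 1.5 x
y <- rbinom(n, 1, p)                    # simulated binary responses

fit <- glm(y ~ x, family = binomial)    # maximum likelihood fit

# Log-likelihood at the fitted coefficients: sum of y*eta - log(1 + exp(eta))
eta    <- coef(fit)[1] + coef(fit)[2] * x
loglik <- sum(y * eta - log(1 + exp(eta)))

# Parametric bootstrap: simulate new responses from the fitted model and refit
B <- 200
p_hat <- fitted(fit)
slope_star <- numeric(B)
for (b in 1:B) {
  yb <- rbinom(n, 1, p_hat)
  slope_star[b] <- coef(glm(yb ~ x, family = binomial))[2]
}
slope_se <- sd(slope_star)              # bootstrap standard error of the slope
```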
Lecture 17 3/10/2010
GLM logistic regression output
Fitted values - two types
Residuals
GLM predictions
Separating training and testing
Stepwise fitting
Lecture 18 3/12/2010
Susan Wierman, Executive Director
Mid-Atlantic Regional Air Management Association
Lecture 19 3/22/2010
Handwritten notes on ROC and nonparametric bootstrap
ROC curves
sapply, tapply, lapply
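An ROC curve can be computed with sapply() roughly as follows. The scores are simulated stand-ins for a classifier's output:

```r
set.seed(8)
truth <- rep(c(0, 1), each = 100)
score <- c(rnorm(100, mean = 0), rnorm(100, mean = 1))   # hypothetical classifier scores

# Classify as positive when score >= cutoff, for every observed cutoff
cuts <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(cuts, function(cut) mean(score[truth == 1] >= cut))  # true positive rate
fpr <- sapply(cuts, function(cut) mean(score[truth == 0] >= cut))  # false positive rate

# plot(fpr, tpr, type = "s") draws the curve; abline(0, 1) adds the chance line
```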
Lecture 20 3/24/2010
Carl Liggio - Analyzing Power Prices - notes were emailed
Lecture 21 3/26/2010
Classification trees
ROC curve comparison
adding a legend()
Various applications of sapply()
applying a function to a vector
vector output
arguments function
nonparametric bootstrap
Lecture 22 3/29/2010
Lecture 23 3/31/2010
Quiz Friday
Handwritten notes on survival analysis
Survival functions
Exponential survival distributions
Estimation of the survival curve when there is no censoring
Censored data
Probabilities as products of conditional probabilities
Kaplan-Meier estimator
R's aml dataset
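The Kaplan-Meier estimator is a product of conditional survival probabilities, one factor per event time. A hand computation on a small invented data set (not the aml data):

```r
time   <- c(2, 3, 3, 5, 7, 8, 8, 10)   # hypothetical follow-up times
status <- c(1, 1, 0, 1, 0, 1, 1, 0)    # 1 = event observed, 0 = censored

event_times <- sort(unique(time[status == 1]))
at_risk <- sapply(event_times, function(s) sum(time >= s))              # still under observation
deaths  <- sapply(event_times, function(s) sum(time == s & status == 1))

# S(t) = product over event times s <= t of (1 - deaths / at risk)
surv <- cumprod(1 - deaths / at_risk)
km <- data.frame(event_times, at_risk, deaths, surv)
```

For these data the estimated survival probabilities at times 2, 3, 5 and 8 are 0.875, 0.75, 0.6 and 0.2.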
Lecture 24 4/2/2010
R's survival package
Surv() function and objects it creates
The survfit() function
Lecture 25 4/5/2010
Quiz #5 - Monday April 12th
Cox's proportional hazards model
The Rossi criminal recidivism dataset (.csv file)
Stepwise Cox proportional hazards modeling in R
Cox Proportional-Hazards Regression for Survival Data by John Fox
Lecture 26 4/7/2010
Guidelines for end of semester projects:
Looking for a report with the following components:
1) Introduction to the problem - goals
2) Description of data including source
3) Statistical methodology implemented
4) Report on findings w/ text, tables & figures
5) Discussion of strengths and weaknesses of approach
6) Conclusions
7) References
8) Summary of contribution of each team member
Grading will be based on evidence of:
1) Creativity
2) Clarity of presentation
3) Thoroughness of investigation
4) Thoughtfulness of investigation
5) Effort expended
6) Critical thinking
The data() command
Quantile regression
Quantiles as solutions to a minimization problem
Regression quantiles
Analysis of the Engel food expenditure data
Quantiles as solutions to a minimization problem & quantile regression
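A sketch of a sample quantile as the minimizer of the check (pinball) loss, using base R's optimize(). The function name check_loss and the simulated data are assumptions for illustration:

```r
# Check loss: residuals above q are weighted by tau, those below by (1 - tau)
check_loss <- function(q, data, tau) {
  r <- data - q
  sum(r * (tau - (r < 0)))
}

set.seed(9)
x   <- rexp(501)
tau <- 0.25

opt   <- optimize(check_loss, interval = range(x), data = x, tau = tau)
q_min <- opt$minimum                         # numerical minimizer
q_emp <- quantile(x, tau, type = 1)[[1]]     # inverse of the empirical cdf
```

Quantile regression replaces the single number q by a linear function of the predictors and minimizes the same loss.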
Lecture 27 4/9/2010
Reminder: quiz Monday
Nonparametric regression kernel smoothing
Why we smooth
Predicting temperature on a given day in the future
Finding the best window size
Smoothing in R
nonparametric regression kernels
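A hand-rolled kernel smoother with a Gaussian kernel, plus a leave-one-out search for the window size. The data, the helper names nw and loocv, and the grid of window sizes are all invented for this sketch:

```r
set.seed(10)
x <- seq(0, 1, length.out = 200)
truth <- sin(2 * pi * x)
y <- truth + rnorm(200, sd = 0.3)

# Kernel-weighted average of ys at the point x0, with window size h
nw <- function(x0, xs, ys, h) {
  w <- dnorm((xs - x0) / h)
  sum(w * ys) / sum(w)
}

# Leave-one-out squared prediction error for a given window size
loocv <- function(h) {
  mean(sapply(seq_along(x), function(i) (y[i] - nw(x[i], x[-i], y[-i], h))^2))
}

hs <- c(0.01, 0.02, 0.05, 0.1, 0.2)
cv <- sapply(hs, loocv)
h_best <- hs[which.min(cv)]

smooth <- sapply(x, function(x0) nw(x0, x, y, h_best))
```

Base R also provides ksmooth() and loess() for smoothing; their bandwidth parameters are scaled differently from h here.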
Lecture 28 4/12/2010
Reading sampled lines from large datasets
Autoregressive models
Moving average models
Lecture 29 4/14/2010
Time series models (continuation of written notes from last lecture)
Moving average models
Autocovariance/Autocorrelation functions (ACFs)
ACF of a moving average model
Fitting AR, MA models
ARMA models
Writing formatted output using cat() and sink() functions
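A sketch comparing the sample and theoretical ACF of an MA(1) model, fitting it with arima(), and writing a formatted line with cat() and sink(). The coefficient 0.6 and the series length are arbitrary:

```r
set.seed(11)
theta <- 0.6
x <- arima.sim(model = list(ma = theta), n = 5000)   # simulate an MA(1) series

r      <- acf(x, lag.max = 5, plot = FALSE)$acf[2:6]  # sample ACF at lags 1..5
theory <- ARMAacf(ma = theta, lag.max = 5)[2:6]       # theoretical ACF at lags 1..5

fit <- arima(x, order = c(0, 0, 1))                   # fit an MA(1)

# Formatted output: sink() diverts printed output to a file until sink() is called again
out <- tempfile(fileext = ".txt")
sink(out)
cat(sprintf("lag 1: sample %.3f, theory %.3f\n", r[1], theory[1]))
sink()
written <- readLines(out)
```

For an MA(1) the theoretical autocorrelation is theta/(1 + theta^2) at lag 1 and zero at higher lags.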
Lecture 30 4/16/2010
Time series models (continued from last lecture)
Differencing for stationarity
Returns for financial series
ARIMA models
Definition
Fitting
Prediction
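A sketch of differencing, fitting and prediction for an ARIMA(1,1,0) model. The series is simulated by cumulatively summing an AR(1), which is an assumption made so that the true structure is known:

```r
set.seed(12)
ar1 <- arima.sim(model = list(ar = 0.5), n = 500)   # stationary AR(1)
y   <- cumsum(ar1)                                  # integrating makes it non-stationary

d <- diff(y)                          # differencing recovers a stationary series
# For a positive price series, diff(log(price)) gives log returns

fit <- arima(y, order = c(1, 1, 0))   # ARIMA(1,1,0): AR(1) on the first differences
fc  <- predict(fit, n.ahead = 10)     # forecasts and their standard errors
fc$pred
fc$se
```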
Lecture 31 4/19/2010
Time series models (continued) - AR(1) autocorrelation function
Bootstrapping ARIMA models
Time Series in the Frequency Domain
Lecture 32 4/21/2010
Discrete Fourier Transforms (continued)
Explaining the Fast Fourier Transform (FFT)
What do periodicities look like?
What happens when we transform white noise?
Linearity and superposition
Plotting a complex series
Why we look at real & imaginary parts
The periodogram
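A small sketch of the discrete Fourier transform of a pure periodicity, the periodogram, and linearity. The series length 128 and the frequencies 10 and 20 are arbitrary, and the periodogram scaling (squared modulus divided by n) is one common convention:

```r
n  <- 128
tt <- 0:(n - 1)
x  <- cos(2 * pi * 10 * tt / n)     # exactly 10 cycles over the series
y  <- sin(2 * pi * 20 * tt / n)     # exactly 20 cycles

X <- fft(x)                          # complex-valued transform
Re(X)[1:3]; Im(X)[1:3]               # real and imaginary parts

pgram <- Mod(X)^2 / n                # periodogram
peak  <- which.max(pgram[1:(n / 2)]) - 1   # entry j of fft() is frequency (j - 1)/n

# Linearity / superposition: the transform of a sum is the sum of the transforms
lin_ok <- isTRUE(all.equal(fft(x + y), fft(x) + fft(y)))
```

Here peak equals 10, the number of cycles in x.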
Lecture 33 4/23/2010
More properties of discrete Fourier transforms
Spectrum of an AR(1)
Smoothing to estimate a continuous spectrum
Multivariate autoregression models using the mAr library
Sampling using mAr.sim()
Estimation using mAr.est()
Lecture 34 4/26/2010
R script for Lecture 34 (part 1 - mAR analysis of dow+treasuries)
R script for Lecture 34 (part 2 - cluster analysis)
Cluster analysis with hclust()
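A minimal hclust() sketch on the built-in mtcars data. The choice of variables, the average linkage and the cut into 3 groups are assumptions for illustration, not the analysis in the lecture script:

```r
vars <- scale(mtcars[, c("mpg", "hp", "wt")])   # standardize so no variable dominates
d    <- dist(vars)                               # Euclidean distances between cars
hc   <- hclust(d, method = "average")            # agglomerative clustering

groups <- cutree(hc, k = 3)                      # cut the tree into 3 clusters
table(groups)
# plot(hc) draws the dendrogram
```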
Lecture 35 4/28/2010
Introduction to spatial statistics
John Snow & the 1854 cholera outbreak
Baddeley's analyzing point patterns in R (200 pages)
Homogeneous Poisson point processes (complete spatial randomness)
Inhomogeneous Poisson point processes
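Both kinds of Poisson process can be simulated on the unit square in base R. The intensity values below are made up, and the inhomogeneous pattern is obtained by thinning, which is one standard construction:

```r
set.seed(13)
lambda_max <- 100

# Homogeneous: Poisson number of points, then uniform locations
n <- rpois(1, lambda_max)        # unit square has area 1
x <- runif(n)
y <- runif(n)

# Inhomogeneous with intensity lambda_max * x: keep each point
# with probability intensity / lambda_max = x
keep <- runif(n) < x
xi <- x[keep]
yi <- y[keep]
# plot(x, y) and plot(xi, yi) show the two patterns
```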
Lecture 36 4/30/2010
Lecture 37 5/3/2010
Baltimore homicide data
Creating a ppp object with a polygonal boundary
Image objects
Density plots
Perspective plots
Contour plots
Fitting a homogeneous Poisson point process
Fitting an inhomogeneous Poisson point process
Lecture 38 5/5/2010
Operations on windows
intersections
unions
complements
Tessellations
Dirichlet/Voronoi tessellation
Delaunay tessellation
An analog of quadrat sampling with a covariate
Fitting a ppp with covariates
Inference for parameters via vcov
Generating realizations from the fitted model
Matern type I and type II processes
Lecture 39 5/7/2010
Future topics
input/output
- sink()
- cat()
- scan()
- ranking & sorting
- rank()
- sort()
- order()
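The three ranking and sorting functions differ in what they return, as this tiny example shows:

```r
x <- c(30, 10, 20)

s <- sort(x)    # the values in increasing order: 10 20 30
o <- order(x)   # the positions that would sort x: 2 3 1
r <- rank(x)    # the rank of each element: 3 1 2

x[o]            # indexing by order() reproduces sort(x)
```

order() is the one to use for sorting the rows of a data frame by a column, e.g. d[order(d$col), ] for a hypothetical data frame d.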