**550.400 Mathematical Modeling and Consulting
Spring 2010
Naiman
Lecture notes**

Lecture 1 - 1/25/10

Overview of the course, syllabus. Getting acquainted.

R installation

Getting started with R:

writing scripts and passing commands to the console

creating a vector of numbers, strings or TRUE/FALSE values

the objects() command

the history() command

copying the R icon to a folder

setting up the "Start in" folder

leaving R and returning

organizing projects in folders

Lecture 2 - 1/27/10

Reading assignment: Read Chapters 1 & 2 in Introduction to R by Friday, and Chapters 3 & 4 by Monday

Some statistical concepts discussed:

Properties of data vs. properties of idealized model that describes how data are generated

empirical cdf for data

quantile function for data

Statistical model: data come from an exponential distribution:

cdf: F(x) = 1-exp(-lambda*x) for x>0

pdf: f(x) = lambda*exp(-lambda*x) for x>0

fitting based on data by estimating lambda: lambda-hat=1/mean of data

If we sample a distribution F to get X1,X2,.... then F(X1),F(X2),... should look like a Uniform(0,1) sample
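The exponential fit and the uniformity check above can be sketched in R as follows. This is a minimal illustration on simulated data; the seed, the sample size, and the "true" rate of 2 are assumptions made only for the example.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
lambda <- 2                      # assumed "true" rate used to simulate data
x <- rexp(1000, rate = lambda)   # simulated sample standing in for real data

lambda_hat <- 1 / mean(x)        # lambda-hat = 1 / mean of data

# Fitted cdf F(x) = 1 - exp(-lambda_hat * x), evaluated at the data
u <- 1 - exp(-lambda_hat * x)    # equivalent to pexp(x, rate = lambda_hat)

# If the exponential model is reasonable, u should look like a Uniform(0,1) sample
hist(u, main = "Fitted cdf applied to the data")
plot(ecdf(u))                    # empirical cdf of u
abline(0, 1, lty = 2)            # cdf of Uniform(0,1) for comparison
```

A histogram of `u` that is far from flat, or an empirical cdf that strays from the dashed line, suggests the exponential model does not describe the data well.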

Working with R

- working directory
  - getwd()
  - setwd()
  - dir()
  - source()
- getting help
  - ?
  - help.search() & apropos()
  - http://rseek.org/
- using the R editor to write scripts and run commands
- objects
  - listing
  - creating
  - removing
  - typeof
- saving workspace
- recalling previous commands
  - up/down arrow keys
- Editing the Rconsole file: on my Windows 7 system, found at C:\Program Files (x86)\R\R-2.10.1\etc
- arithmetic operations on vectors
- transformations of vectors
  - exp()
  - sqrt()
  - log()
- combining vectors
- random number generation
  - rnorm()
  - runif()
- internal datasets
  - sunspots
- elementary data summaries
  - sum()
  - mean()
  - var(x)
  - sd(x)
  - max(x)
  - quantile(x)
  - hist(x)
  - plot(x)

Lecture 3 1/29/10

The sample() command

The (non-parametric) bootstrap method for determining sampling variability for an estimator (illustrated with an exponential rate parameter estimate).
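A minimal sketch of the non-parametric bootstrap for the rate estimate 1/mean, built on sample(). The data here are simulated, and the number of resamples B is an arbitrary choice for illustration.

```r
set.seed(2)
x <- rexp(200, rate = 0.5)       # toy data; in practice x is the observed sample
B <- 2000                        # number of bootstrap resamples (arbitrary)

# Resample the data with replacement and recompute the estimator each time
boot_est <- replicate(B, {
  xstar <- sample(x, size = length(x), replace = TRUE)
  1 / mean(xstar)
})

lambda_hat <- 1 / mean(x)                        # estimate from the original data
boot_se <- sd(boot_est)                          # bootstrap standard error
boot_ci <- quantile(boot_est, c(0.025, 0.975))   # percentile interval

hist(boot_est, main = "Bootstrap distribution of the rate estimate")
```

The spread of `boot_est` approximates the sampling variability of the estimator without assuming the exponential model when resampling.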

Working with R

- lists
- regular sequences
  - seq() and :
  - rep()
- parameters in functions - e.g. seq(), hist(), rep()
- logical
  - logic for individual values
  - logic for vectors
- missing values
  - coding missing values
  - checking for missing with is.na()
  - the na.rm option
- names in vectors
- selecting elements of a vector using x[v], v a vector & assignment
  - v a vector of positive integers
  - v a vector of negative integers
  - v a logical vector
  - names of vector elements

Lecture 4 2/1/10

Reading assignment: Read Chapters 6 & 7 by Friday 2/5/10

Working with R

- is
  - is.vector()
  - is.character()
  - is.logical()
  - is.numeric()
  - is.list()
- more about lists
  - [[ ]] vs [ ] for lists
  - abbreviating names in lists
  - adding an element to a list
    - method 1: using [[ ]]
    - method 2: using [ ]
  - concatenating lists using c()
- matrices
  - creating
  - subscripting
  - dimension of
  - picking out rows & columns
  - submatrices
  - transposing
- data frames
  - creating from a list of columns
  - the read.table() command
    - from a file
    - from the internet
  - read.csv()

Lecture 5 2/3/10

Working with R

- search path
  - search()
  - attach() detach()
- categorical data
  - factors
  - the table() command & cross-tabulation
- reading the bible
  - more on the scan() command
  - strsplit()
  - unlist()
- more on subscripting - the Monte-Carlo method
  - a conditional probability - the truncated normal distribution
  - a conditional expectation
- normal distribution
  - rnorm()
  - qnorm()
  - dnorm()
  - pnorm()
  - QQplots & qqnorm()
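The Monte-Carlo use of subscripting listed above can be sketched as follows. The event {Z > 1} and the threshold 2 are illustrative choices, not taken from the lecture.

```r
set.seed(3)
z <- rnorm(1e5)                 # standard normal sample
keep <- z > 1                   # logical vector marking the event {Z > 1}

# Subscripting with the logical vector gives a sample from the truncated normal
p_cond <- mean(z[keep] > 2)     # Monte-Carlo estimate of P(Z > 2 | Z > 1)
e_cond <- mean(z[keep])         # Monte-Carlo estimate of E[Z | Z > 1]

# Exact values for comparison, using pnorm() and dnorm()
p_exact <- pnorm(2, lower.tail = FALSE) / pnorm(1, lower.tail = FALSE)
e_exact <- dnorm(1) / pnorm(1, lower.tail = FALSE)

# A QQ plot shows that the truncated sample is far from normal
qqnorm(z[keep])
qqline(z[keep])
```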

Lecture 6 2/5/10

- NHANES urinary heavy metals data
- installing the foreign library
- the read.xport() function
- the complete.cases() function
- correlation coefficients & correlation matrices
- prediction & using separate training and testing data
- heavy-tailed distributions
- log transformation
- lognormal distribution

- correlations
- pearson & best linear prediction

Lecture 7 2/15/10

Reading assignment - Chapters 7 & 8 by Wednesday

Quiz Friday - sample questions will be posted on Wednesday

pop data for states (link to Wikipedia)

.csv file - after importing into Excel, copy & pasting table, exporting to .csv file

tax data for states (link to Wikipedia)

.csv file - after importing into Excel, copy & pasting table, exporting to .csv file

- merging data frames - merge() with applications
  - simple example height/weight data
  - state population vs taxes
  - NHANES - merging lab & demographic data
- substrings using the substr() command
- converting strings to numeric values with as.numeric()
- plotting all pairs using the pairs() command
- boxplot()
- two sample tests
  - t.test()
  - wilcox.test()

Lecture 8 2/17/10

- the simple linear model assumptions
- Y[i]=alpha+beta*x[i]+e[i]
- e[i] ~ N(0,sigma^2) & independent
- alpha, beta, sigma unknown

- fitting simple linear models using LM<-lm(y~x)
- summary(LM)
- intercept and slope estimates for alpha and beta
- residual standard error estimates
- t-tests for hypotheses about alpha and beta
- fitted values
- residuals
- inspecting residuals vs. fitted values
- examining residuals for normality

- plot(LM)
- predict.lm()
- confidence intervals are for expected values of Y
- prediction intervals are for specific Y's

- summary(LM)
- adding to an existing plot
- add a line/curve using lines()
- adding text using text()
- axis labels xlab= ylab=
- plot title using main=

- putting multiple plots on the same page using mfrow=c(,)
- plot/graphics devices
- dev.list()
- windows()
- pdf() to create a pdf graphics device
- dev.off() to close the current graphics device, or a specified graphics device
- dev.cur() for the current graphics device
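A minimal sketch tying together the items above on simulated data. The parameter values (alpha = 1, beta = 2, sigma = 1.5) and the new x values are assumptions for the example.

```r
set.seed(4)
x <- runif(50, 0, 10)
y <- 1 + 2 * x + rnorm(50, sd = 1.5)   # simulated from alpha = 1, beta = 2, sigma = 1.5

LM <- lm(y ~ x)
summary(LM)                  # estimates, residual standard error, t-tests

new <- data.frame(x = c(2, 5, 8))
ci_mean <- predict(LM, newdata = new, interval = "confidence")  # for E[Y] at each x
pi_new  <- predict(LM, newdata = new, interval = "prediction")  # for a new Y at each x

par(mfrow = c(1, 2))         # two plots on the same page
plot(x, y, xlab = "x", ylab = "y", main = "Data and fitted line")
lines(sort(x), fitted(LM)[order(x)])          # add the fitted line to the plot
plot(fitted(LM), resid(LM), xlab = "fitted values", ylab = "residuals",
     main = "Residuals vs. fitted")
```

At every x the prediction interval is wider than the confidence interval, because it also accounts for the error term of the individual observation.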

Lecture 9 2/19/10

Linear models continued

- the no-intercept specification
- multiple predictor variables
- including quadratic terms
- what is an additive model?
- what are interactions?
- categorical variables as predictors

- converting to factor type using as.factor()
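The model specifications listed above correspond to the following formula syntax. The data are simulated and the variable names (x1, x2, g) are hypothetical.

```r
set.seed(5)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
g <- as.factor(sample(c("a", "b", "c"), n, replace = TRUE))  # categorical predictor
y <- 1 + x1 - x2 + 0.5 * x1 * x2 + rnorm(n)

fit_noint <- lm(y ~ x1 - 1)          # no-intercept specification
fit_add   <- lm(y ~ x1 + x2)         # additive model in x1 and x2
fit_quad  <- lm(y ~ x1 + I(x1^2))    # quadratic term, protected by I()
fit_inter <- lm(y ~ x1 * x2)         # expands to x1 + x2 + x1:x2 (interaction)
fit_fac   <- lm(y ~ x1 + g)          # factor enters as indicator variables

names(coef(fit_inter))
names(coef(fit_fac))
```

In an additive model the effect of x1 does not depend on x2; the `x1:x2` coefficient measures the departure from that.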

Lecture 10 2/22/10

Notes on normal distributions & the linear model - pdf

Univariate normal distribution

- mean and variance
- if X has a normal distribution, what is the distribution of aX+b?
- how the chi-square distribution arises

Bivariate normal distribution

- mean vector & covariance matrix
- linear combinations have univariate normal distributions
- how to sample the bivariate normal distribution
- how the chi-square distribution arises

Multivariate normal distribution

The linear model in matrix form

- mean and variance
- linear combinations have univariate normal distributions
- how the chi-square distribution arises

Are the Pearson Father and Son data sampled from a bivariate normal distribution?

Example of a heavy-tailed distribution (ratio of normals with mean 0) to show that not all distributions can be assumed to have means

Testing whether any variables help to predict workers' wages - leading up to the F-test

Lecture 11 2/24/10

Notes on normal distributions & the linear model (stuff added) - pdf

Multivariate normal distribution properties

F distribution

Linear model in matrix form

The F-test

Doing the matrix calculations performed by lm() in R

F-test in R
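A sketch of the matrix calculations and the overall F-test, checked against lm(). The data are simulated; the null hypothesis tested is that neither predictor helps, so the reduced model is intercept-only.

```r
set.seed(6)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 1.5 * x1 + rnorm(n)          # x2 has no effect in this simulation

X <- cbind(1, x1, x2)                         # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # solves the normal equations
rss1 <- sum((y - X %*% beta_hat)^2)           # residual sum of squares, full model
rss0 <- sum((y - mean(y))^2)                  # residual sum of squares, intercept only

q <- 2                                        # number of coefficients tested
p <- ncol(X)                                  # number of coefficients in full model
F_stat <- ((rss0 - rss1) / q) / (rss1 / (n - p))
p_val <- pf(F_stat, q, n - p, lower.tail = FALSE)

fit <- lm(y ~ x1 + x2)
summary(fit)$fstatistic                       # same F statistic from lm()
anova(lm(y ~ 1), fit)                         # same test as a model comparison
```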

Lecture 12 2/26/10

Paul Maiste (Lityx) - Monday Lecture

Notes on R-squared & Interactions

What is R squared?

Writing your own R functions

What are interactions for pairs of categorical variables?

Lecture 13 3/1/10

Paul Maiste

Lecture 14 3/3/10

Handwritten notes (pdf) on Akaike's information criterion & maximum likelihood

Applications of looping in R

Stepwise regression

Akaike's information criterion

Lecture 15 3/5/10

Projects

Implementing leave-one-out cross-validation

The which() command
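Leave-one-out cross-validation can be implemented with a simple loop. The sketch below uses simulated data and a simple linear model; both are assumptions for illustration.

```r
set.seed(7)
n <- 40
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.3)
d <- data.frame(x = x, y = y)

err <- numeric(n)
for (i in 1:n) {
  fit_i <- lm(y ~ x, data = d[-i, ])                     # fit without observation i
  err[i] <- d$y[i] - predict(fit_i, newdata = d[i, ])    # error on the held-out point
}
cv_mse <- mean(err^2)              # cross-validated mean squared prediction error

which(abs(err) > 2 * sd(err))      # indices of poorly predicted observations
which.max(abs(err))                # index of the worst prediction
```

For linear models the held-out errors equal the ordinary residuals divided by (1 - leverage), which gives a quick check on the loop.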

Lecture 16 3/8/10

Quiz on Wednesday

Notes on logistic regression, likelihood for logistic regression, parametric bootstrap

Analysis of students' snail ring predictions

The order() command

Modeling binary response data

Logit model

Maximum likelihood

Structure of log-likelihood for logit model

R's glm function

Parametric bootstrap for sampling variability

Simulating binary response data
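A sketch of simulating binary response data from a logit model, fitting it with glm(), and using a parametric bootstrap for sampling variability. The coefficients (-0.5, 1.2), the sample size, and B = 200 are assumptions chosen for the example.

```r
set.seed(8)
n <- 500
x <- rnorm(n)
p <- plogis(-0.5 + 1.2 * x)          # logit model: log(p / (1 - p)) = -0.5 + 1.2 x
y <- rbinom(n, size = 1, prob = p)   # simulated binary responses

fit <- glm(y ~ x, family = binomial) # maximum likelihood fit
beta_hat <- coef(fit)

# Parametric bootstrap: simulate new responses from the FITTED model and refit
B <- 200
boot <- replicate(B, {
  ystar <- rbinom(n, 1, plogis(beta_hat[1] + beta_hat[2] * x))
  coef(glm(ystar ~ x, family = binomial))
})
boot_se <- apply(boot, 1, sd)        # bootstrap standard errors, one per coefficient

# Compare with the standard errors reported by glm
cbind(bootstrap = boot_se, glm = summary(fit)$coefficients[, "Std. Error"])
```

Unlike the non-parametric bootstrap, the resampled responses here come from the fitted model rather than from the observed data.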

Lecture 17 3/10/2010

GLM logistic regression output

Fitted values - two types

Residuals

GLM predictions

Separating training and testing

Stepwise fitting

Lecture 18 3/12/2010

Susan Wierman, Executive Director

Mid-Atlantic Regional Air Management Association

Lecture 19 3/22/2010

Handwritten notes on ROC and nonparametric bootstrap

ROC curves

sapply, tapply, lapply

Lecture 20 3/24/2010

Carl Liggio - Analyzing Power Prices - notes were emailed

Lecture 21 3/26/2010

Classification trees

ROC curve comparison

adding a legend()

Various applications of sapply()

applying a function to a vector

vector output

arguments function

nonparametric bootstrap
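One application of sapply() with extra arguments is computing an ROC curve over a grid of cutoffs. The helper `roc_point()` below is a hypothetical function written for this sketch, and the data are simulated.

```r
set.seed(9)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))
score <- fitted(glm(y ~ x, family = binomial))   # predicted probabilities

# Hypothetical helper: false and true positive rates at one cutoff
roc_point <- function(cut, score, y) {
  pred <- score >= cut
  c(fpr = mean(pred[y == 0]), tpr = mean(pred[y == 1]))
}

cuts <- seq(0, 1, by = 0.01)
# sapply() applies roc_point to each cutoff; score and y are passed as extra
# arguments, and the vector output is collected into a 2-row matrix
roc <- sapply(cuts, roc_point, score = score, y = y)

plot(roc["fpr", ], roc["tpr", ], type = "l",
     xlab = "false positive rate", ylab = "true positive rate")
abline(0, 1, lty = 2)
legend("bottomright", legend = c("logistic fit", "chance"), lty = c(1, 2))
```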

Lecture 22 3/29/2010

Lecture 23 3/31/2010

Quiz Friday

Handwritten notes on survival analysis

Survival functions

Exponential survival distributions

Estimation of the survival curve when there is no censoring

Censored data

Probabilities as products of conditional probabilities

Kaplan-Meier estimator

R's aml dataset
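The Kaplan-Meier estimator is a product of estimated conditional survival probabilities, one per observed event time. The sketch below computes it by hand on made-up data (not the aml dataset).

```r
# Toy data: follow-up times and status (1 = event observed, 0 = censored)
time   <- c(2, 3, 3, 5, 6, 8, 8, 9, 12, 15)
status <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)

event_times <- sort(unique(time[status == 1]))
at_risk <- sapply(event_times, function(t) sum(time >= t))                # still under observation
deaths  <- sapply(event_times, function(t) sum(time == t & status == 1))  # events at time t

# P(survive past t | at risk at t) is estimated by 1 - deaths / at_risk;
# the survival curve is the running product of these factors
km <- cumprod(1 - deaths / at_risk)

data.frame(time = event_times, at_risk, deaths, survival = km)
```

Censored observations contribute to the at-risk counts up to their censoring time but never count as events.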

Lecture 24 4/2/2010

R's survival package

Surv() function and the objects it creates

survfit()

Lecture 25 4/5/2010

Quiz #5 - Monday April 12th

Cox's proportional hazards model

The Rossi criminal recidivism dataset (.csv file)

Stepwise Cox proportional hazards modeling in R

Cox Proportional-Hazards Regression for Survival Data by John Fox

Lecture 26 4/7/2010

Guidelines for end of semester projects:

Looking for a report with the following components:

1) Introduction to the problem - goals

2) Description of data including source

3) Statistical methodology implemented

4) Report on findings w/ text, tables & figures

5) Discussion of strengths and weaknesses of approach

6) Conclusions

7) References

8) Summary of contribution of each team member

Grading will be based on evidence of:

1) Creativity

2) Clarity of presentation

3) Thoroughness of investigation

4) Thoughtfulness of investigation

5) Effort expended

6) Critical thinking

The data() command

Quantile regression

Quantiles as solutions to a minimization problem

Regression quantiles

Analysis of the Engel food expenditure data

Quantiles as solutions to a minimization problem & quantile regression
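A sketch of a sample quantile obtained by minimizing the "check" loss, which sums (x - q) * (tau - 1{x < q}) over the data. The data, tau = 0.75, and the helper name `check_loss` are assumptions for the example.

```r
set.seed(10)
x <- rexp(501)                   # toy data
tau <- 0.75                      # quantile level

# Check loss: residuals above q are weighted by tau, those below by (1 - tau)
check_loss <- function(q, x, tau) sum((x - q) * (tau - (x < q)))

opt <- optimize(check_loss, interval = range(x), x = x, tau = tau)
opt$minimum                      # minimizer of the check loss
quantile(x, tau)                 # sample quantile for comparison
```

Quantile regression replaces the constant q by a linear function of predictors and minimizes the same loss over the coefficients.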

Lecture 27 4/9/2010

Reminder: quiz Monday

Nonparametric regression kernel smoothing

Why we smooth

Predicting temperature on a given day in the future

Finding the best window size

Smoothing in R

nonparametric regression kernels
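A sketch of kernel smoothing with the window size chosen by leave-one-out prediction error. The temperature series is synthetic, the kernel is Gaussian, and the helpers `nw()` and `loo_mse()` are hypothetical names for this example; base R also offers ksmooth() for kernel regression.

```r
set.seed(11)
day <- 1:365
temp <- 55 + 25 * sin(2 * pi * (day - 100) / 365) + rnorm(365, sd = 6)  # synthetic data

# Kernel-weighted average of y at x0, with Gaussian weights and window size h
nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)
  sum(w * y) / sum(w)
}

# Leave-one-out mean squared prediction error for a given window size
loo_mse <- function(h) {
  pred <- sapply(seq_along(day), function(i) nw(day[i], day[-i], temp[-i], h))
  mean((temp - pred)^2)
}

hs <- c(1, 2, 5, 10, 20, 40, 80)        # candidate window sizes (arbitrary grid)
cv <- sapply(hs, loo_mse)
h_best <- hs[which.min(cv)]

fit <- sapply(day, nw, x = day, y = temp, h = h_best)
plot(day, temp, xlab = "day", ylab = "temperature")
lines(day, fit)
```

A small window tracks the noise and a large one flattens the seasonal pattern; the cross-validated error balances the two.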

Lecture 28 4/12/2010

Reading sampled lines from large datasets

Autoregressive models

Moving average models

Lecture 29 4/14/2010

Time series models (continuation of written notes from last lecture)

Moving average models

Autocovariance/Autocorrelation functions (ACFs)

ACF of a moving average model

Fitting AR, MA models

ARMA models

Writing formatted output using the cat() and sink() functions

Lecture 30 4/16/2010

Time series models (continued from last lecture)

Differencing for stationarity

Returns for financial series

ARIMA models

Definition

Fitting

Prediction
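A sketch of fitting and predicting with arima() on simulated series. The AR coefficient 0.7, the series lengths, and the variable names are assumptions for the example.

```r
set.seed(12)
x_ar <- arima.sim(model = list(ar = 0.7), n = 500)   # stationary AR(1) series
acf(x_ar)                                            # sample autocorrelation function

fit_ar <- arima(x_ar, order = c(1, 0, 0))            # AR(1) = ARIMA(1,0,0)

# Integrating the series makes it nonstationary; differencing recovers stationarity
y_int <- cumsum(x_ar)
fit_i <- arima(y_int, order = c(1, 1, 0))            # ARIMA(1,1,0): AR(1) on differences

pred <- predict(fit_i, n.ahead = 10)                 # forecasts and standard errors
pred$pred
pred$se

# Log returns of a (simulated) price series are differences of log prices
price <- 100 * exp(cumsum(rnorm(250, sd = 0.01)))
ret <- diff(log(price))
```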

Lecture 31 4/19/2010

Time series models (continued) - AR(1) autocorrelation function

Bootstrapping ARIMA models

Time Series in the Frequency Domain

Lecture 32 4/21/2010

Discrete Fourier Transforms (continued)

Explaining the Fast Fourier Transform (FFT)

What do periodicities look like?

What happens when we transform white noise?

Linearity and superposition

Plotting a complex series

Why we look at real & imaginary parts

The periodogram
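A sketch of the periodogram computed from fft(), for a series with one periodic component plus white noise. The frequency (20 cycles in 256 points), the amplitude, and the scaling by 1/n are choices made for this example; other scaling conventions exist.

```r
set.seed(13)
n <- 256
tt <- 0:(n - 1)
x <- 2 * cos(2 * pi * 20 * tt / n) + rnorm(n)   # 20 cycles over the series, plus noise

d <- fft(x)                     # discrete Fourier transform (complex-valued)
pgram <- Mod(d)^2 / n           # periodogram ordinates
freq <- (0:(n - 1)) / n         # frequencies in cycles per time step

half <- 2:(n / 2)               # drop frequency 0, keep frequencies below 1/2
plot(freq[half], pgram[half], type = "h",
     xlab = "frequency", ylab = "periodogram")

peak <- freq[half][which.max(pgram[half])]      # frequency of the largest ordinate
```

White noise alone gives a periodogram with no dominant frequency; the periodic component shows up as a single tall spike.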

Lecture 33 4/23/2010

More properties of discrete Fourier transforms

Spectrum of an AR(1)

Smoothing to estimate a continuous spectrum

Multivariate autoregression models using the mAr library

Sampling using mAr.sim()

Estimation using mAr.est()

Lecture 34 4/26/2010

R script for Lecture 34 (part 1 - mAR analysis of dow+treasuries)

R script for Lecture 34 (part 2 - cluster analysis)

Cluster analysis with hclust()
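A minimal sketch of hierarchical clustering with hclust() on made-up two-dimensional points. The two well-separated groups and the choice of complete linkage are assumptions for the example.

```r
set.seed(14)
pts <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),   # 20 points near (0, 0)
             matrix(rnorm(40, mean = 10), ncol = 2))   # 20 points near (10, 10)

hc <- hclust(dist(pts), method = "complete")   # cluster on Euclidean distances
plot(hc)                                       # dendrogram

grp <- cutree(hc, k = 2)                       # cut the tree into two clusters
table(cluster = grp, truth = rep(1:2, each = 20))
```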

Lecture 35 4/28/2010

Introduction to spatial statistics

John Snow & the 1854 cholera outbreak

Baddeley's analyzing point patterns in R (200 pages)

Homogeneous Poisson point processes (complete spatial randomness)

Inhomogeneous Poisson point processes

Lecture 36 4/30/2010

Lecture 37 5/3/2010

Baltimore homicide data

Creating a ppp object with a polygonal boundary

Image objects

Density plots

Perspective plots

Contour plots

Fitting a homogeneous Poisson point process

Fitting an inhomogeneous Poisson point process

Lecture 38 5/5/2010

Operations on windows

intersections

unions

complements

Tessellations

Dirichlet/Voronoi tessellation

Delaunay tessellation

An analog of quadrat sampling with a covariate

Fitting a ppp with covariates

Inference for parameters via vcov

Generating realizations from the fitted model

Matérn type I and type II processes

Lecture 39 5/7/2010

Future topics

- Spearman's rank correlation cor( , method="spearman")
- correlation tests using cor.test()

input/output

- sink()

- cat()
- scan()
- ranking & sorting

- rank()
- sort()
- order()
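A short sketch of the ranking, sorting, correlation, and output functions listed above. The data are made up, and the output file is a temporary file created only for the example.

```r
x <- c(30, 10, 20)
sort(x)       # 10 20 30  (the values in increasing order)
order(x)      # 2 3 1     (positions of the smallest, next smallest, ...)
rank(x)       # 3 1 2     (rank of each element)
x[order(x)]   # same result as sort(x)

set.seed(16)
a <- rnorm(30)
b <- a + rnorm(30)
rho_s <- cor(a, b, method = "spearman")   # Spearman's rank correlation
rho_r <- cor(rank(a), rank(b))            # Pearson correlation of the ranks
cor.test(a, b, method = "spearman")       # test of zero rank correlation

out <- tempfile(fileext = ".txt")         # temporary file for the example
sink(out)                                 # divert console output to the file
cat("Spearman correlation:", format(rho_s, digits = 3), "\n")
sink()                                    # restore output to the console
readLines(out)
```

Spearman's correlation is the Pearson correlation applied to ranks, which is why `rho_s` and `rho_r` agree.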