550.400 Mathematical Modeling and Consulting
Spring 2010
Naiman
Lecture notes

Lecture 1 - 1/25/10

Overview of the course, syllabus. Getting acquainted.

R installation

Getting started with R:

writing scripts and passing commands to the console

creating a vector of numbers, strings or TRUE/FALSE values

the objects() command

the history() command

copying the R icon to a folder

setting up the "Start in" folder

leaving R and returning

organizing projects in folders

Lecture 2 - 1/27/10

Reading assignment: Read Chapters 1 & 2 in Introduction to R by Friday, and Chapters 3 & 4 by Monday

R Script for Lecture 2

Some statistical concepts discussed:

Properties of the data vs. properties of the idealized model that describes how the data are generated

empirical cdf for data

quantile function for data

Statistical model: data come from an exponential distribution:

cdf: F(x) = 1-exp(-lambda*x) for x>0

pdf: f(x) = lambda*exp(-lambda*x) for x>0

fitting based on data by estimating lambda: lambda-hat=1/mean of data

If we sample a distribution F to get X1,X2,.... then F(X1),F(X2),... should look like a Uniform(0,1) sample
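A minimal sketch of the exponential fit and the Uniform(0,1) check described above. The data are simulated here, and the true rate of 2 and the sample size are arbitrary assumptions made only for illustration.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
lambda <- 2                       # assumed true rate for the simulation
x <- rexp(1000, rate = lambda)    # simulated data

lambda_hat <- 1 / mean(x)         # lambda-hat = 1 / mean of data

# Fitted cdf evaluated at the data: should look roughly like a Uniform(0,1) sample
u <- 1 - exp(-lambda_hat * x)     # same as pexp(x, rate = lambda_hat)
hist(u, main = "Fitted cdf applied to the data")

# Empirical cdf of the data with the fitted exponential cdf on top
plot(ecdf(x), main = "Empirical vs. fitted cdf")
curve(pexp(x, rate = lambda_hat), add = TRUE, col = "red")
```

If the exponential model is reasonable, the histogram of u should be roughly flat.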

Working with R

• working directory
• getwd()
• setwd()
• dir()
• source()
• getting help
• ?
• help.search() & apropos()
• http://rseek.org/
• using the R editor to write scripts and run commands
• objects
• listing
• creating
• removing
• typeof
• saving workspace
• recalling previous commands
• up/down arrow keys
• editing the Rconsole file (on my Windows 7 system, found at C:\Program Files (x86)\R\R-2.10.1\etc)
• arithmetic operations on vectors
• transformations of vectors
• exp()
• sqrt()
• log()
• combining vectors
• random number generation
• rnorm()
• runif()
• internal datasets
• sunspots
• elementary data summaries
• sum()
• mean()
• var(x)
• sd(x)
• max(x)
• quantile(x)
• hist(x)
• plot(x)

Lecture 3 1/29/10

The sample() command

The (non-parametric) bootstrap method for determining sampling variability for an estimator (illustrated with an exponential rate parameter estimate).
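A rough sketch of the idea using sample(). The data are simulated stand-ins, and the number of resamples B is an arbitrary choice, not a value from the lecture.

```r
set.seed(2)
x <- rexp(50, rate = 1.5)          # stand-in data; rate 1.5 is an arbitrary choice
B <- 2000                          # number of bootstrap resamples (assumed)

# Resample the data with replacement and re-estimate the rate each time
boot_est <- replicate(B, 1 / mean(sample(x, replace = TRUE)))

boot_se <- sd(boot_est)                       # bootstrap standard error of lambda-hat
boot_ci <- quantile(boot_est, c(0.025, 0.975)) # percentile interval
boot_se
boot_ci
```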

Working with R

• lists
• regular sequences
• seq() and :
• rep()
• parameters in functions - e.g. seq(), hist(), rep()
• logical
• logic for individual values
• logic for vectors
• missing values
• coding missing values
• checking for missing with is.na()
• the na.rm option
• names in vectors
• selecting elements of a vector using x[v], v a vector & assignment
• v a vector of positive integers
• v a vector of negative integers
• v a logical vector
• names of vector elements
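The subscripting and missing-value items above might look like this in practice. The vector and its names are made up for illustration.

```r
x <- c(a = 10, b = 20, c = NA, d = 40)   # named vector with one missing value

x[c(1, 4)]        # positive integers: keep elements 1 and 4
x[-c(1, 4)]       # negative integers: drop elements 1 and 4
x[!is.na(x)]      # logical vector: keep the non-missing elements
x["b"]            # select by name

mean(x)                  # NA, because of the missing value
mean(x, na.rm = TRUE)    # missing value removed before averaging

x[is.na(x)] <- 0         # assignment through a logical subscript
x
```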

Lecture 4 2/1/10

Working with R

• is
• is.vector()
• is.character()
• is.logical()
• is.numeric()
• is.list()
• [[ ]] vs [ ] for lists
• abbreviating names in lists
• adding an element to a list
• method 1: using [[ ]]
• method 2: using [ ]
• concatenating lists using c()
• matrices
• creating
• subscripting
• dimension of
• picking out rows & columns
• submatrices
• transposing
• data frames
• creating from a list of columns
• from a file
• from the internet
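A small sketch of the list operations listed above. The list contents are invented for illustration.

```r
L <- list(name = "Ann", scores = c(90, 85))   # made-up list

L["scores"]        # [ ] returns a list of length 1
L[["scores"]]      # [[ ]] returns the element itself (a numeric vector)
L$sc               # names can be abbreviated with $ when the match is unique

L[["age"]] <- 30                # method 1: add an element using [[ ]]
L["city"] <- list("Baltimore")  # method 2: add an element using [ ]
L2 <- c(L, list(year = 2010))   # concatenate lists using c()
names(L2)
```

The difference matters when the result is passed on: mean(L[["scores"]]) works, while mean(L["scores"]) does not, because the latter is still a list.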

Lecture 5 2/3/10

Working with R

• search path
• search()
• attach() detach()
• categorical data
• factors
• the table() command & cross-tabulation
• more on the scan() command
• strsplit()
• unlist()
• more on subscripting - the Monte-Carlo method
• a conditional probability - the truncated normal distribution
• a conditional expectation
• normal distribution
• rnorm()
• qnorm()
• dnorm()
• pnorm()
• QQplots & qqnorm()
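The Monte-Carlo use of subscripting can be sketched as follows. The cutoffs 1 and 2 and the sample size are arbitrary assumptions. The exact values are included only as a check on the simulation.

```r
set.seed(3)
z <- rnorm(100000)                      # standard normal sample

# P(Z > 2 | Z > 1): subscript to the cases where the condition holds
p_mc    <- mean(z[z > 1] > 2)
p_exact <- (1 - pnorm(2)) / (1 - pnorm(1))

# E(Z | Z > 1): mean of the truncated normal sample
e_mc    <- mean(z[z > 1])
e_exact <- dnorm(1) / (1 - pnorm(1))

c(p_mc = p_mc, p_exact = p_exact, e_mc = e_mc, e_exact = e_exact)

qqnorm(z[1:500])                        # QQ plot for part of the sample
abline(0, 1)
```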

Lecture 6 2/5/10

R script for Lecture 6

• NHANES urinary heavy metals data
• installing the foreign library
• the complete.cases() function
• correlation coefficients & correlation matrices
• prediction & using separate training and testing data
• heavy-tailed distributions
• log transformation
• lognormal distribution
• correlations
• pearson & best linear prediction

Lecture 7 2/15/10

Reading assignment - Chapters 7 & 8 by Wednesday

Quiz Friday - sample questions will be posted on Wednesday

R script for Lecture 7

pop data for states (link to Wikipedia)

.csv file - after importing into Excel, copy & pasting table, exporting to .csv file

tax data for states (link to Wikipedia)

.csv file - after importing into Excel, copy & pasting table, exporting to .csv file

• merging data frames - merge() with applications
• simple example height/weight data
• state population vs taxes
• NHANES - merging lab & demographic data
• substrings using the substr() command
• converting strings to numeric values with as.numeric()
• plotting all pairs using the pairs() command
• boxplot()
• two sample tests
• t.test()
• wilcox.test()
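A sketch of merge(), substr(), as.numeric() and the two-sample tests on made-up data. The data frames, column names, and the string "year2010" are hypothetical and are not the course files.

```r
# Hypothetical height and weight tables keyed by an id column
heights <- data.frame(id = c(1, 2, 3, 4), height = c(160, 172, 181, 168))
weights <- data.frame(id = c(2, 3, 4, 5), weight = c(70, 85, 64, 77))

both     <- merge(heights, weights, by = "id")             # ids present in both
all_rows <- merge(heights, weights, by = "id", all = TRUE) # all ids, NA where missing

# Pull the digits out of a string and convert them to a number
yr <- as.numeric(substr("year2010", 5, 8))

# Two-sample tests on simulated groups
set.seed(15)
a <- rnorm(25, mean = 0)
b <- rnorm(25, mean = 1)
t.test(a, b)$p.value
wilcox.test(a, b)$p.value
boxplot(list(a = a, b = b))
```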

Lecture 8 2/17/10

R script for Lecture 8

• the simple linear model assumptions
• Y[i]=alpha+beta*x[i]+e[i]
• e[i] ~ N(0,sigma^2) & independent
• alpha, beta, sigma unknown
• fitting simple linear models using LM<-lm(y~x)
• summary(LM)
• intercept and slope estimates for alpha and beta
• residual standard error estimates
• t-tests for hypotheses about alpha and beta
• fitted values
• residuals
• inspecting residuals vs. fitted values
• examining residuals for normality
• plot(LM)
• predict.lm()
• confidence intervals are for expected values of Y
• prediction intervals are for specific Y's
• adding to an existing plot
• add a line/curve using lines()
• axis labels xlab= ylab=
• plot title using main=
• putting multiple plots on the same page using mfrow=c(,)
• plot/graphics devices
• dev.list()
• windows()
• pdf() to create a pdf graphics device
• dev.off() to close the current graphics device, or a specified graphics device
• dev.cur() for the current graphics device
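A compact sketch tying together lm(), predict() with both interval types, and the plotting and device commands above. The data are simulated with assumed values alpha = 1, beta = 2, sigma = 1.5, and the output file name is hypothetical.

```r
set.seed(4)
x <- runif(40, 0, 10)
y <- 1 + 2 * x + rnorm(40, sd = 1.5)    # simulated from the simple linear model

LM <- lm(y ~ x)
summary(LM)                             # estimates, residual standard error, t-tests

new <- data.frame(x = seq(0, 10, length.out = 50))
conf_int <- predict(LM, new, interval = "confidence")  # for the expected value of Y
pred_int <- predict(LM, new, interval = "prediction")  # for a specific new Y

pdf(file.path(tempdir(), "lm_demo.pdf"))  # open a pdf graphics device
par(mfrow = c(1, 2))                      # two plots on the same page
plot(x, y, xlab = "x", ylab = "y", main = "Fit with intervals")
lines(new$x, conf_int[, "fit"])
lines(new$x, conf_int[, "lwr"], lty = 2)
lines(new$x, conf_int[, "upr"], lty = 2)
lines(new$x, pred_int[, "lwr"], lty = 3)
lines(new$x, pred_int[, "upr"], lty = 3)
plot(fitted(LM), resid(LM), xlab = "fitted values", ylab = "residuals",
     main = "Residuals vs. fitted")
dev.off()                                 # close the pdf device
```

The prediction band is always wider than the confidence band, because it includes the variability of the new error term.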

Lecture 9 2/19/10

Linear models continued

• the no-intercept specification
• multiple predictor variables
• what is an additive model?
• what are interactions?
• categorical variables as predictors
• converting to factor type using as.factor()

Lecture 10 2/22/10

Notes on normal distributions & the linear model - pdf

R script for Lecture 10

Univariate normal distribution

• mean and variance
• if X has a normal distribution, what is the distribution of aX+b?
• how the chi-square distribution arises

Bivariate normal distribution

• pdf
• mean vector & covariance matrix
• linear combinations have univariate normal distributions
• how to sample the bivariate normal distribution
• how the chi-square distribution arises

Multivariate normal distribution

• mean and variance
• linear combinations have univariate normal distributions
• how the chi-square distribution arises

The linear model in matrix form

Are the Pearson Father and Son data sampled from a bivariate normal distribution?

Example of a heavy-tailed distribution (ratio of normals with mean 0) to show that not all distributions can be assumed to have means

Testing whether any variables help to predict workers' wages - leading up to the F-test

Lecture 11 2/24/10

Notes on normal distributions & the linear model (stuff added) - pdf

R script for Lecture 11

Multivariate normal distribution properties

F distribution

Linear model in matrix form

The F-test

Doing the matrix calculations performed by lm() in R

F-test in R
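A sketch of the overall F-test done by hand with matrix calculations and then checked against lm(). The data are simulated and the coefficients are arbitrary assumptions.

```r
set.seed(5)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)            # simulated; x2 has no real effect

X <- cbind(1, x1, x2)                    # design matrix
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # least squares estimate
rss1 <- sum((y - X %*% beta_hat)^2)      # residual sum of squares, full model
rss0 <- sum((y - mean(y))^2)             # residual sum of squares, intercept only

p <- ncol(X) - 1                         # number of predictors being tested
Fstat <- ((rss0 - rss1) / p) / (rss1 / (n - p - 1))
pval  <- 1 - pf(Fstat, p, n - p - 1)

LM <- lm(y ~ x1 + x2)
summary(LM)$fstatistic                   # should agree with Fstat
anova(lm(y ~ 1), LM)                     # the same test via anova()
```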

Lecture 12 2/26/10

Paul Maiste (Lityx) - Monday Lecture

What is R squared?

What are interactions for pairs of categorical variables?

Lecture 13 3/1/10

Paul Maiste

Lecture 14 3/3/10

R script for Lecture 14

Applications of looping in R

Stepwise regression

Akaike's information criterion

Lecture 15 3/5/10

Projects

R script for Lecture 15

Implementing leave-one-out cross-validation

The which() command
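One way leave-one-out cross-validation might be implemented with a loop, followed by which() to locate poorly predicted cases. The data are simulated and the threshold of 1 is an arbitrary choice.

```r
set.seed(6)
n <- 30
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)      # simulated data
dat <- data.frame(x = x, y = y)

err <- numeric(n)
for (i in 1:n) {
  fit <- lm(y ~ x, data = dat[-i, ])             # fit without observation i
  err[i] <- dat$y[i] - predict(fit, dat[i, ])    # error predicting the held-out point
}

mean(err^2)               # leave-one-out estimate of prediction error
which(abs(err) > 1)       # indices of observations predicted badly
which.max(abs(err))       # index of the worst prediction
```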

Lecture 16 3/8/10

Quiz on Wednesday

R script for Lecture 16

Notes on logistic regression, likelihood for logistic regression, parametric bootstrap

Analysis of students' snail ring predictions

The order() command

Modeling binary response data

Logit model

Maximum likelihood

Structure of log-likelihood for logit model

R's glm function

Parametric bootstrap for sampling variability

Simulating binary response data
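A sketch of the logit model workflow: simulate binary responses, fit with glm(), write out the log-likelihood, and run a parametric bootstrap. The coefficients -0.5 and 1.2, the sample size, and the 200 bootstrap replicates are assumptions made for illustration.

```r
set.seed(7)
n <- 500
x <- rnorm(n)
p <- 1 / (1 + exp(-(-0.5 + 1.2 * x)))    # logit model with assumed coefficients
y <- rbinom(n, size = 1, prob = p)       # simulated binary responses

fit <- glm(y ~ x, family = binomial)
coef(fit)

# Log-likelihood written out directly; should agree with logLik(fit)
eta <- coef(fit)[1] + coef(fit)[2] * x
loglik <- sum(y * eta - log(1 + exp(eta)))

# Parametric bootstrap: simulate responses from the fitted model and refit
p_hat <- fitted(fit)
boot <- replicate(200, coef(glm(rbinom(n, 1, p_hat) ~ x, family = binomial)))
apply(boot, 1, sd)                       # compare with the standard errors in summary(fit)
```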

Lecture 17 3/10/2010

GLM logistic regression output

Fitted values - two types

Residuals

GLM predictions

Separating training and testing

Stepwise fitting

Lecture 18 3/12/2010

Susan Wierman, Executive Director

Mid-Atlantic Regional Air Management Association

A Brief Introduction to Air Quality Data Analysis (ppt file)

Lecture 19 3/22/2010

R script for Lecture 19

Handwritten notes on ROC and nonparametric bootstrap

ROC curves

sapply, tapply, lapply
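A sketch of an ROC curve computed with sapply(), with tapply() and lapply() shown on the same objects. The data are simulated, and the scores come from a logistic regression fit chosen only for illustration.

```r
set.seed(8)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))             # simulated binary outcome
score <- fitted(glm(y ~ x, family = binomial))

cutoffs <- sort(unique(c(0, score, 1)))
# For each cutoff, classify as positive when score >= cutoff
tpr <- sapply(cutoffs, function(k) mean(score[y == 1] >= k))  # true positive rate
fpr <- sapply(cutoffs, function(k) mean(score[y == 0] >= k))  # false positive rate

plot(fpr, tpr, type = "l", xlab = "false positive rate",
     ylab = "true positive rate", main = "ROC curve")
abline(0, 1, lty = 2)                    # reference line for a useless classifier

tapply(score, y, mean)                   # mean score within each outcome group
lapply(split(score, y), range)           # range of scores per group, returned as a list
```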

Lecture 20 3/24/2010

Carl Liggio - Analyzing Power Prices - notes were emailed

Lecture 21 3/26/2010

R script for Lecture 21

Classification trees

ROC curve comparison

Various applications of sapply()

applying a function to a vector

vector output

function arguments

nonparametric bootstrap

Lecture 22 3/29/2010

Yichen Qin - Rtcltk package Lecture (examples)

Lecture 23 3/31/2010

Quiz Friday

R Script for Lecture 23

Survival functions

Exponential survival distributions

Estimation of the survival curve when there is no censoring

Censored data

Probabilities as products of conditional probabilities

Kaplan-Meier estimator

R's aml dataset

Lecture 24 4/2/2010

R script for Lecture 24

R's survival package

Surv() function and objects it creates

The survfit() function
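A sketch comparing a hand-computed Kaplan-Meier estimate (a product of conditional survival probabilities) with the output of survfit(), using the aml dataset from the survival package. Ignoring the treatment group in aml is a simplification made here for illustration.

```r
library(survival)                        # provides aml, Surv() and survfit()

time   <- aml$time
status <- aml$status                     # 1 = event observed, 0 = censored

t_event <- sort(unique(time[status == 1]))               # distinct event times
at_risk <- sapply(t_event, function(s) sum(time >= s))   # number still at risk
deaths  <- sapply(t_event, function(s) sum(time == s & status == 1))

# S(t) = product over event times t_j <= t of (1 - deaths_j / at_risk_j)
km <- cumprod(1 - deaths / at_risk)

fit <- survfit(Surv(time, status) ~ 1)   # Surv() builds the censored response
plot(fit, conf.int = FALSE)              # step function from the package
points(t_event, km)                      # hand-computed values on top
```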

Lecture 25 4/5/2010

Quiz #5 - Monday April 12th

R script/template for HW#5

R script for Lecture 25

Cox's proportional hazards model

The Rossi criminal recidivism dataset (.csv file)

Stepwise Cox proportional hazards modeling in R

Cox Proportional-Hazards Regression for Survival Data by John Fox

Lecture 26 4/7/2010

Guidelines for end of semester projects:
Looking for a report with the following components:

1) Introduction to the problem - goals
2) Description of data including source
3) Statistical methodology implemented
4) Report on findings w/ text, tables & figures
5) Discussion of strengths and weaknesses of approach
6) Conclusions
7) References
8) Summary of contribution of each team member

Grading will be based on evidence of:

1) Creativity
2) Clarity of presentation
3) Thoroughness of investigation
4) Thoughtfulness of investigation
5) Effort expended
6) Critical thinking

R script for Lecture 26

The data() command

Quantile regression

Quantiles as solutions to a minimization problem

Regression quantiles

Analysis of the Engel food expenditure data

Quantiles as solutions to a minimization problem & quantile regression
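A sketch of the minimization view of a quantile: the tau-th sample quantile minimizes the sum of "check" losses. The data are simulated and the helper name check_loss is made up for this example. Regression quantiles replace q with a linear function of the predictors in the same loss.

```r
set.seed(9)
x <- rexp(201)                      # simulated data; odd n gives a unique median
tau <- 0.5

# Check loss: tau * (x - q) when x >= q, (1 - tau) * (q - x) when x < q
check_loss <- function(q, x, tau) sum((x - q) * (tau - (x < q)))

opt <- optimize(check_loss, interval = range(x), x = x, tau = tau)
opt$minimum                         # numerical minimizer
quantile(x, tau)                    # should be close to the minimizer
```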

Lecture 27 4/9/2010

Reminder: quiz Monday

Nonparametric regression kernel smoothing

Why we smooth

Predicting temperature on a given day in the future

Finding the best window size

Smoothing in R

nonparametric regression kernels
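A sketch of kernel smoothing and of choosing the window size by leave-one-out prediction error. The temperature series is invented, the candidate window sizes are arbitrary, and nw_loo is a hypothetical helper written for this example. Note that ksmooth() scales its bandwidth differently from the standard deviation h used in the helper.

```r
set.seed(10)
day  <- 1:365
temp <- 55 + 25 * sin(2 * pi * (day - 100) / 365) + rnorm(365, sd = 6)  # made-up data

# Smoothing in R with a narrow and a wide window
fit_narrow <- ksmooth(day, temp, kernel = "normal", bandwidth = 5)
fit_wide   <- ksmooth(day, temp, kernel = "normal", bandwidth = 60)
plot(day, temp, col = "grey", main = "Kernel smoothing")
lines(fit_narrow, col = "red")
lines(fit_wide, col = "blue")

# Leave-one-out error of a normal-kernel weighted average with window size h
nw_loo <- function(h) {
  pred <- sapply(day, function(i) {
    w <- dnorm((day[-i] - day[i]) / h)   # weights for all days except day i
    sum(w * temp[-i]) / sum(w)
  })
  mean((temp - pred)^2)
}

hs <- c(2, 5, 10, 20, 40, 80)            # candidate window sizes (assumed)
cv <- sapply(hs, nw_loo)
hs[which.min(cv)]                        # window size with the smallest error
```

Very small windows chase the noise and very large windows flatten the seasonal pattern, which is why the error curve has a minimum in between.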

Lecture 28 4/12/2010

Reading sampled lines from large datasets

Time series models

Autoregressive models

Moving average models

Lecture 29 4/14/2010

Time series models (continuation of written notes from last lecture)

Moving average models

Autocovariance/Autocorrelation functions (ACFs)

ACF of a moving average model

Fitting AR, MA models

ARMA models
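A sketch of the ACF of a moving average model and of fitting MA and AR models with arima(). The MA(1) coefficient 0.8 and the series length are assumptions.

```r
set.seed(12)
theta <- 0.8                                   # assumed MA(1) coefficient
x <- arima.sim(model = list(ma = theta), n = 2000)

acf(x, lag.max = 10)                           # sample ACF: near zero beyond lag 1
rho1 <- theta / (1 + theta^2)                  # theoretical lag-1 autocorrelation
theo <- ARMAacf(ma = theta, lag.max = 3)       # theoretical ACF at lags 0 to 3
theo

fit_ma <- arima(x, order = c(0, 0, 1))         # fit an MA(1)
fit_ar <- arima(x, order = c(1, 0, 0))         # fit an AR(1) to the same data
c(MA = AIC(fit_ma), AR = AIC(fit_ar))          # compare the two fits
```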

Writing formatted output using cat() and sink() functions

Lecture 30 4/16/2010

Differencing for stationarity

Returns for financial series

ARIMA models

Definition

Fitting

Prediction
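A sketch of differencing, fitting an ARIMA model, and predicting. The series is simulated as ARIMA(1,1,0) with an assumed AR coefficient of 0.6. The forecast horizon and the 1.96 multiplier for approximate 95% limits are also choices made here.

```r
set.seed(11)
# Simulate an ARIMA(1,1,0) series (assumed model, for illustration)
y <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.6), n = 300)

dy <- diff(y)                            # differencing for stationarity
acf(dy)                                  # autocorrelations of the differenced series

fit <- arima(y, order = c(1, 1, 0))      # fit to the undifferenced series
fit

fc <- predict(fit, n.ahead = 10)         # forecasts and their standard errors
limits <- cbind(lower = fc$pred - 1.96 * fc$se,
                upper = fc$pred + 1.96 * fc$se)
limits
```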

Lecture 31 4/19/2010

R script for Lecture 31

Time series models (continued) - AR(1) autocorrelation function

Bootstrapping ARIMA models

Time Series in the Frequency Domain

Lecture 32 4/21/2010

R script for Lecture 32

Explaining the Fast Fourier Transform (FFT)

What do periodicities look like?

What happens when we transform white noise?

Linearity and superposition

Plotting a complex series

Why we look at real & imaginary parts

The periodogram
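A sketch of the periodogram computed from fft(), with a check of linearity. The series is an invented sum of two sinusoids, and the scaling of the periodogram by 1/n is one common convention among several.

```r
set.seed(16)
n  <- 256
tt <- 0:(n - 1)
x  <- 2 * cos(2 * pi * 16 * tt / n) + sin(2 * pi * 40 * tt / n)  # two periodicities

d     <- fft(x)                     # complex-valued discrete Fourier transform
pgram <- Mod(d)^2 / n               # periodogram ordinates
freq  <- (0:(n - 1)) / n            # frequencies in cycles per observation

half <- 2:(n / 2)                   # skip frequency 0, keep the lower half
plot(freq[half], pgram[half], type = "h",
     xlab = "frequency", ylab = "periodogram")
plot(Re(d), Im(d), main = "Real vs. imaginary parts")  # plotting a complex series

# Linearity: the transform of a sum is the sum of the transforms
noise <- rnorm(n)                   # white noise
lin_ok <- all.equal(fft(x + noise), fft(x) + fft(noise))
lin_ok
```

The two spikes in the periodogram sit at frequencies 16/n and 40/n, the frequencies used to build the series.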

Lecture 33 4/23/2010

More properties of discrete Fourier transforms

Spectrum of an AR(1)

Smoothing to estimate a continuous spectrum

Multivariate autoregression models using the mAr library

Sampling using mAr.sim()

Estimation using mAr.est()

Lecture 34 4/26/2010

R script for Lecture 34 (part 1 - mAR analysis of dow+treasuries)

R script for Lecture 34 (part 2 - cluster analysis)

Cluster analysis with hclust()
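A minimal sketch of hierarchical clustering with hclust(). The two groups of points are simulated, and complete linkage is an arbitrary choice of method.

```r
set.seed(13)
# Two made-up groups of points in the plane
pts <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))

hc <- hclust(dist(pts), method = "complete")   # agglomerative clustering
plot(hc)                                       # dendrogram
groups <- cutree(hc, k = 2)                    # cut the tree into two clusters
table(groups, truth = rep(1:2, each = 10))     # compare with the true grouping
```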

Lecture 35 4/28/2010

R script

Introduction to spatial statistics

John Snow & the 1854 cholera outbreak

Baddeley's Analysing Spatial Point Patterns in R (200 pages)

Homogeneous Poisson point processes (complete spatial randomness)

Inhomogeneous Poisson point processes
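A base-R sketch of simulating both kinds of Poisson point process on the unit square, without any spatial package. The intensity of 100 and the intensity function rate() are assumptions, and thinning is one standard way to obtain the inhomogeneous case.

```r
set.seed(14)
lambda <- 100                            # assumed intensity on the unit square
n <- rpois(1, lambda)                    # the number of points is Poisson
x <- runif(n)
y <- runif(n)                            # given n, locations are uniform

# Inhomogeneous process by thinning: keep (x, y) with probability rate(x, y) / lambda
rate <- function(x, y) lambda * x        # made-up intensity, increasing in x
keep <- runif(n) < rate(x, y) / lambda

par(mfrow = c(1, 2))
plot(x, y, main = "Homogeneous", asp = 1)
plot(x[keep], y[keep], main = "Inhomogeneous (thinned)", asp = 1)
```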

Lecture 36 4/30/2010

R script

Lecture 37 5/3/2010

Baltimore homicide data

Creating a ppp object with a polygonal boundary

Image objects

Density plots

Perspective plots

Contour plots

Fitting a homogeneous Poisson point process

Fitting an inhomogeneous Poisson point process

Lecture 38 5/5/2010

R script

Operations on windows

intersections

unions

complements

Tessellations

Dirichlet/Voronoi tessellation

Delaunay tessellation

An analog of quadrat sampling with a covariate

Fitting a ppp with covariates

Inference for parameters via vcov

Generating realizations from the fitted model

Matern type I and type II processes

Lecture 39 5/7/2010

Future topics

• Spearman's rank correlation cor( , method="spearman")
• correlation tests using cor.test()

input/output

• sink()
• cat()
• scan()
• ranking & sorting
• rank()
• sort()
• order()
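A short sketch of the ranking, sorting and input/output commands listed above, plus Spearman's correlation. The vectors are invented and the output file is a temporary file created only for this example.

```r
x <- c(b = 3.2, a = 1.5, d = 4.8, c = 2.1)   # made-up values
y <- c(10, 2, 50, 7)

sort(x)                           # values in increasing order
order(x)                          # positions that sort x: 2 4 1 3
rank(x)                           # rank of each element: 3 1 4 2
x[order(x, decreasing = TRUE)]    # reorder using subscripting

rho <- cor(x, y, method = "spearman")   # Pearson correlation of the ranks
cor.test(x, y, method = "spearman")

out <- tempfile(fileext = ".txt")       # temporary output file
sink(out)                               # divert console output to the file
cat("n =", length(x), "\n")             # formatted output
sink()                                  # restore console output
tokens <- scan(out, what = "", quiet = TRUE)  # read the file back as strings
tokens
```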