Choosing the appropriate method
It is essential, therefore, that you can answer the following questions:
- Which of your variables is the response variable?
- Which are the explanatory variables?
- Are the explanatory variables continuous or categorical, or a mixture of both?
- What kind of response variable do you have: is it a continuous measurement, a count, a proportion, a time at death, or a category?
Choosing the appropriate method
Explanatory variables are | Method |
all continuous | Regression |
all categorical | Analysis of variance (ANOVA) |
both continuous and categorical | Analysis of covariance (ANCOVA) |
Choosing the appropriate method
Response variable | Method |
(a) Continuous | Normal regression, ANOVA or ANCOVA |
(b) Proportion | Logistic regression |
(c) Count | Log-linear models |
(d) Binary | Binary logistic analysis |
(e) Time at death | Survival analysis |
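As a rough sketch, the two tables above can be mapped onto R's model-fitting functions as follows. The data sets (cars, PlantGrowth, mtcars, InsectSprays) ship with R, but choosing them and these particular variables is an assumption made for illustration only; the small dose-response data are made up.

```r
# Continuous response -> normal linear models fitted with lm() / aov()
m_regression <- lm(dist ~ speed, data = cars)              # continuous explanatory variable
m_anova      <- aov(weight ~ group, data = PlantGrowth)    # categorical explanatory variable
m_ancova     <- lm(mpg ~ wt + factor(am), data = mtcars)   # continuous + categorical

# Count response -> log-linear model (Poisson errors, log link)
m_count <- glm(count ~ spray, family = poisson, data = InsectSprays)

# Binary response -> binary logistic model (binomial errors, logit link)
m_binary <- glm(am ~ wt, family = binomial, data = mtcars)

# Proportion response -> logistic regression on cbind(successes, failures);
# these dose-response numbers are hypothetical
dose  <- c(1, 2, 3, 4, 5)
dead  <- c(1, 3, 5, 8, 9)
alive <- 10 - dead
m_proportion <- glm(cbind(dead, alive) ~ dose, family = binomial)

# Time-at-death responses are usually handled with the survival package
# (e.g. survival::survfit() and survival::coxph()); not shown here.
```

The point of the sketch is that the kind of response variable selects the fitting function and error family, while the kind of explanatory variables only changes the right-hand side of the formula.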
The best model is the one that leaves the least unexplained variation (the minimal residual variation).
Choosing the appropriate method
- It is very important to understand that there is not just one model;
- a large number of different, more or less plausible models might be fitted to any given set of data.
Maximum Likelihood
We define "best" in terms of maximum likelihood:
- given the data,
- and given our choice of model,
- what values of the parameters of that model make the observed data most likely?
We judge the model on the basis of how likely the data would be if the model were correct.
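A minimal sketch of the idea, assuming made-up count data and a Poisson model (neither is taken from the slides): the log-likelihood is written as a function of the parameter, and optimize() searches for the value that makes the observed data most likely.

```r
# Hypothetical count data, assumed to follow a Poisson distribution
counts <- c(2, 4, 3, 5, 1, 3, 4, 2)

# Log-likelihood of the data for a given value of the rate parameter lambda
loglik <- function(lambda) sum(dpois(counts, lambda, log = TRUE))

# Search for the lambda that makes the observed data most likely
fit <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE, tol = 1e-8)
mle <- fit$maximum

# For the Poisson model the maximum likelihood estimate is the sample mean
c(mle = mle, sample_mean = mean(counts))
```

For models fitted with lm() or glm() this search is done internally; logLik() reports the maximized log-likelihood of a fitted model.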
Ockham's Razor
The principle is attributed to William of Ockham, who insisted that, given a set of equally good explanations for a given phenomenon, the correct explanation is the simplest explanation. The most useful statement of the principle for scientists is: when you have two competing theories that make exactly the same predictions, the simpler one is the better.
Ockham's Razor
For statistical modelling, the principle of parsimony means that:
- models should have as few parameters as possible;
- linear models should be preferred to non-linear models;
- experiments relying on few assumptions should be preferred to those relying on many;
- models should be pared down until they are minimal adequate;
- simple explanations should be preferred to complex explanations.
Types of Models
Fitting models to data is the central function of R. There are no fixed rules and no absolutes. The object is to determine a minimal adequate model from a large set of potential models. For this reason we look at the following types of models:
- the null model;
- the minimal adequate model;
- the maximal model; and
- the saturated model.
The Null model
- Just one parameter, the overall mean $\bar{y}$
- Fit: none; SSE = SSY
- Degrees of freedom: n-1
- Explanatory power of the model: none
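The properties of the null model can be checked directly in R. This is a sketch with hypothetical ozone readings (illustrative values, not the course data); the formula `y ~ 1` fits only an intercept.

```r
# Hypothetical ozone readings (illustrative values only)
ozone <- c(3, 4, 4, 5, 6, 6, 7, 6, 7, 7, 8, 9, 9, 10)

# The null model has a single parameter: the overall mean
m0 <- lm(ozone ~ 1)

coef(m0)                               # the intercept equals mean(ozone)
SSY <- sum((ozone - mean(ozone))^2)    # total sum of squares
SSE <- deviance(m0)                    # residual sum of squares of the model
df  <- df.residual(m0)                 # n - 1 residual degrees of freedom
c(SSY = SSY, SSE = SSE, df = df)
```

Because nothing is explained, the residual sum of squares SSE is identical to the total sum of squares SSY.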
Adding Information
- model with $0 \le p' \le p$ parameters
- Fit: less than the maximal model, but not significantly so
- Degrees of freedom: $n - p' - 1$
- Explanatory power of the model: $r^2 = \frac{SSR}{SSY}$
Adding Information
- model with $0 \le p' \le p$ parameters
- Fit: less than the maximal model, but not significantly so
- Degrees of freedom: $n - p' - 1$
- Explanatory power of the model: $r^2 = \frac{SSA}{SSY}$
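The explanatory power can be computed by hand from the sums of squares. The sketch below assumes the built-in cars data set as a stand-in example; in a one-way ANOVA the same ratio is formed with the treatment sum of squares SSA in place of SSR.

```r
m <- lm(dist ~ speed, data = cars)

SSY <- sum((cars$dist - mean(cars$dist))^2)   # total variation
SSE <- sum(residuals(m)^2)                    # unexplained variation
SSR <- SSY - SSE                              # variation explained by the regression

r2 <- SSR / SSY

# The hand-computed value agrees with the one reported by summary()
all.equal(r2, summary(m)$r.squared)
```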
Saturated/Maximal Model
saturated model
- One parameter for every data point
- Fit: perfect
- Degrees of freedom: none
- Explanatory power of the model: none
maximal model
- Contains all $p$ factors, interactions and covariates that are of interest
- Degrees of freedom: $n - p - 1$
- Explanatory power of the model: it depends
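A saturated model can be mimicked in R by giving every observation its own factor level. This is a sketch with hypothetical values; the variable names are assumptions for illustration.

```r
# Hypothetical response values (illustrative only)
y  <- c(3, 4, 4, 5, 6, 6, 7)
id <- factor(seq_along(y))    # a separate factor level for every observation

m_sat <- lm(y ~ id)           # saturated model: as many parameters as data points

length(coef(m_sat))           # equals length(y)
df.residual(m_sat)            # no degrees of freedom left
max(abs(residuals(m_sat)))    # essentially zero: the fit is perfect
```

The fit is perfect, but the model merely restates the data, which is why its explanatory power is nil.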
How to choose...
- models are representations of reality that should be both accurate and convenient
- it is impossible to maximize a model’s realism, generality and holism simultaneously
- the principle of parsimony is a vital tool in helping to choose one model over another
- only include an explanatory variable in a model if it significantly improves the fit of the model (or if there are other strong reasons)
- the fact that we went to the trouble of measuring something does not mean we have to have it in our model
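One common way to apply this rule in R is a deletion test: remove a term with update() and compare the two fits with anova(). The sketch assumes the built-in cars data set and a 5% significance threshold, neither of which comes from the slides.

```r
m_full    <- lm(dist ~ speed, data = cars)
m_reduced <- update(m_full, . ~ . - speed)   # drop the term; leaves dist ~ 1

# F test: does removing speed significantly worsen the fit?
comparison <- anova(m_reduced, m_full)
p_value    <- comparison[["Pr(>F)"]][2]

# Keep the term only if its removal causes a significant loss of fit
keep_speed <- p_value < 0.05
keep_speed
```

Repeating such tests, starting from the maximal model, is one route to the minimal adequate model.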
ANOVA
- a technique we use when all explanatory variables are categorical (factors)
- if there is one factor with three or more levels, we use one-way ANOVA (with only two levels a t-test should be preferred; it gives exactly the same answer, since with 2 levels $F = t^2$)
[Figures: TSS1.png, TSS.png, TSS2.png, ESS.png, ESS2.png]
Source | Sum of squares | Degrees of freedom | Mean square | F ratio |
Garden | 31.5 | 1 | 31.5 | 15.75 |
Error | 24.0 | 12 | $s^2 = 2.0$ | |
Total | 55.5 | 13 | | |
- now we need to test whether an F ratio of 15.75 is large or small
- we can use a table or a software package
- here I use software to calculate the cumulative probability
ANOVA
[Figure: sesssion2/img/fdens.png]
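The cumulative probability for an F ratio of 15.75 with 1 and 12 degrees of freedom can be obtained with pf(); the sketch below also computes the 5% critical value with qf(). The variable names are chosen here for illustration.

```r
F_ratio <- 15.75   # F ratio of the garden ANOVA
df1 <- 1           # degrees of freedom of the treatment (garden)
df2 <- 12          # degrees of freedom of the error

# Probability of an F ratio this large or larger if the null hypothesis is true
p_value <- pf(F_ratio, df1, df2, lower.tail = FALSE)

# Critical value that F must exceed for significance at the 5% level
F_crit <- qf(0.95, df1, df2)

c(p_value = p_value, F_crit = F_crit)
```

The resulting p-value corresponds to the Pr(>F) entry that summary() reports for the fitted aov model.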
ANOVA in R
- in R we use the lm() or the aov() command
- together with the formula syntax a ~ b (response ~ explanatory variable)
- we assign the result to a variable
> m2 <- aov(ozone ~ garden, data=oneway)
> m2
                garden Residuals
Sum of Squares    31.5      24.0
Residual standard error: 1.414214
Estimated effects may be unbalanced
> summary(m2)
            Df Sum Sq Mean Sq F value  Pr(>F)
garden       1   31.5    31.5   15.75 0.00186 **
Residuals   12   24.0     2.0
> summary.lm(m2)
    Min      1Q  Median      3Q     Max
        Estimate Std. Error t value Pr(>|t|)
gardenb   3.0000     0.7559   3.969  0.00186 **
Residual standard error: 1.414 on 12 degrees of freedom
Multiple R-squared: 0.5676, Adjusted R-squared: 0.5315
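The `oneway` data frame itself is not shown in the slides. To experiment with the analysis, the sketch below builds a stand-in with made-up ozone values, chosen only so that the sums of squares match the output above (31.5 and 24.0); they are not the original data. It also checks that, with two levels, F equals the square of the t statistic.

```r
# Made-up ozone values chosen to reproduce the sums of squares shown above;
# they are NOT the original 'oneway' data
oneway <- data.frame(
  ozone  = c(3, 4, 4, 5, 6, 6, 7,   6, 7, 7, 8, 9, 9, 10),
  garden = factor(rep(c("a", "b"), each = 7))
)

m2  <- aov(ozone ~ garden, data = oneway)
tab <- summary(m2)[[1]]            # ANOVA table as a data frame
F_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]

# With two levels the classical t-test gives the same answer: F = t^2
tt <- t.test(ozone ~ garden, data = oneway, var.equal = TRUE)
t_value <- unname(tt$statistic)

c(F = F_value, t_squared = t_value^2, p = p_value)
```

Note that var.equal = TRUE is needed for the equivalence; the default Welch t-test does not pool the variances.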
ANOVA Assumptions
- independent, normally distributed errors
- equality of variances (homogeneity of variance)
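Normality and homogeneity of variance can be checked on a fitted model; independence is a matter of study design and cannot be verified this way. The sketch assumes the built-in PlantGrowth data, and the choice of the Shapiro-Wilk and Bartlett tests is one option among several, not something the slides prescribe.

```r
m <- aov(weight ~ group, data = PlantGrowth)

# Normality of the errors: Shapiro-Wilk test on the residuals
normality <- shapiro.test(residuals(m))

# Homogeneity of variances across the factor levels: Bartlett test
homogeneity <- bartlett.test(weight ~ group, data = PlantGrowth)

c(normality = normality$p.value, homogeneity = homogeneity$p.value)

# Graphical checks: residuals vs fitted values and a normal Q-Q plot
# plot(m, which = 1:2)
```

Small p-values would cast doubt on the respective assumption; the diagnostic plots are often more informative than the tests alone.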
Welch ANOVA
- generalization of the Welch t-test to more than two groups
- tests whether the means of the outcome variable differ across the factor levels, without assuming equal variances
- assumes a sufficiently large sample (more than 10 times the number of groups in the calculation; groups of size one are excluded)
- sensitive to outliers (only a few are tolerable)
- the R command is oneway.test()
- non-parametric alternative: kruskal.test()
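A short sketch of both commands, assuming the built-in PlantGrowth data as an example. oneway.test() performs the Welch version by default (var.equal = FALSE); setting var.equal = TRUE gives the classical one-way ANOVA F test.

```r
# Welch ANOVA: does not assume equal variances (the default)
welch <- oneway.test(weight ~ group, data = PlantGrowth)

# Classical one-way ANOVA for comparison
classic <- oneway.test(weight ~ group, data = PlantGrowth, var.equal = TRUE)

# Non-parametric alternative based on ranks
kruskal <- kruskal.test(weight ~ group, data = PlantGrowth)

c(welch = welch$p.value, classic = classic$p.value, kruskal = kruskal$p.value)
```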
Exercises
- Look at the help of the TukeyHSD function. What is its purpose?
- Execute the code of the example near the end of the help page and interpret the results.
- Install and load the granovaGG package (a package for visualizing ANOVAs), load the arousal data frame and use the stack() command to bring the data into long form. Run an ANOVA. Is there a difference between at least two of the groups? If indicated, do a post-hoc test.
- Visualize your results.
Exercises - Solutions
- Look at the help of the TukeyHSD function. What is its purpose?
- Execute the code of the example near the end of the help page and interpret the results.
- Install and load the granovaGG package (a package for visualizing ANOVAs), load the arousal data frame and use the stack() command to bring the data into long form. Run an ANOVA. Is there a difference between at least two of the groups? If indicated, do a post-hoc test.
> require(granovaGG)
> data(arousal)
> datalong <- stack(arousal)
> m1 <- aov(values ~ ind, data = datalong)
> summary(m1)
            Df Sum Sq Mean Sq F value   Pr(>F)
ind          3  273.4   91.13   10.51 4.17e-05 ***
Residuals   36  312.3    8.68
> TukeyHSD(m1)
  Tukey multiple comparisons of means
  diff lwr upr p adj
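For the first two exercise items: TukeyHSD() creates a set of confidence intervals on the differences between the means of the levels of a factor, with a specified family-wise probability of coverage. The example from its help page, which uses the built-in warpbreaks data, can be run as sketched below; the object name `tk` is chosen here for illustration.

```r
# Two-factor ANOVA on the built-in warpbreaks data
fm1 <- aov(breaks ~ wool + tension, data = warpbreaks)
summary(fm1)

# Pairwise comparisons between the tension levels
tk <- TukeyHSD(fm1, "tension", ordered = TRUE)
tk

# Columns: difference in means, lower and upper bound, adjusted p-value
colnames(tk$tension)

# plot(tk) draws the intervals
```

Pairs of levels whose interval excludes zero differ significantly at the chosen family-wise level.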
- Visualize your results
[Figure: sesssion2/img/aovgr1.png]
> granovagg.1w(datalong$values, group = datalong$ind)
  group    group.mean trimmed.mean contrast variance standard.deviation
4 Placebo       20.43        20.30    -3.65     5.83               2.41
3 Drug.B        23.82        23.85    -0.26     7.50               2.74
1 Drug.A        24.27        24.45     0.19     7.89               2.81
2 Drug.A.B      27.81        27.52     3.73    13.49               3.67
4 10
3 10
1 10
2 10
Below is a linear model summary of your input data
    Min      1Q  Median      3Q     Max
             Estimate Std. Error t value Pr(>|t|)
groupPlacebo  -3.8400     1.3172  -2.915  0.00608 **
Residual standard error: 2.945 on 36 degrees of freedom
Multiple R-squared: 0.4668, Adjusted R-squared: 0.4223
[Figure: sesssion2/img/aovgr2.png]