welcome: please sign in

Seiteninhalt hochladen

Sie können für die unten genannte Seite Inhalt hochladen. Wenn Sie den Seitennamen ändern, können Sie auch Inhalt für eine andere Seite hochladen. Wenn der Seitenname leer ist, leiten wir den Seitennamen vom Dateinamen ab.

Datei, aus der der Seiteninhalt geladen wird
Seitenname
Kommentar

Revision 2 vom 2015-05-02 08:13:10

location: RstatisTik / RstatisTikPortal / RcourSe / CourseOutline / TestsInR

Classical Tests

Exercises

Exercises - Solutions

   1 > sumdf <- data %>%
   2 +     group_by(Subject,Sex,Age_PRETEST,testid) %>%
   3 +     summarise(count=n(),
   4 +               n.corr = sum(Stim.Type=="hit"),
   5 +               perc.corr = n.corr/count,
   6 +               mean.ttime = mean(TTime),
   7 +               sd.ttime = sd(TTime),
   8 +               se.ttime = sd.ttime/sqrt(count))
   9 > head(sumdf)
  10 Subject Sex Age_PRETEST testid count n.corr perc.corr mean.ttime 
  11 1       1   f        3.11  test1    95     63 0.6631579   8621.674 
  12 2       1   f        3.11      1    60     32 0.5333333   9256.367 
  13 3       1   f        3.11      2    59     32 0.5423729   9704.712 
  14 4       1   f        3.11      3    60     38 0.6333333  14189.550 
  15 5       1   f        3.11      4    59     31 0.5254237  13049.831 
  16 6       1   f        3.11      5    59     33 0.5593220  14673.525 
  17 Variables not shown: sd.ttime se.ttime (dbl)  

four possible situations

Situation

H_0 is true

H_0 is false

Conclusion

H_0 is not rejected

Correct decision

Type II error

H_0 is rejected

Type I error

Correct decision

Common symbols

n

number of observations (sample size)

K

number of samples (each having n elements)

s

standard deviation (sample)

r

sample correlation coefficient

Z

standard normal deviate

Alternatives

<img alt='sesssion2/twosided.pdf}' src='-1' /> <img alt='sesssion2/greater.pdf}' src='-1' /> <img alt='sesssion2/less.pdf}' src='-1' />

Alternatives

The p-value is the probability of the sample estimate (of the respective estimator) under the null.

Z-test for a population mean

The z-test is a something like a t-test (it is like you would know almost everything about the perfect conditions. It uses the normal distribution as test statistic and is therefore a good example. To investigate the significance of the difference between an assumed population mean #!latex \mu_0 and a sample mean #!latex \bar{x}. * It is necessary that the population variance #!latex \sigma^2 is known. * The test is accurate if the population is normally distributed. If the population is not normal, the test will still give an approximate guide.

Z-test for a population mean

* Write a function which takes a vector, the population standard deviation and the population mean as arguments and which gives the Z score as result.

* add a line to your function that allows you to process numeric vectors containing missing values! * the function pnorm(Z) gives the probability of

latex error! exitcode was 1 (signal 0), transscript follows:

This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2019/dev/Debian) (preloaded format=latex)
entering extended mode
(./latex_3804b1112b760080c8c10d0951a569ca7c0b3197_p.tex
LaTeX2e <2018-12-01>
(/usr/share/texlive/texmf-dist/tex/latex/base/article.cls
Document Class: article 2018/09/03 v1.4i Standard LaTeX document class
(/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo))
(/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty
(/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty))
(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty
For additional information on amsmath, use the `?' option.
(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty
(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty))
(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty)
(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty))
(/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty)
(/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty)
No file latex_3804b1112b760080c8c10d0951a569ca7c0b3197_p.aux.
! You can't use `macro parameter character #' in horizontal mode.
l.14 {{{#
         !highlight r
[1] (./latex_3804b1112b760080c8c10d0951a569ca7c0b3197_p.aux) )
(\end occurred inside a group at level 3)

### simple group (level 3) entered at line 14 ({)
### simple group (level 2) entered at line 14 ({)
### simple group (level 1) entered at line 14 ({)
### bottom level
(see the transcript file for additional information)
Output written on latex_3804b1112b760080c8c10d0951a569ca7c0b3197_p.dvi (1 page,
 944 bytes).
Transcript written on latex_3804b1112b760080c8c10d0951a569ca7c0b3197_p.log.

Z-test for a population mean

Add a line to your function that allows you to also process numeric vectors containing missing values!

   1 > ztest <- function(x,x.sd,mu=0){
   2 +     x <- x[!is.na(x)]
   3 +     if(length(x) < 3) stop("too few values in x")
   4 +     sqrt(length(x)) * (mean(x)-mu)/x.sd
   5 + }

Z-test for a population mean

The function pnorm(Z) gives the probability of

latex error! exitcode was 1 (signal 0), transscript follows:

This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2019/dev/Debian) (preloaded format=latex)
entering extended mode
(./latex_0d342107de9429eef76138cc515ca3fe22cb439b_p.tex
LaTeX2e <2018-12-01>
(/usr/share/texlive/texmf-dist/tex/latex/base/article.cls
Document Class: article 2018/09/03 v1.4i Standard LaTeX document class
(/usr/share/texlive/texmf-dist/tex/latex/base/size12.clo))
(/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amssymb.sty
(/usr/share/texlive/texmf-dist/tex/latex/amsfonts/amsfonts.sty))
(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsmath.sty
For additional information on amsmath, use the `?' option.
(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amstext.sty
(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsgen.sty))
(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsbsy.sty)
(/usr/share/texlive/texmf-dist/tex/latex/amsmath/amsopn.sty))
(/usr/share/texlive/texmf-dist/tex/latex/amscls/amsthm.sty)
(/usr/share/texlive/texmf-dist/tex/latex/base/inputenc.sty)
No file latex_0d342107de9429eef76138cc515ca3fe22cb439b_p.aux.
! You can't use `macro parameter character #' in vertical mode.
l.10 {{{#
         !highlight r

Overfull \hbox (7.40698pt too wide) in paragraph at lines 10--20
[]\OT1/cmr/m/n/12 !highlight r > ztest <- func-tion(x,x.sd,mu=0) + x <- x[!is.n
a(x)] + if(length(x)
[1] (./latex_0d342107de9429eef76138cc515ca3fe22cb439b_p.aux) )
(\end occurred inside a group at level 3)

### simple group (level 3) entered at line 10 ({)
### simple group (level 2) entered at line 10 ({)
### simple group (level 1) entered at line 10 ({)
### bottom level
(see the transcript file for additional information)
Output written on latex_0d342107de9429eef76138cc515ca3fe22cb439b_p.dvi (1 page,
 480 bytes).
Transcript written on latex_0d342107de9429eef76138cc515ca3fe22cb439b_p.log.

Z-test for a population mean

Now let the result be a named vector containing the estimated difference, Z, p and the n.

   1 > ztest <- function(x,x.sd,mu=0){
   2 +     x <- x[!is.na(x)]
   3 +     if(length(x) < 3) stop("too few values in x")
   4 +     est.diff <- mean(x)-mu
   5 +     z <- sqrt(length(x)) * (est.diff)/x.sd
   6 +     round(c(diff=est.diff,Z=z,pval=2*pnorm(-abs(z)),n=length(x)),4)
   7 + }
   8 > set.seed(1)
   9 > ztest(rnorm(100),x.sd = 1)
  10 diff        Z     pval        n 

Z-test for a population mean

* Z-test for two population means (variances known and equal) * Z-test for two population means (variances known and unequal) To investigate the statistical significance of the difference between an assumed population mean #!latex \mu_0 and a sample mean #!latex \bar{x}. There is a function z.test() in the BSDA package. * It is necessary that the population variance #!latex \sigma^2 is known. * The test is accurate if the population is normally distributed. If the population is not normal, the test will still give an approximate guide.

Simulation Exercises

* Now sample 100 values from a Normal distribution with mean 10 and standard deviation 2 and use a z-test to compare it against the population mean 10. What is the p-value? * Now do the sampling and the testing 1000 times, what would be the number of statistically significant results? Use replicate() (which is a wrapper of tapply()) or a for() loop! Record at least the p-values and the estimated differences! Use table() to count the p-vals below 0.05. What type of error do you associate with it? What is the smallest absolute difference with a p-value below 0.05? * Repeat the simulation above, change the sample size to 1000 in each of the 1000 samples! How many p-values below 0.05? What is now the smallest absolute difference with a p-value below 0.05?

Simulation Exercises -- Solutions

   1 > ztest(rnorm(100,mean=10,sd=2),x.sd=2,mu=10)["pval"]
   2 pval 
   3 > ztest(rnorm(100,mean=10,sd=2),x.sd=2,mu=10)["diff"]
   4 diff 
   5 > ztest(rnorm(100,mean=10,sd=2),x.sd=2,mu=10)[c("pval","diff")]
   6 pval   diff 

Simulation Exercises -- Solutions

using replicate()\footnotesize

   1 > res <- replicate(1000, ztest(rnorm(100,mean=10,sd=2),x.sd=2,mu=10))
   2 > res <- as.data.frame(t(res))
   3 > head(res)
   4 diff       Z   pval   n
   5 1 -0.2834 -1.4170 0.1565 100
   6 2  0.2540  1.2698 0.2042 100
   7 3 -0.1915 -0.9576 0.3383 100
   8 4  0.1462  0.7312 0.4646 100
   9 5  0.1122  0.5612 0.5747 100
  10 6 -0.0141 -0.0706 0.9437 100

Simulation Exercises -- Solutions

using replicate() II\footnotesize

   1 > res <- replicate(1000, ztest(rnorm(100,mean=10,sd=2),x.sd=2,mu=10),
   2 +                        simplify = F)
   3 > res <- as.data.frame(Reduce(rbind,res))
   4 > head(res)
   5 diff       Z   pval   n
   6 init -0.0175 -0.0874 0.9304 100

Simulation Exercises -- Solutions

using for() \scriptsize

   1 > res <- matrix(numeric(2000),ncol=2)
   2 > for(i in seq.int(1000)){
   3 +     res[i,] <- ztest(rnorm(100,mean=10,sd=2),x.sd=2,mu=10)[c("pval","diff")] }
   4 > res <- as.data.frame(res)
   5 > names(res) <- c("pval","diff")
   6 > head(res)
   7 pval    diff
   8 1 0.0591 -0.3775
   9 2 0.2466  0.2317
  10 3 0.6368  0.0944
  11 4 0.5538 -0.1184
  12 5 0.9897 -0.0026
  13 6 0.7748  0.0572

Simulation Exercises -- Solutions

   1 > table(res$pval < 0.05)
   2 FALSE  TRUE 
   3 960    40 
   4 > tapply(abs(res$diff),res$pval < 0.05,summary)
   5 > min(abs(res$diff[res$pval<0.05]))
   6 [1] 0.3928

Simulation Exercises -- Solutions

   1 > res2 <- replicate(1000, ztest(rnorm(1000,mean=10,sd=2),
   2 +                              x.sd=2,mu=10))
   3 > res2 <- as.data.frame(t(res2))
   4 > head(res2)
   5 diff       Z   pval    n
   6 1 -0.0731 -1.1559 0.2477 1000
   7 2  0.0018  0.0292 0.9767 1000
   8 3  0.0072  0.1144 0.9089 1000
   9 4 -0.1145 -1.8100 0.0703 1000
  10 5 -0.1719 -2.7183 0.0066 1000
  11 6  0.0880  1.3916 0.1640 1000

Simulation Exercises -- Solutions

   1 > table(res2$pval < 0.05)
   2 FALSE  TRUE 
   3 946    54 
   4 > tapply(abs(res2$diff),res$pval < 0.05,summary)

Simulation Exercises Part II

* Concatenate the both resulting data frames from above using rbind() * Plot the distributions of the pvals and the difference per sample size. Use ggplot2 with an appropriate geom (density/histogram) * What is the message?

Simulation Exercises -- Solutions

   1 > res <- rbind(res,res2)  
   2 > require(ggplot2)
   3 > ggplot(res,aes(x=pval)) +
   4 +     geom_histogram(bin=0.1,fill="forestgreen") +
   5 +     facet_grid(~ n)
   6 > ggsave("hist.png")

Simulation Exercises -- Solutions

<img alt='sesssion2/hist.png' src='-1' />

Simulation Exercises -- Solutions

   1 > ggplot(res,aes(x=diff,colour=factor(n))) +
   2 +     geom_density(size=3)
   3 > ggsave("dens.png")

Simulation Exercises -- Solutions

<img alt='sesssion2/dens.png' src='-1' />

Simulation Exercises -- Solutions

<img alt='sesssion2/point.png' src='-1' />

Simulation Exercises -- Solutions

<img alt='sesssion2/dens2d.png' src='-1' />

   1 > set.seed(1)
   2 > x <- rnorm(12)
   3 > t.test(x,mu=0) ## population mean 0
   4 t = 1.1478, df = 11, p-value = 0.2754
   5 alternative hypothesis: true mean is not equal to 0
   6 95 percent confidence interval:
   7 sample estimates:
   8 mean of x 

   1 > t.test(x,mu=1) ## population mean 1
   2 t = -3.125, df = 11, p-value = 0.009664
   3 alternative hypothesis: true mean is not equal to 1
   4 95 percent confidence interval:
   5 sample estimates:
   6 mean of x 

t-tests

A t-test is any statistical hypothesis test in which the test statistic follows a \emph{Student's t distribution} if the null hypothesis is supported.

One Sample t-test

Two Sample t-tests

There are two ways to perform a two sample t-test in R:

Two Sample t-tests: two vector syntax

   1 > set.seed(1)
   2 > x <- rnorm(12)
   3 > y <- rnorm(12)
   4 > g <- sample(c("A","B"),12,replace = T)
   5 > t.test(x,y)
   6 t = 0.5939, df = 20.012, p-value = 0.5592
   7 alternative hypothesis: true difference in means is not equal to 0
   8 95 percent confidence interval:
   9 sample estimates:
  10 mean of x  mean of y 

Two Sample t-tests: formula syntax

   1 > t.test(x ~ g)
   2 t = -0.6644, df = 6.352, p-value = 0.5298
   3 alternative hypothesis: true difference in means is not equal to 0
   4 95 percent confidence interval:
   5 sample estimates:
   6 mean in group A mean in group B 

Welch/Satterthwaite vs. Student

Student's t-test

   1 > t.test(x, y, var.equal = T)
   2 t = 0.5939, df = 22, p-value = 0.5586
   3 alternative hypothesis: true difference in means is not equal to 0
   4 95 percent confidence interval:
   5 sample estimates:
   6 mean of x  mean of y 

t-test

Exercises

* use a t-test to compare TTime according to Stim.Type, visualize it. What is the problem? * now do the same for Subject 1 on pre and post test (use filter() or indexing to get the resp. subsets) * use the following code to do the test on every subset Subject and testid, try to figure what is happening in each step:\tiny

   1 tob <- t.test(x$TTime ~ x$Stim.Type)
   2 tmp <- data.frame(
   3 Subject = unique(x$Subject),
   4 testid = unique(x$testid),
   5 pval = tob$p.value,
   6 alternative = tob$alternative,
   7 res <- Reduce(rbind,tmp.l)

* make plots to visualize the results. * how many tests have a statistically significant result? How many did you expect? Is there a tendency? What could be the next step?

Exercises - Solutions

   1 > t.test(data$TTime ~ data$Stim.Type)
   2 t = -6.3567, df = 9541.891, p-value = 2.156e-10
   3 alternative hypothesis: true difference in means is not equal to 0
   4 95 percent confidence interval:
   5 sample estimates:
   6 mean in group hit mean in group incorrect 
   7 > ggplot(data,aes(x=Stim.Type,y=TTime)) +
   8 +     geom_boxplot()

Exercises - Solutions

   1 > t.test(data$TTime[data$Subject==1 & data$testid=="test1"] ~
   2 +        data$Stim.Type[data$Subject==1 & data$testid=="test1"])
   3 t = -0.5846, df = 44.183, p-value = 0.5618
   4 alternative hypothesis: true difference in means is not equal to 0
   5 95 percent confidence interval:
   6 sample estimates:
   7 mean in group hit mean in group incorrect 
   8 > t.test(data$TTime[data$Subject==1 & data$testid=="test2"] ~
   9 +        data$Stim.Type[data$Subject==1 & data$testid=="test2"])
  10 t = -1.7694, df = 47.022, p-value = 0.08332
  11 alternative hypothesis: true difference in means is not equal to 0
  12 95 percent confidence interval:
  13 sample estimates:
  14 mean in group hit mean in group incorrect 

Exercises - Solutions

   1 > ggplot(data,aes(x=testid,y=TTime)) +
   2 +    geom_boxplot(aes(fill=Stim.Type)) +
   3 +    facet_wrap(~Subject)
   4 > ggplot(data,aes(x=factor(Subject),y=TTime)) +
   5 +    geom_boxplot(aes(fill=Stim.Type)) +
   6 +    facet_wrap(~testid)

Exercises - Solutions

   1 > table(res$pval < 0.05)
   2 FALSE  TRUE 
   3 165    22 
   4 > prop.table(table(res$pval < 0.05))
   5 FALSE      TRUE 
   6 > 

Exercises - Solutions

   1 tob <- t.test(x$TTime ~ x$Stim.Type)
   2 tmp <- data.frame(
   3 Subject = unique(x$Subject),
   4 testid = unique(x$testid),
   5 pval = tob$p.value,
   6 alternative = tob$alternative,
   7 res <- Reduce(rbind,tmp.l)