Unterschiede zwischen den Revisionen 2 und 4 (über 2 Versionen hinweg)

Introduction

Every function in R has three important characteristics:

a body (the code inside the function) - body()
arguments (the list of arguments which controls how you can call the function) - formals()
an environment (the “map” of the location of the function’s variables) - environment()

You can see all three parts if you type the name of the function without primitives. Exceptions are brackets. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.

Functions

   1 > chisq.test
   2 function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)),
   3 DNAME <- deparse(substitute(x))
   4 if (is.data.frame(x))
   5 expected = E, residuals = (x - E)/sqrt(E), stdres = (x -
   6 > sum
   7 function (..., na.rm = FALSE)  .Primitive("sum")

Function Arguments

Arguments are matched

first by exact name (perfect matching)
then by prefix matching
and finally by position.

By default, R function arguments are lazy, they are only evaluated if they are actually used:

   1 > f <- function(x) {
   2 f <- function(x) {
   3 +   10
   4 + }
   5 > f(stop("This is an error!"))
   6 [1] 10
   7 >

Implicit Loops

Introduction

A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure. In R this is done by the functions:

lapply()
sapply()
apply()
tapply()

lapply()

The functions lapply and sapply are similar, their first argument can be a list, data frame, matrix or vector, the second argument the function to "apply". The former return a list (hence "l") and the latter tries to simplify the results (hence the "s"). For example:

   1 > lapply(dat,mean)
   2 [1] 6753.636
   3 [1] 5433.182
   4 > sapply(dat,mean)

apply()

apply() this function can be applied to an array. Its argument is the array, the second the dimension/s where we want to apply a function and the third is the function. For example

   1 > x<-1:12
   2 > dim(x)<-c(2,2,3)
   3 > apply(x,3,quantile) ## calculate the quantiles

tapply()

The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.

For example in the quine data frame, we can summarize Days classify by Eth and Lrn as follows:

   1 > tapply(Days,list(Eth,Lrn),mean)
   2 AL       SL
   3 A 18.57500 24.89655
   4 N 13.25581 10.82353

the class() function shows the class of an object use it in combination with lapply() to get the classes of the columns of the quine data frame
do the same with sapply() what is the difference
try to combine this with what you learned about indexing and create a new data frame quine2 only containing the columns which are factors
calculate the row and column means of the below defined matrix m using the apply function PS: in real life application use the rowMeans() and colMeans() function

   1 m <- matrix(rnorm(100),nrow=10)

use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines) * sometimes the aggregate() function is more convenient; note the use of #!latex $\sim$; it is read as 'is dependent on'and it is extensively used in modelling

   1 > aggregate(Days ~ Sex + Eth, data=quine,mean)
   2 Sex Eth     Days
   3 1   F   A 20.92105
   4 2   M   A 21.61290
   5 3   F   N 10.07143
   6 4   M   N 14.71429
   7 > aggregate(Days ~ Sex + Eth, data=quine,summary)
   8 Sex Eth Days.Min. Days.1st Qu. Days.Median Days.Mean Days.3rd Qu. Days.Max.
   9 1   F   A      0.00         5.25       13.50     20.92        30.25     81.00
  10 2   M   A      2.00         9.50       16.00     21.61        33.00     57.00
  11 3   F   N      0.00         5.00        7.00     10.07        14.00     37.00
  12 4   M   N      0.00         3.50        8.00     14.71        19.50     69.00

Function Exercises (Verzani)

Write a function to compute the average distance from the mean for some data vector.
Write a function f() which finds the average of the x values after squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing f(1:10)
An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x \%\% 2. Use this to write a function iseven(). How would you write isodd()?
Write a function isprime() that checks if a number x is prime by dividing x by all values from 2,...,x-1 then checking to see if there is a remainder of 0.

Function Exercises (Verzani) Solutions

Write a function to compute the average distance from the mean for some data vector.

   1 > avg.dist <- function(x){
   2 +     xbar <- mean(x)
   3 +     mean(abs(x-xbar))
   4 + }

Function Exercises (Verzani) Solutions

Write a function f() which finds the average of the x values aufter squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)}

   1 > f <- function(x){
   2 +     mean(x**2) - mean(x)**2
   3 + }
   4 > f(1:10)
   5 [1] 8.25

Function Exercises (Verzani) Solutions

An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()?

   1 > iseven <- function(x){
   2 +     x %% 2 == 0
   3 + }
   4 > iseven(1:10)
   5 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
   6 > isodd <- function(x){
   7 +     !iseven(x)
   8 + }
   9 > isodd(1:10)
  10 [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

Function Exercises (Verzani) Solutions

Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0.

   1 > isprime <- function(x){
   2 +     if(x == 2) return(TRUE)
   3 +     !(0 %in% (x %% (2:(x-1))))
   4 + }
   5 > isprime(2)
   6 [1] TRUE
   7 > isprime(5)
   8 [1] TRUE
   9 > isprime(15)
  10 [1] FALSE

RstatisTik/RstatisTikPortal/RcourSe/CourseOutline/FunctionsInR/ApplyR (zuletzt geändert am 2015-05-01 10:48:36 durch mandy.vogel@googlemail.com)

-  ⇤ ← Revision 2 vom 2015-05-01 08:25:45 → 
  Größe: 32164
  Autor: mandy.vogel@googlemail.com
  Kommentar:
+   ← Revision 4 vom 2015-05-01 09:10:11 → ⇥
  Größe: 6351
  Autor: mandy.vogel@googlemail.com
  Kommentar:
-Gelöschter Text ist auf diese Art markiert.
+Hinzugefügter Text ist auf diese Art markiert.
 Zeile 2:
-== Implicit Loops ==
A common application of loops is  to apply a function to each element of a set of values and collect the results in a single structure.
In R this is done by the functions:
   * lapply()
   * sapply()
   * apply()
   * tapply()
+Every function in R has three important characteristics:

 * a body (the code inside the function) - body()
 * arguments (the list of arguments which controls how you can call the function) - formals()
 * an environment (the “map” of the location of the function’s variables) - environment()

You can see all three parts if you type the name of the function without primitives. Exceptions are brackets. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.

== Functions ==
{{{#!highlight r
> chisq.test
function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)),
DNAME <- deparse(substitute(x))
if (is.data.frame(x))
expected = E, residuals = (x - E)/sqrt(E), stdres = (x -
> sum
function (..., na.rm = FALSE)  .Primitive("sum")
}}}
== Function Arguments ==
Arguments are matched

 * first by exact name (perfect matching)
 * then by prefix matching
 * and finally by position.

By default, R function arguments are lazy, they are only evaluated if they are actually used:

{{{#!highlight r
> f <- function(x) {
f <- function(x) {
+   10
+ }
> f(stop("This is an error!"))
[1] 10
>
}}}



= Implicit Loops =
== Introduction ==

A common application of loops is  to apply a function to each element of a set of values and collect the results in a single structure. In R this is done by the functions:

 * lapply()
 * sapply()
 * apply()
 * tapply()
-Zeile 10:
+Zeile 52:
-   * The functions lapply and sapply are similar, their first argument can be a list, data frame, matrix or vector, the second argument the function to "apply". The former return a list (hence "l") and the latter tries to simplify the results (hence the "s").  For example:
+ * The functions lapply and sapply are similar, their first argument can be a list, data frame, matrix or vector, the second argument the function to "apply". The former return a list (hence "l") and the latter tries to simplify the results (hence the "s").  For example:
-Zeile 18:
+Zeile 61:
-   * apply() this function can be applied to an array. Its argument is the array, the second the dimension/s where we want to apply a function and the third is the function. For example
+ * apply() this function can be applied to an array. Its argument is the array, the second the dimension/s where we want to apply a function and the third is the function. For example
-Zeile 22:
+Zeile 66:
 > apply(x,3,quantile) ## calculate the quantiles
-Zeile 25:
+Zeile 69:
-   * The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.
+ * The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.
-Zeile 27:
+Zeile 72:
-Zeile 33:
+Zeile 79:
-==  ==
*  the class() function shows the class of an object use it in combination with lapply() to get the  classes of the columns of the quine data frame
*  do the same with sapply()  what is the difference
* try to combine this with what you learned about indexing and create a new data frame quine2 only containing the columns which are factors
*  calculate the row and column means of the below defined matrix m using the apply function PS: in real life application use the rowMeans() and colMeans() function
+ *  the class() function shows the class of an object use it in combination with lapply() to get the  classes of the columns of the quine data frame 
 *  do the same with sapply()  what is the difference 
 * try to combine this with what you learned about indexing and create a new data frame quine2 only containing the columns which are factors 
 *  calculate the row and column means of the below defined matrix m using the apply function PS: in real life application use the rowMeans() and colMeans() function
-Zeile 39:
+Zeile 85:
 m <- matrix(rnorm(100),nrow=10)
-Zeile 41:
+Zeile 87:
-*  use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines)
*  sometimes the aggregate() function is more convenient; note the use of {{{ #!latex $\sim$;}}} it is read as 'is dependent on'and it is extensively used in modelling
+ *  use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines) *  sometimes the aggregate() function is more convenient; note the use of {{{ #!latex $\sim$;}}} it is read as 'is dependent on'and it is extensively used in modelling
-Zeile 55:
+Zeile 101:
-M   N      0.00         3.50        8.00     14.71        19.50     69.00  
}}}
== Functions ==
Every function in R has three important characteristics:
   * a body (the code inside the function) - body()
   * arguments (the list of arguments which controls how you can call the function) - formals()
   * an environment (the “map” of the location of the function’s variables) - environment()
You can see all three parts if you type the name of the function without primitives. Exceptions are brackets. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.
== Functions ==
{{{#!highlight r
> chisq.test
function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), 
DNAME <- deparse(substitute(x))
if (is.data.frame(x)) 
expected = E, residuals = (x - E)/sqrt(E), stdres = (x - 
> sum
function (..., na.rm = FALSE)  .Primitive("sum")
}}}
== Function Arguments ==
Arguments are matched 
   * first by exact name (perfect matching)
   * then by prefix matching
   * and finally by position.
By default, R function arguments are lazy, they are only evaluated if they are actually used:
{{{#!highlight r
> f <- function(x) {
f <- function(x) {
+   10
+ }
> f(stop("This is an error!"))
[1] 10
>
+M   N      0.00         3.50        8.00     14.71        19.50     69.00
-Zeile 89:
+Zeile 104:
-* Write a function to compute the average distance from the mean for some data vector.
* Write a function f() which finds the average of the x values after squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)}
* An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()?
* Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0. 
== Function Exercises (Verzani) ==
   * Write a function to compute the average distance from the mean for some data vector.
+ * Write a function to compute the average distance from the mean for some data vector. 
 * Write a function f() which finds the average of the x values after squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing f(1:10) 
 * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x \%\% 2. Use this to write a function iseven(). How would you write isodd()? 
 * Write a function isprime() that checks if a number x is prime by dividing x by all values from 2,...,x-1 then checking to see if there is a remainder of 0.

== Function Exercises (Verzani) Solutions ==
 * Write a function to compute the average distance from the mean for some data vector.
-Zeile 99:
+Zeile 116:
 + }
-Zeile 101:
+Zeile 118:
-== Function Exercises (Verzani) ==
   * Write a function f() which finds the average of the x values aufter squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)}
+== Function Exercises (Verzani) Solutions ==
 * Write a function f() which finds the average of the x values aufter squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)}
-Zeile 108:
+Zeile 126:
 [1] 8.25
-Zeile 110:
+Zeile 128:
-== Function Exercises (Verzani) ==
   * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()?
+== Function Exercises (Verzani) Solutions ==
 * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()?
-Zeile 124:
+Zeile 143:
-== Function Exercises (Verzani) ==
   * Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0.
+== Function Exercises (Verzani) Solutions ==
 * Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0.
-Zeile 136:
+Zeile 156:
 [1] FALSE
-Zeile 138:
+Zeile 158:
-== Read in the file ==
{{{#!highlight r
> file <- "../session1/session1data/pre001.txt"
> skip <- 3
> tmp <- read.table(file,skip = skip,sep = "\t",
+                   header=T,na.strings = c(" +",""),
+                   fill=T)
}}}
== Remove empty line ==
{{{#!highlight r
> tmp <- tmp[!is.na(tmp$Subject),] 
}}}
== Remove spaces ==
Remove unnecessary spaces from character vectors/factors
{{{#!highlight r
> tmp <- lapply(tmp,function(x) {
+         if( class(x) %in% c("character","factor") ){
+             x <- factor(gsub(" ","",as.character(x)))
+             return(x)}else{ return(x) }})
> tmp <- as.data.frame(tmp)
}}}
== Find/Remove breaks ==
{{{#!highlight r
> if(length(pause)>0){
+     drei <- which(tmp$Code==3 & !is.na(tmp$Code))
+     drei <- drei[drei > pause][1:2]
+     if(pause + 1 < drei[1]){
+         tmp <- tmp[-(pause:drei[2]),]
+     }}
> tmp <- tmp[!(tmp$Event.Type %in% c("Pause","Resume")), ]
}}}
== Find/Remove first/last rows ==
{{{#!highlight r
> first.pic <- min(which(tmp$Event.Type=="Picture" & 
+                           !is.na(tmp$Event.Type) )) - 1 
> tmp <- tmp[-(1:first.pic),]
> last.pic <- min(which(tmp$Event.Type=="Picture" & 
+                           !is.na(tmp$Event.Type) &
+                           tmp$Code=="Fertig!" & 
+                           !is.na(tmp$Code)))
> tmp <- tmp[-(last.pic:nrow(tmp)),]
}}}
== Extract Responses ==
{{{#!highlight r
> zeilen <- which(tmp$Event.Type %in% c("Response"))
> zeilen <- sort(unique(c(zeilen,zeilen-1)))
> zeilen <- zeilen[zeilen>0]
> tmp <- tmp[zeilen,]
}}}
== Extract Responses ==
{{{#!highlight r
> responses <- which(tmp$Code %in% c(1,2))
> events <- responses-1
> tmp$Type <- NA
> tmp$Type[responses] <- as.character(tmp$Event.Type[events])
> head(tmp)
Subject Trial Event.Type     Code   Time TTime Uncertainty Duration
6   PRE001     7    Picture RO09.jpg 168954     0           1    10197
7   PRE001     7   Response        2 178963 10009           1       NA
11  PRE001    12    Picture RO20.jpg 230338     0           1     8398
12  PRE001    12   Response        1 238680  8342           1       NA
16  PRE001    17    Picture RS28.jpg 289723     0           1     8198
17  PRE001    17   Response        2 297789  8066           1       NA
6              2       0   next incorrect          7    <NA>
7             NA      NA   <NA>      <NA>         NA Picture
11             2       0   next incorrect         12    <NA>
12            NA      NA   <NA>      <NA>         NA Picture
16             2       0   next       hit         17    <NA>
17            NA      NA   <NA>      <NA>         NA Picture
}}}
== Moving Information ==
Moving all (necessary) information to the response lines.
{{{#!highlight r
> tmp$Event.Code <- NA
> tmp$Event.Code[responses] <- as.character(tmp$Code[events])
> tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events])
> tmp$Duration[responses] <- as.character(tmp$Duration[events])
> tmp$Uncertainty.1[responses] <- as.character(tmp$Uncertainty.1[events])
> tmp$ReqTime[responses] <- as.character(tmp$ReqTime[events])
> tmp$ReqDur[responses] <- as.character(tmp$ReqDur[events])
> tmp$Pair.Index[responses] <- as.character(tmp$Pair.Index[events])
> tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events])
}}}
== Moving Information ==
{{{#!highlight r
> head(tmp)
Subject Trial Event.Type     Code   Time TTime Uncertainty Duration
6   PRE001     7    Picture RO09.jpg 168954     0           1    10197
7   PRE001     7   Response        2 178963 10009           1    10197
11  PRE001    12    Picture RO20.jpg 230338     0           1     8398
12  PRE001    12   Response        1 238680  8342           1     8398
16  PRE001    17    Picture RS28.jpg 289723     0           1     8198
17  PRE001    17   Response        2 297789  8066           1     8198
6              2       0   next incorrect          7    <NA>       <NA>
7              2       0   next incorrect          7 Picture   RO09.jpg
11             2       0   next incorrect         12    <NA>       <NA>
12             2       0   next incorrect         12 Picture   RO20.jpg
16             2       0   next       hit         17    <NA>       <NA>
17             2       0   next       hit         17 Picture   RS28.jpg
}}}
== Keep response lines ==
{{{#!highlight r
> tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),]
> tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),]
> head(tmp)
Subject Trial Event.Type Code   Time TTime Uncertainty Duration
7   PRE001     7   Response    2 178963 10009           1    10197
12  PRE001    12   Response    1 238680  8342           1     8398
17  PRE001    17   Response    2 297789  8066           1     8198
22  PRE001    22   Response    1 351321 10811           1    10997
27  PRE001    27   Response    2 403607   713           1      800
32  PRE001    32   Response    1 467793 23709           1    23794
7              2       0   next incorrect          7 Picture   RO09.jpg
12             2       0   next incorrect         12 Picture   RO20.jpg
17             2       0   next       hit         17 Picture   RS28.jpg
22             2       0   next       hit         22 Picture   AT26.jpg
27             2       0   next       hit         27 Picture   RS23.jpg
32             2       0   next       hit         32 Picture   OF04.jpg
}}}
== The Function ==
   * it would be a tedious work to every step for all of the files
   * if we look through the steps the only important thing that we have to change is the file name
   * so we rather use a canned version of our procedure dependend the file name and the number of lines to skip  - we create a function read.file(file):
{{{#!highlight r
tmp <- read.table(file,skip = skip,sep = "\t",
}}}
== The Function (continued) ==
{{{#!highlight r
tmp <- tmp[!is.na(tmp$Subject),] 
tmp <- lapply(tmp,function(x) {
x <- factor(gsub(" ","",as.character(x)))
tmp <- as.data.frame(tmp)
}}}
== The Function (continued) ==
{{{#!highlight r
pause <- which(tmp$Event.Type=="Picture" & tmp$Code=="Pause")
drei <- which(tmp$Code==3 & !is.na(tmp$Code))
drei <- drei[drei > pause][1:2]
tmp <- tmp[-(pause:drei[2]),]
tmp <- tmp[!(tmp$Event.Type %in% c("Pause","Resume")), ]
}}}
== The Function (continued) ==
{{{#!highlight r
tmp <- tmp[-(1:first.pic),]
tmp <- tmp[-(last.pic:nrow(tmp)),]
}}}
== The Function (continued) ==
{{{#!highlight r
zeilen <- which(tmp$Event.Type %in% c("Response"))
zeilen <- sort(unique(c(zeilen,zeilen-1)))
zeilen <- zeilen[zeilen>0]
tmp <- tmp[zeilen,]
responses <- which(tmp$Code %in% c(1,2))
events <- responses-1
}}}
== The Function (continued) ==
{{{#!highlight r
tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),]
tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),]
}}}
== The Function (continued) ==
   * we can use this function now to read in the file 
   * and get the processed data frame in one step
   * setting the parameter skip we can read both versions of the file (and should get the same result)
== The Function Exercise ==
* run the function using source()
* use the function to read in \texttt{../session1/session1data/pre001.txt} and \texttt{data/pretest/pre\_001.txt}
* use some summary functions like table() or summary to check if they contain the same information
We will learn about a function to compare data frames more exact soon.
== The Function (continued) ==
{{{#!highlight r
> file <- "../session1/session1data/pre001.txt"
> pre1 <- read.file(file,skip=3)
[1] "read ../session1/session1data/pre001.txt"
> file <- "data/pretest/pre_001.txt"
> pre1v2 <- read.file(file,skip=0)
[1] "read ../session2/data/pretest/pre_001.txt"
}}}
== rbind() ==
   * rbind() can be used to combine two dataframes (or matrices) in the sense of adding rows, the column names and types must be the same for the two objects
{{{#!highlight r
> x <- data.frame(id=1:3,score=rnorm(3))
> y <- data.frame(id=13:15,score=rnorm(3))
> rbind(x,y)
id       score
1  1  0.71121163
2  2 -0.62973249
3  3  1.17737595
4 13 -0.45074940
5 14 -0.01044197
6 15 -1.05217176
}}}
== cbind() ==
   * cbind() can be used to combine two dataframes (or matrices) in the sense of adding columns, the number of rows must be the same for the two objects
{{{#!highlight r
> cbind(x,y)
id      score1      score2     score3
1  1  0.11440705  0.14536778 -1.1773241
2  2 -1.62862651  0.02020604  0.5686415
3  3  0.05335811  0.25462270  0.8844987
4  4 -0.19931734  0.15625511  0.9287316
5  5 -1.15217836 -1.79804503 -0.7550234
}}}
   * it is not recommended to use cbind() to combining data frames
== merge() ==
   * merge() is the command of choice for merging or joining data frames
   * it is the equivalent of join in sql
   * there are four cases
      * inner join
      * left outer join
      * right outer join
      * full outer join
{{{#!highlight r
> (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3)))
id day1
1  A    3
2  B    4
3  C    5
> (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4)))
id day2
1  A    7
2  C   10
3  E    3
4  F    6
}}}
== inner join ==
   * inner join means: keep only the cases present in both of the data frames
{{{#!highlight r
> merge(d1,d2)
id day1 day2
1  A    3    7
2  C    5   10
}}}
== left outer join ==
   * left outer join means: keep all cases of the left data frame no matter if they are present in the right data frame (all.x=T)
{{{#!highlight r
> merge(d1,d2,all.x = T)
id day1 day2
1  A    3    7
2  B    4   NA
3  C    5   10
}}}
== right outer join ==
   * right outer join means: keep all cases of the right data frame no matter if they are present in the left data frame (all.y=T)
{{{#!highlight r
> merge(d1,d2,all.y = T)
id day1 day2
1  A    3    7
2  C    5   10
3  E   NA    3
4  F   NA    6
}}}
== full outer join ==
   * full outer join means: keep all cases of both data frames (all=T)
{{{#!highlight r
> merge(d1,d2,all = T)
id day1 day2
1  A    3    7
2  B    4   NA
3  C    5   10
4  E   NA    3
5  F   NA    6
}}}
== merge() ==
   * if not stated otherwise R uses the intersect of the names of both data frames, in our case only \textit{id}
   * you can specify these columns directly by \texttt{by=c("colname1","colname2")} if the columns are named identical or
   * using\\ \texttt{by.x=c("colname1.x","colname2.x"),
== merge() Exercise ==
   * now read in the file personendaten.txt using the appropriate command
   * join the demographics with our pre1 data frame (even though it does not make sense now)
== merge() Exercise ==
{{{#!highlight r
> persdat <- read.table("../session1/session1data/personendaten.txt",
+                       sep="\t",
+                       header=T)
> pre1 <- merge(persdat,pre1,all.y = T)
> head(pre1)
Subject Sex Age_PRETEST Trial Event.Type Code   Time TTime Uncertainty
1  PRE001   f        3.11     7   Response    2 178963 10009           1
2  PRE001   f        3.11    12   Response    1 238680  8342           1
3  PRE001   f        3.11    17   Response    2 297789  8066           1
4  PRE001   f        3.11    22   Response    1 351321 10811           1
5  PRE001   f        3.11    27   Response    2 403607   713           1
6  PRE001   f        3.11    32   Response    1 467793 23709           1
Duration Uncertainty.1 ReqTime ReqDur Stim.Type Pair.Index    Type Event.Code
1    10197             2       0   next incorrect          7 Picture   RO09.jpg
2     8398             2       0   next incorrect         12 Picture   RO20.jpg
3     8198             2       0   next       hit         17 Picture   RS28.jpg
4    10997             2       0   next       hit         22 Picture   AT26.jpg
5      800             2       0   next       hit         27 Picture   RS23.jpg
6    23794             2       0   next       hit         32 Picture   OF04.jpg
}}}
== Reduce() ==
   * is a higher order function (functional)
   * Reduce() uses a binary function (like rbind() or merge()) to combine successively the elements of a given list
   * it can be used if you have not only two but many data frames
== Reduce() ==
   * first we make up 4 artifical data frames
== Reduce() ==
{{{#!highlight r
> (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3)))
id day1
1  A    3
2  B    1
3  C    7
> (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4)))
id day2
1  A    8
2  C    2
3  E    5
4  F    3
> (d3 <- data.frame(id=LETTERS[c(2,4:6)],day3=sample(10,4)))
id day3
1  B    8
2  D    3
3  E    4
4  F   10
> (d4 <- data.frame(id=LETTERS[c(1:5)],day4=sample(10,5)))
id day4
1  A    2
2  B    7
3  C    8
4  D    9
5  E    1
}}}
== Reduce() ==
   * now we use Reduce() in combination with merge()
{{{#!highlight r
> Reduce(merge,list(d1,d2,d3,d4))
[1] id   day1 day2 day3 day4
}}}
   * and what we get is an empty data frame
   * well this isn't exactly what we wanted, so why?
   * it is because the default behavior of merge() is set all=F, so we get only complete lines which is in this case - none
   * so we have to define a wrapper function which only change this argument to all=T
== Reduce() ==
   * now we use Reduce() in combination with merge()
{{{#!highlight r
> Reduce(function(x,y) { merge(x,y, all=T) },
+        list(d1,d2,d3,d4))
id day1 day2 day3 day4
1  A    3    8   NA    2
2  B    1   NA    8    7
3  C    7    2   NA    8
4  E   NA    5    4    1
5  F   NA    3   10   NA
6  D   NA   NA    3    9
}}}
   * which is exactly what we want
== Reduce() ==
   * a second example in combination with rbind()
{{{#!highlight r
> d4$day <- names(d4)[2]
> names(d4)[2] <- "score"
> Reduce(function(x,y) { y$day <- names(y)[2]
+                        names(y)[2] <- "score"
+                        rbind(x,y) } ,
+        list(d1,d2,d3), init = d4)
id score  day
1   A     2 day4
2   B     7 day4
3   C     8 day4
4   D     9 day4
}}}
   * which is exactly what we want
== A second function ==
   * well that's better, but it is still boring to do this for every single file
   * so see what we have learned: the combination of lapply() and Reduce() can do the work 
   * using dir{} we get all the files contained in a given directory
   * then we use lapply() together with our new function read.file()
== dir() ==
   * dir() without additional argument shows all files/directories in the working directory
{{{#!highlight r
> dir()
[1] "data"                  "function.r"            "function.r~"          
[4] "ggp1.pdf"              "graphics.r"            "linkimage.aux"        
[7] "session2apply.aux"     "session2apply.log"     "session2apply.nav"    
[10] "session2apply.out"     "session2apply.pdf"     "session2apply.snm"    
[13] "session2apply.tex"     "session2apply.tex~"    "#session2apply.tex#"  
[16] "session2apply.toc"     "session2apply.vrb"     "session2hadley.aux"   
[19] "session2hadley.log"    "session2hadley.nav"    "session2hadley.out"   
[22] "session2hadley.pdf"    "session2hadley.snm"    "session2hadley.tex"   
[25] "session2hadley.tex~"   "session2hadley.toc"    "session2hadley.vrb"   
[28] "solutionssession1.r"   "solutionssession1.r~"  "solutionssession2.r"  
[31] "solutionssession2.r~"  "#solutionssession2.r#"
}}}
== dir() ==
   * given a path dir() will show the content of resp folder
{{{#!highlight r
> dir("data")
[1] "posttest"   "pretest"    "training_1" "training_2" "training_3"
[6] "training_4" "training_5" "training_6" "training_7" "training_8"
}}}
== dir() ==
   * setting recursive to TRUE R will recurse into directories recursively through 
{{{#!highlight r
> dir("data",recursive = T)
[1] "posttest/post_001.txt"      "posttest/post_002.txt"     
[3] "posttest/post_003.txt"      "posttest/post_004.txt"     
[5] "posttest/post_005.txt"      "posttest/post_006.txt"     
[7] "posttest/post_007.txt"      "posttest/post_008.txt"     
}}}
== dir() ==
   * setting full.names to TRUE R will give the full path
{{{#!highlight r
> dir("data",recursive = T, full.names = T)
[1] "data/posttest/post_001.txt"      "data/posttest/post_002.txt"     
[3] "data/posttest/post_003.txt"      "data/posttest/post_004.txt"     
[5] "data/posttest/post_005.txt"      "data/posttest/post_006.txt"     
}}}
== dir() ==
   * with pattern we can specify which files to show (regexpr), e.g. all r files 
{{{#!highlight r
> dir(pattern = "\\.r$")
[1] "function.r"          "graphics.r"          "solutionssession1.r"
[4] "solutionssession2.r"
}}}
== dir() Exercise ==
   * create a variable files containing the names of all text files in the data directory, my editor creates temporary files beginning and ending by a hash key, make sure they are not contained in the list
== dir() Exercise ==
{{{#!highlight r
> dir("data",full.names = T, recursive = T,pattern = "txt$"
+ )
[1] "data/posttest/post_001.txt"    "data/posttest/post_002.txt"   
[3] "data/posttest/post_003.txt"    "data/posttest/post_004.txt"   
[5] "data/posttest/post_005.txt"    "data/posttest/post_006.txt"   
> files <- dir("data",full.names = T, recursive = T,pattern = "txt$")
}}}
== Read all files ==
Now we use lapply() and our function read.file() to read all files in files
{{{#!highlight r
> df.list <- lapply(files,read.file,skip=0)
[1] "read data/posttest/post_001.txt"
[1] "read data/posttest/post_002.txt"
[1] "read data/posttest/post_003.txt"
[1] "read data/posttest/post_004.txt"
[1] "read data/posttest/post_005.txt"
}}}
== Reading all files ==
   * the object df.list is a list containing 192 data frames
{{{#!highlight r
> sapply(df.list,class)
[1] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[6] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[11] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[16] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[21] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[26] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[31] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[36] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[41] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[46] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[51] "data.frame" "NULL"       "data.frame" "data.frame" "data.frame"
[56] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[61] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[66] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[71] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
}}}
== The Function 2 ==
   * in a last step we use Reduce{} to combine these 192 data frames
{{{#!highlight r
> data <- Reduce(rbind,df.list)
> nrow(data)
[1] 12704
> table(data$Subject)
001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2 
93        91        96        93        95        95        93        96 
009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2 
92        94        95        96        96        95        96        94 
017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1 
95        94        96        95        95        95        96        94 
005_test1 006_test1 007_test1 008_test1 009_test1 010_test1 011_test1 012_test1 
96        95        94        90        96        95        91        96 
013_test1 014_test1 015_test1 016_test1 017_test1 018_test1 019_test1 020_test1 
95        96        95        91        96        96        96        96 
001_1     002_1     003_1     004_1     005_1     006_1    CHGU_1    008_1a 
60        59        60        54        60        59        60        60 
009_1     010_1     RMK_1     013_1     014_1     015_1     016_1    IJ2K_1 
60        60        59        59        60        58        59        58 
018_1     019_1     020_1     001_2     002_2     003_2     004_2     005_2 
60        59        60        59        59        57        58        57 
006_2     007_2     008_2     009_2     010_2     011_2     012_2     013_2 
58        58        54        58        58        59        59        56 
014_2     015_2     016_2     017_2     018_2     019_2     020_2     001_3 
}}}
== The Function no 2 ==
   * so it is recommended to build again a function out of this
{{{#!highlight r
> read.files <- function(filesdir,skip=3,recursive=F,pattern="."){
+     files <- dir(filesdir,
+                  full.names = T,
+                  recursive = recursive,
+                  pattern = pattern)
+     Reduce(rbind,lapply(files,read.file,skip=skip))}
> data <- read.files("data",recursive = T,skip=0,pattern = "\\.txt$")
[1] "read data/posttest/post_001.txt"
[1] "read data/posttest/post_002.txt"
[1] "read data/posttest/post_003.txt"
[1] "read data/posttest/post_004.txt"
[1] "read data/posttest/post_005.txt"
}}}
== The Subject column ==
   * table the Subject column again. What is the problem?
== The Subject column ==
{{{#!highlight r
> table(data$Subject)
001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2 
93        91        96        93        95        95        93        96 
009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2 
92        94        95        96        96        95        96        94 
017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1 
95        94        96        95        95        95        96        94 
}}}
   * subject and time coded in one variable
== The Subject column ==
   * we create two new variables using the str\_split() function (stringr package)
   * becaus str\_split() has a list containing a vector as result we have to use it in combination with sapply()
   * then correct some of the person ids
{{{#!highlight r
> data$persid <- sapply(data$Subject,function(x)
+     str_split(x,pattern = "_")[[1]][1])
> data$testid <- sapply(data$Subject,function(x)
+     str_split(x,pattern = "_")[[1]][2])
> data$persid[data$persid=="CHGU"] <- "007"
}}}
== The Subject column Exercises ==
   * there are some more wrong person ids: RMK - 011, IJ2K - 017, GA3K - 004, Kj6K - 006. Correct them!
== The Subject column Exercises ==
{{{#!highlight r
> data$persid[data$persid=="RMK"] <- "011"
> data$persid[data$persid=="IJ2K"] <- "017"
> data$persid[data$persid=="GA3K"] <- "004"
> data$persid[data$persid=="Kj6K"] <- "006"
}}}
== Merging ==
* now read in the file subjectsdemographics.txt using the appropriate command
* join the demographics with our data data frame (there is a little problem left - compare the persid and Subject columns)
== The Subject column Exercises ==
{{{#!highlight r
> persdat <- read.table("data/subjectdemographics.txt",
+                       sep="\t",
+                       header=T)
> persdat$Subject
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> unique(data$persid)
[1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011" "012"
[13] "013" "014" "015" "016" "017" "018" "019" "020"
> data$persid <- as.numeric(data$persid)
> unique(data$persid)
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> data <- merge(persdat,data,by.x = "Subject",by.y = "persid",all=T)
In merge.data.frame(persdat, data, by.x = "Subject", by.y = "persid",  :
column name ‘Subject’ is duplicated in the result
> head(data)
Subject Sex Age_PRETEST   Subject Trial Event.Type Code   Time TTime
1       1   f        3.11 001_test2     7   Response    2 103745  2575
2       1   f        3.11 001_test2    12   Response    2 156493  2737
3       1   f        3.11 001_test2    17   Response    2 214772  6630
4       1   f        3.11 001_test2    22   Response    1 262086  5957
5       1   f        3.11 001_test2    27   Response    2 302589   272
6       1   f        3.11 001_test2    32   Response    1 352703  7197
}}}
== Summary Graphics ==
Just run the code and try to understand it. We will cover the ggplot graphics in the next session.
{{{#!highlight r
> ggplot(data,aes(x=factor(Subject),fill=..count..)) +
+     geom_bar() +
+     facet_wrap(~testid)
}}}
<img alt='sesssion2/graph1.png' src='-1' />
== Summary Graphics ==
   * so there are problems in coding of the test id
   * we remove the letters at the end using str\_replace()
{{{#!highlight r
> data$testid <- str_replace(data$testid,"[a-z]$","")
> data$testid <- factor(data$testid,
+                       levels=c("test1","1","2","3","4","5","6","7","8","test2"))
> table(data$Subject,data$testid)
test1  1  2  3  4  5  6  7  8 test2
1     95 60 59 60 59 59 60 60 60    93
2     95 59 59 58 60 60 60 60 60    91
3     96 60 57 60 60 60 60 59 58    96
4     94 54 58 60 60 55 53 60 58    93
5     96 60 57 60 60 60 60 60  0    95
6     95 59 58 59 58 59 55 54 55    95
7     94 60 58 60 58 59 60 59 59    93
8     90 60 54 55 60 60 60 59 60    96
9     96 60 58 59 57  0  0 58 56    92
10    95 60 58 58 60 60 58  0  0    94
11    91 59 59 60 60 57 58 60 60    95
}}}
== Summary Graphics ==
{{{#!highlight r
> ggplot(data,aes(x=factor(Subject),fill=..count..)) +
+     geom_bar() +
+     facet_wrap(~testid)
}}}
<img alt='sesssion2/graph2.png' src='-1' />
== Summary Graphics ==
And another one.
{{{#!highlight r
> ggplot(data,aes(x=testid,fill=Stim.Type)) +
+     geom_bar(position=position_fill()) +
+     facet_wrap(~Subject)
}}}
<img alt='sesssion2/graph3.png' src='-1' />

Quick Links

Search Wiki

Page Tools

Introduction

Functions

Function Arguments

Implicit Loops

Introduction

lapply()

apply()

tapply()

Function Exercises (Verzani)

Function Exercises (Verzani) Solutions

Function Exercises (Verzani) Solutions

Function Exercises (Verzani) Solutions

Function Exercises (Verzani) Solutions