Introduction

Implicit Loops

A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure. In R this is done by the functions:

lapply()
sapply()
apply()
tapply()

lapply()

The functions lapply and sapply are similar, their first argument can be a list, data frame, matrix or vector, the second argument the function to "apply". The former return a list (hence "l") and the latter tries to simplify the results (hence the "s"). For example:

   1 > lapply(dat,mean)
   2 [1] 6753.636
   3 [1] 5433.182
   4 > sapply(dat,mean)

apply()

apply() this function can be applied to an array. Its argument is the array, the second the dimension/s where we want to apply a function and the third is the function. For example

   1 > x<-1:12
   2 > dim(x)<-c(2,2,3)
   3 > apply(x,3,quantile) ## calculate the quantiles

tapply()

The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.

For example in the quine data frame, we can summarize Days classify by Eth and Lrn as follows:

   1 > tapply(Days,list(Eth,Lrn),mean)
   2 AL       SL
   3 A 18.57500 24.89655
   4 N 13.25581 10.82353

* the class() function shows the class of an object use it in combination with lapply() to get the classes of the columns of the quine data frame * do the same with sapply() what is the difference * try to combine this with what you learned about indexing and create a new data frame quine2 only containing the columns which are factors * calculate the row and column means of the below defined matrix m using the apply function PS: in real life application use the rowMeans() and colMeans() function

   1 m <- matrix(rnorm(100),nrow=10)

* use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines) * sometimes the aggregate() function is more convenient; note the use of #!latex $\sim$; it is read as 'is dependent on'and it is extensively used in modelling

   1 > aggregate(Days ~ Sex + Eth, data=quine,mean)
   2 Sex Eth     Days
   3 1   F   A 20.92105
   4 2   M   A 21.61290
   5 3   F   N 10.07143
   6 4   M   N 14.71429
   7 > aggregate(Days ~ Sex + Eth, data=quine,summary)
   8 Sex Eth Days.Min. Days.1st Qu. Days.Median Days.Mean Days.3rd Qu. Days.Max.
   9 1   F   A      0.00         5.25       13.50     20.92        30.25     81.00
  10 2   M   A      2.00         9.50       16.00     21.61        33.00     57.00
  11 3   F   N      0.00         5.00        7.00     10.07        14.00     37.00
  12 4   M   N      0.00         3.50        8.00     14.71        19.50     69.00

Functions

Every function in R has three important characteristics:

a body (the code inside the function) - body()
arguments (the list of arguments which controls how you can call the function) - formals()
an environment (the “map” of the location of the function’s variables) - environment()

You can see all three parts if you type the name of the function without primitives. Exceptions are brackets. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.

Functions

   1 > chisq.test
   2 function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), 
   3 DNAME <- deparse(substitute(x))
   4 if (is.data.frame(x)) 
   5 expected = E, residuals = (x - E)/sqrt(E), stdres = (x - 
   6 > sum
   7 function (..., na.rm = FALSE)  .Primitive("sum")

Function Arguments

Arguments are matched

first by exact name (perfect matching)
then by prefix matching
and finally by position.

By default, R function arguments are lazy, they are only evaluated if they are actually used:

   1 > f <- function(x) {
   2 f <- function(x) {
   3 +   10
   4 + }
   5 > f(stop("This is an error!"))
   6 [1] 10
   7 >

Function Exercises (Verzani)

* Write a function to compute the average distance from the mean for some data vector. * Write a function f() which finds the average of the x values after squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)} * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()? * Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0.

Function Exercises (Verzani)

Write a function to compute the average distance from the mean for some data vector.

   1 > avg.dist <- function(x){
   2 +     xbar <- mean(x)
   3 +     mean(abs(x-xbar))
   4 + }

Function Exercises (Verzani)

Write a function f() which finds the average of the x values aufter squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)}

   1 > f <- function(x){
   2 +     mean(x**2) - mean(x)**2
   3 + }
   4 > f(1:10)
   5 [1] 8.25

Function Exercises (Verzani)

An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()?

   1 > iseven <- function(x){
   2 +     x %% 2 == 0
   3 + }
   4 > iseven(1:10)
   5 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
   6 > isodd <- function(x){
   7 +     !iseven(x)
   8 + }
   9 > isodd(1:10)
  10 [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

Function Exercises (Verzani)

Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0.

   1 > isprime <- function(x){
   2 +     if(x == 2) return(TRUE)
   3 +     !(0 %in% (x %% (2:(x-1))))
   4 + }
   5 > isprime(2)
   6 [1] TRUE
   7 > isprime(5)
   8 [1] TRUE
   9 > isprime(15)
  10 [1] FALSE

Read in the file

   1 > file <- "../session1/session1data/pre001.txt"
   2 > skip <- 3
   3 > tmp <- read.table(file,skip = skip,sep = "\t",
   4 +                   header=T,na.strings = c(" +",""),
   5 +                   fill=T)

Remove empty line

   1 > tmp <- tmp[!is.na(tmp$Subject),]

Remove spaces

Remove unnecessary spaces from character vectors/factors

   1 > tmp <- lapply(tmp,function(x) {
   2 +         if( class(x) %in% c("character","factor") ){
   3 +             x <- factor(gsub(" ","",as.character(x)))
   4 +             return(x)}else{ return(x) }})
   5 > tmp <- as.data.frame(tmp)

Find/Remove breaks

   1 > if(length(pause)>0){
   2 +     drei <- which(tmp$Code==3 & !is.na(tmp$Code))
   3 +     drei <- drei[drei > pause][1:2]
   4 +     if(pause + 1 < drei[1]){
   5 +         tmp <- tmp[-(pause:drei[2]),]
   6 +     }}
   7 > tmp <- tmp[!(tmp$Event.Type %in% c("Pause","Resume")), ]

Find/Remove first/last rows

   1 > first.pic <- min(which(tmp$Event.Type=="Picture" & 
   2 +                           !is.na(tmp$Event.Type) )) - 1 
   3 > tmp <- tmp[-(1:first.pic),]
   4 > last.pic <- min(which(tmp$Event.Type=="Picture" & 
   5 +                           !is.na(tmp$Event.Type) &
   6 +                           tmp$Code=="Fertig!" & 
   7 +                           !is.na(tmp$Code)))
   8 > tmp <- tmp[-(last.pic:nrow(tmp)),]

Extract Responses

   1 > zeilen <- which(tmp$Event.Type %in% c("Response"))
   2 > zeilen <- sort(unique(c(zeilen,zeilen-1)))
   3 > zeilen <- zeilen[zeilen>0]
   4 > tmp <- tmp[zeilen,]

Extract Responses

   1 > responses <- which(tmp$Code %in% c(1,2))
   2 > events <- responses-1
   3 > tmp$Type <- NA
   4 > tmp$Type[responses] <- as.character(tmp$Event.Type[events])
   5 > head(tmp)
   6 Subject Trial Event.Type     Code   Time TTime Uncertainty Duration
   7 6   PRE001     7    Picture RO09.jpg 168954     0           1    10197
   8 7   PRE001     7   Response        2 178963 10009           1       NA
   9 11  PRE001    12    Picture RO20.jpg 230338     0           1     8398
  10 12  PRE001    12   Response        1 238680  8342           1       NA
  11 16  PRE001    17    Picture RS28.jpg 289723     0           1     8198
  12 17  PRE001    17   Response        2 297789  8066           1       NA
  13 6              2       0   next incorrect          7    <NA>
  14 7             NA      NA   <NA>      <NA>         NA Picture
  15 11             2       0   next incorrect         12    <NA>
  16 12            NA      NA   <NA>      <NA>         NA Picture
  17 16             2       0   next       hit         17    <NA>
  18 17            NA      NA   <NA>      <NA>         NA Picture

Moving Information

Moving all (necessary) information to the response lines.

   1 > tmp$Event.Code <- NA
   2 > tmp$Event.Code[responses] <- as.character(tmp$Code[events])
   3 > tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events])
   4 > tmp$Duration[responses] <- as.character(tmp$Duration[events])
   5 > tmp$Uncertainty.1[responses] <- as.character(tmp$Uncertainty.1[events])
   6 > tmp$ReqTime[responses] <- as.character(tmp$ReqTime[events])
   7 > tmp$ReqDur[responses] <- as.character(tmp$ReqDur[events])
   8 > tmp$Pair.Index[responses] <- as.character(tmp$Pair.Index[events])
   9 > tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events])

Moving Information

   1 > head(tmp)
   2 Subject Trial Event.Type     Code   Time TTime Uncertainty Duration
   3 6   PRE001     7    Picture RO09.jpg 168954     0           1    10197
   4 7   PRE001     7   Response        2 178963 10009           1    10197
   5 11  PRE001    12    Picture RO20.jpg 230338     0           1     8398
   6 12  PRE001    12   Response        1 238680  8342           1     8398
   7 16  PRE001    17    Picture RS28.jpg 289723     0           1     8198
   8 17  PRE001    17   Response        2 297789  8066           1     8198
   9 6              2       0   next incorrect          7    <NA>       <NA>
  10 7              2       0   next incorrect          7 Picture   RO09.jpg
  11 11             2       0   next incorrect         12    <NA>       <NA>
  12 12             2       0   next incorrect         12 Picture   RO20.jpg
  13 16             2       0   next       hit         17    <NA>       <NA>
  14 17             2       0   next       hit         17 Picture   RS28.jpg

Keep response lines

   1 > tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),]
   2 > tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),]
   3 > head(tmp)
   4 Subject Trial Event.Type Code   Time TTime Uncertainty Duration
   5 7   PRE001     7   Response    2 178963 10009           1    10197
   6 12  PRE001    12   Response    1 238680  8342           1     8398
   7 17  PRE001    17   Response    2 297789  8066           1     8198
   8 22  PRE001    22   Response    1 351321 10811           1    10997
   9 27  PRE001    27   Response    2 403607   713           1      800
  10 32  PRE001    32   Response    1 467793 23709           1    23794
  11 7              2       0   next incorrect          7 Picture   RO09.jpg
  12 12             2       0   next incorrect         12 Picture   RO20.jpg
  13 17             2       0   next       hit         17 Picture   RS28.jpg
  14 22             2       0   next       hit         22 Picture   AT26.jpg
  15 27             2       0   next       hit         27 Picture   RS23.jpg
  16 32             2       0   next       hit         32 Picture   OF04.jpg

The Function

it would be a tedious work to every step for all of the files
if we look through the steps the only important thing that we have to change is the file name
so we rather use a canned version of our procedure dependend the file name and the number of lines to skip - we create a function read.file(file):

   1 tmp <- read.table(file,skip = skip,sep = "\t",

The Function (continued)

   1 tmp <- tmp[!is.na(tmp$Subject),] 
   2 tmp <- lapply(tmp,function(x) {
   3 x <- factor(gsub(" ","",as.character(x)))
   4 tmp <- as.data.frame(tmp)

The Function (continued)

   1 pause <- which(tmp$Event.Type=="Picture" & tmp$Code=="Pause")
   2 drei <- which(tmp$Code==3 & !is.na(tmp$Code))
   3 drei <- drei[drei > pause][1:2]
   4 tmp <- tmp[-(pause:drei[2]),]
   5 tmp <- tmp[!(tmp$Event.Type %in% c("Pause","Resume")), ]

The Function (continued)

   1 tmp <- tmp[-(1:first.pic),]
   2 tmp <- tmp[-(last.pic:nrow(tmp)),]

The Function (continued)

   1 zeilen <- which(tmp$Event.Type %in% c("Response"))
   2 zeilen <- sort(unique(c(zeilen,zeilen-1)))
   3 zeilen <- zeilen[zeilen>0]
   4 tmp <- tmp[zeilen,]
   5 responses <- which(tmp$Code %in% c(1,2))
   6 events <- responses-1

The Function (continued)

   1 tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),]
   2 tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),]

The Function (continued)

we can use this function now to read in the file
and get the processed data frame in one step
setting the parameter skip we can read both versions of the file (and should get the same result)

The Function Exercise

* run the function using source() * use the function to read in \texttt{../session1/session1data/pre001.txt} and \texttt{data/pretest/pre\_001.txt} * use some summary functions like table() or summary to check if they contain the same information We will learn about a function to compare data frames more exact soon.

The Function (continued)

   1 > file <- "../session1/session1data/pre001.txt"
   2 > pre1 <- read.file(file,skip=3)
   3 [1] "read ../session1/session1data/pre001.txt"
   4 > file <- "data/pretest/pre_001.txt"
   5 > pre1v2 <- read.file(file,skip=0)
   6 [1] "read ../session2/data/pretest/pre_001.txt"

rbind()

rbind() can be used to combine two dataframes (or matrices) in the sense of adding rows, the column names and types must be the same for the two objects

   1 > x <- data.frame(id=1:3,score=rnorm(3))
   2 > y <- data.frame(id=13:15,score=rnorm(3))
   3 > rbind(x,y)
   4 id       score
   5 1  1  0.71121163
   6 2  2 -0.62973249
   7 3  3  1.17737595
   8 4 13 -0.45074940
   9 5 14 -0.01044197
  10 6 15 -1.05217176

cbind()

cbind() can be used to combine two dataframes (or matrices) in the sense of adding columns, the number of rows must be the same for the two objects

   1 > cbind(x,y)
   2 id      score1      score2     score3
   3 1  1  0.11440705  0.14536778 -1.1773241
   4 2  2 -1.62862651  0.02020604  0.5686415
   5 3  3  0.05335811  0.25462270  0.8844987
   6 4  4 -0.19931734  0.15625511  0.9287316
   7 5  5 -1.15217836 -1.79804503 -0.7550234

it is not recommended to use cbind() to combining data frames

merge()

merge() is the command of choice for merging or joining data frames
it is the equivalent of join in sql
there are four cases
- inner join
- left outer join
- right outer join
- full outer join

   1 > (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3)))
   2 id day1
   3 1  A    3
   4 2  B    4
   5 3  C    5
   6 > (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4)))
   7 id day2
   8 1  A    7
   9 2  C   10
  10 3  E    3
  11 4  F    6

inner join

inner join means: keep only the cases present in both of the data frames

   1 > merge(d1,d2)
   2 id day1 day2
   3 1  A    3    7
   4 2  C    5   10

left outer join

left outer join means: keep all cases of the left data frame no matter if they are present in the right data frame (all.x=T)

   1 > merge(d1,d2,all.x = T)
   2 id day1 day2
   3 1  A    3    7
   4 2  B    4   NA
   5 3  C    5   10

right outer join

right outer join means: keep all cases of the right data frame no matter if they are present in the left data frame (all.y=T)

   1 > merge(d1,d2,all.y = T)
   2 id day1 day2
   3 1  A    3    7
   4 2  C    5   10
   5 3  E   NA    3
   6 4  F   NA    6

full outer join

full outer join means: keep all cases of both data frames (all=T)

   1 > merge(d1,d2,all = T)
   2 id day1 day2
   3 1  A    3    7
   4 2  B    4   NA
   5 3  C    5   10
   6 4  E   NA    3
   7 5  F   NA    6

merge()

if not stated otherwise R uses the intersect of the names of both data frames, in our case only \textit{id}
you can specify these columns directly by \texttt{by=c("colname1","colname2")} if the columns are named identical or
using\\ \texttt{by.x=c("colname1.x","colname2.x"),

merge() Exercise

now read in the file personendaten.txt using the appropriate command
join the demographics with our pre1 data frame (even though it does not make sense now)

merge() Exercise

   1 > persdat <- read.table("../session1/session1data/personendaten.txt",
   2 +                       sep="\t",
   3 +                       header=T)
   4 > pre1 <- merge(persdat,pre1,all.y = T)
   5 > head(pre1)
   6 Subject Sex Age_PRETEST Trial Event.Type Code   Time TTime Uncertainty
   7 1  PRE001   f        3.11     7   Response    2 178963 10009           1
   8 2  PRE001   f        3.11    12   Response    1 238680  8342           1
   9 3  PRE001   f        3.11    17   Response    2 297789  8066           1
  10 4  PRE001   f        3.11    22   Response    1 351321 10811           1
  11 5  PRE001   f        3.11    27   Response    2 403607   713           1
  12 6  PRE001   f        3.11    32   Response    1 467793 23709           1
  13 Duration Uncertainty.1 ReqTime ReqDur Stim.Type Pair.Index    Type Event.Code
  14 1    10197             2       0   next incorrect          7 Picture   RO09.jpg
  15 2     8398             2       0   next incorrect         12 Picture   RO20.jpg
  16 3     8198             2       0   next       hit         17 Picture   RS28.jpg
  17 4    10997             2       0   next       hit         22 Picture   AT26.jpg
  18 5      800             2       0   next       hit         27 Picture   RS23.jpg
  19 6    23794             2       0   next       hit         32 Picture   OF04.jpg

Reduce()

is a higher order function (functional)
Reduce() uses a binary function (like rbind() or merge()) to combine successively the elements of a given list
it can be used if you have not only two but many data frames

Reduce()

first we make up 4 artifical data frames

Reduce()

   1 > (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3)))
   2 id day1
   3 1  A    3
   4 2  B    1
   5 3  C    7
   6 > (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4)))
   7 id day2
   8 1  A    8
   9 2  C    2
  10 3  E    5
  11 4  F    3
  12 > (d3 <- data.frame(id=LETTERS[c(2,4:6)],day3=sample(10,4)))
  13 id day3
  14 1  B    8
  15 2  D    3
  16 3  E    4
  17 4  F   10
  18 > (d4 <- data.frame(id=LETTERS[c(1:5)],day4=sample(10,5)))
  19 id day4
  20 1  A    2
  21 2  B    7
  22 3  C    8
  23 4  D    9
  24 5  E    1

Reduce()

now we use Reduce() in combination with merge()

   1 > Reduce(merge,list(d1,d2,d3,d4))
   2 [1] id   day1 day2 day3 day4

and what we get is an empty data frame
well this isn't exactly what we wanted, so why?
it is because the default behavior of merge() is set all=F, so we get only complete lines which is in this case - none
so we have to define a wrapper function which only change this argument to all=T

Reduce()

now we use Reduce() in combination with merge()

   1 > Reduce(function(x,y) { merge(x,y, all=T) },
   2 +        list(d1,d2,d3,d4))
   3 id day1 day2 day3 day4
   4 1  A    3    8   NA    2
   5 2  B    1   NA    8    7
   6 3  C    7    2   NA    8
   7 4  E   NA    5    4    1
   8 5  F   NA    3   10   NA
   9 6  D   NA   NA    3    9

which is exactly what we want

Reduce()

a second example in combination with rbind()

   1 > d4$day <- names(d4)[2]
   2 > names(d4)[2] <- "score"
   3 > Reduce(function(x,y) { y$day <- names(y)[2]
   4 +                        names(y)[2] <- "score"
   5 +                        rbind(x,y) } ,
   6 +        list(d1,d2,d3), init = d4)
   7 id score  day
   8 1   A     2 day4
   9 2   B     7 day4
  10 3   C     8 day4
  11 4   D     9 day4

which is exactly what we want

A second function

well that's better, but it is still boring to do this for every single file
so see what we have learned: the combination of lapply() and Reduce() can do the work
using dir{} we get all the files contained in a given directory
then we use lapply() together with our new function read.file()

dir()

dir() without additional argument shows all files/directories in the working directory

   1 > dir()
   2 [1] "data"                  "function.r"            "function.r~"          
   3 [4] "ggp1.pdf"              "graphics.r"            "linkimage.aux"        
   4 [7] "session2apply.aux"     "session2apply.log"     "session2apply.nav"    
   5 [10] "session2apply.out"     "session2apply.pdf"     "session2apply.snm"    
   6 [13] "session2apply.tex"     "session2apply.tex~"    "#session2apply.tex#"  
   7 [16] "session2apply.toc"     "session2apply.vrb"     "session2hadley.aux"   
   8 [19] "session2hadley.log"    "session2hadley.nav"    "session2hadley.out"   
   9 [22] "session2hadley.pdf"    "session2hadley.snm"    "session2hadley.tex"   
  10 [25] "session2hadley.tex~"   "session2hadley.toc"    "session2hadley.vrb"   
  11 [28] "solutionssession1.r"   "solutionssession1.r~"  "solutionssession2.r"  
  12 [31] "solutionssession2.r~"  "#solutionssession2.r#"

dir()

given a path dir() will show the content of resp folder

   1 > dir("data")
   2 [1] "posttest"   "pretest"    "training_1" "training_2" "training_3"
   3 [6] "training_4" "training_5" "training_6" "training_7" "training_8"

dir()

setting recursive to TRUE R will recurse into directories recursively through

   1 > dir("data",recursive = T)
   2 [1] "posttest/post_001.txt"      "posttest/post_002.txt"     
   3 [3] "posttest/post_003.txt"      "posttest/post_004.txt"     
   4 [5] "posttest/post_005.txt"      "posttest/post_006.txt"     
   5 [7] "posttest/post_007.txt"      "posttest/post_008.txt"

dir()

setting full.names to TRUE R will give the full path

   1 > dir("data",recursive = T, full.names = T)
   2 [1] "data/posttest/post_001.txt"      "data/posttest/post_002.txt"     
   3 [3] "data/posttest/post_003.txt"      "data/posttest/post_004.txt"     
   4 [5] "data/posttest/post_005.txt"      "data/posttest/post_006.txt"

dir()

with pattern we can specify which files to show (regexpr), e.g. all r files

   1 > dir(pattern = "\\.r$")
   2 [1] "function.r"          "graphics.r"          "solutionssession1.r"
   3 [4] "solutionssession2.r"

dir() Exercise

create a variable files containing the names of all text files in the data directory, my editor creates temporary files beginning and ending by a hash key, make sure they are not contained in the list

dir() Exercise

   1 > dir("data",full.names = T, recursive = T,pattern = "txt$"
   2 + )
   3 [1] "data/posttest/post_001.txt"    "data/posttest/post_002.txt"   
   4 [3] "data/posttest/post_003.txt"    "data/posttest/post_004.txt"   
   5 [5] "data/posttest/post_005.txt"    "data/posttest/post_006.txt"   
   6 > files <- dir("data",full.names = T, recursive = T,pattern = "txt$")

Read all files

Now we use lapply() and our function read.file() to read all files in files

   1 > df.list <- lapply(files,read.file,skip=0)
   2 [1] "read data/posttest/post_001.txt"
   3 [1] "read data/posttest/post_002.txt"
   4 [1] "read data/posttest/post_003.txt"
   5 [1] "read data/posttest/post_004.txt"
   6 [1] "read data/posttest/post_005.txt"

Reading all files

the object df.list is a list containing 192 data frames

   1 > sapply(df.list,class)
   2 [1] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
   3 [6] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
   4 [11] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
   5 [16] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
   6 [21] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
   7 [26] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
   8 [31] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
   9 [36] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
  10 [41] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
  11 [46] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
  12 [51] "data.frame" "NULL"       "data.frame" "data.frame" "data.frame"
  13 [56] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
  14 [61] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
  15 [66] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
  16 [71] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"

The Function 2

in a last step we use Reduce{} to combine these 192 data frames

   1 > data <- Reduce(rbind,df.list)
   2 > nrow(data)
   3 [1] 12704
   4 > table(data$Subject)
   5 001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2 
   6 93        91        96        93        95        95        93        96 
   7 009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2 
   8 92        94        95        96        96        95        96        94 
   9 017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1 
  10 95        94        96        95        95        95        96        94 
  11 005_test1 006_test1 007_test1 008_test1 009_test1 010_test1 011_test1 012_test1 
  12 96        95        94        90        96        95        91        96 
  13 013_test1 014_test1 015_test1 016_test1 017_test1 018_test1 019_test1 020_test1 
  14 95        96        95        91        96        96        96        96 
  15 001_1     002_1     003_1     004_1     005_1     006_1    CHGU_1    008_1a 
  16 60        59        60        54        60        59        60        60 
  17 009_1     010_1     RMK_1     013_1     014_1     015_1     016_1    IJ2K_1 
  18 60        60        59        59        60        58        59        58 
  19 018_1     019_1     020_1     001_2     002_2     003_2     004_2     005_2 
  20 60        59        60        59        59        57        58        57 
  21 006_2     007_2     008_2     009_2     010_2     011_2     012_2     013_2 
  22 58        58        54        58        58        59        59        56 
  23 014_2     015_2     016_2     017_2     018_2     019_2     020_2     001_3

The Function no 2

so it is recommended to build again a function out of this

   1 > read.files <- function(filesdir,skip=3,recursive=F,pattern="."){
   2 +     files <- dir(filesdir,
   3 +                  full.names = T,
   4 +                  recursive = recursive,
   5 +                  pattern = pattern)
   6 +     Reduce(rbind,lapply(files,read.file,skip=skip))}
   7 > data <- read.files("data",recursive = T,skip=0,pattern = "\\.txt$")
   8 [1] "read data/posttest/post_001.txt"
   9 [1] "read data/posttest/post_002.txt"
  10 [1] "read data/posttest/post_003.txt"
  11 [1] "read data/posttest/post_004.txt"
  12 [1] "read data/posttest/post_005.txt"

The Subject column

table the Subject column again. What is the problem?

The Subject column

   1 > table(data$Subject)
   2 001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2 
   3 93        91        96        93        95        95        93        96 
   4 009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2 
   5 92        94        95        96        96        95        96        94 
   6 017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1 
   7 95        94        96        95        95        95        96        94

subject and time coded in one variable

The Subject column

we create two new variables using the str\_split() function (stringr package)
becaus str\_split() has a list containing a vector as result we have to use it in combination with sapply()
then correct some of the person ids

   1 > data$persid <- sapply(data$Subject,function(x)
   2 +     str_split(x,pattern = "_")[[1]][1])
   3 > data$testid <- sapply(data$Subject,function(x)
   4 +     str_split(x,pattern = "_")[[1]][2])
   5 > data$persid[data$persid=="CHGU"] <- "007"

The Subject column Exercises

there are some more wrong person ids: RMK - 011, IJ2K - 017, GA3K - 004, Kj6K - 006. Correct them!

The Subject column Exercises

   1 > data$persid[data$persid=="RMK"] <- "011"
   2 > data$persid[data$persid=="IJ2K"] <- "017"
   3 > data$persid[data$persid=="GA3K"] <- "004"
   4 > data$persid[data$persid=="Kj6K"] <- "006"

Merging

* now read in the file subjectsdemographics.txt using the appropriate command * join the demographics with our data data frame (there is a little problem left - compare the persid and Subject columns)

The Subject column Exercises

   1 > persdat <- read.table("data/subjectdemographics.txt",
   2 +                       sep="\t",
   3 +                       header=T)
   4 > persdat$Subject
   5 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
   6 > unique(data$persid)
   7 [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011" "012"
   8 [13] "013" "014" "015" "016" "017" "018" "019" "020"
   9 > data$persid <- as.numeric(data$persid)
  10 > unique(data$persid)
  11 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
  12 > data <- merge(persdat,data,by.x = "Subject",by.y = "persid",all=T)
  13 In merge.data.frame(persdat, data, by.x = "Subject", by.y = "persid",  :
  14 column name ‘Subject’ is duplicated in the result
  15 > head(data)
  16 Subject Sex Age_PRETEST   Subject Trial Event.Type Code   Time TTime
  17 1       1   f        3.11 001_test2     7   Response    2 103745  2575
  18 2       1   f        3.11 001_test2    12   Response    2 156493  2737
  19 3       1   f        3.11 001_test2    17   Response    2 214772  6630
  20 4       1   f        3.11 001_test2    22   Response    1 262086  5957
  21 5       1   f        3.11 001_test2    27   Response    2 302589   272
  22 6       1   f        3.11 001_test2    32   Response    1 352703  7197

Summary Graphics

Just run the code and try to understand it. We will cover the ggplot graphics in the next session.

   1 > ggplot(data,aes(x=factor(Subject),fill=..count..)) +
   2 +     geom_bar() +
   3 +     facet_wrap(~testid)

Summary Graphics

so there are problems in coding of the test id
we remove the letters at the end using str\_replace()

   1 > data$testid <- str_replace(data$testid,"[a-z]$","")
   2 > data$testid <- factor(data$testid,
   3 +                       levels=c("test1","1","2","3","4","5","6","7","8","test2"))
   4 > table(data$Subject,data$testid)
   5 test1  1  2  3  4  5  6  7  8 test2
   6 1     95 60 59 60 59 59 60 60 60    93
   7 2     95 59 59 58 60 60 60 60 60    91
   8 3     96 60 57 60 60 60 60 59 58    96
   9 4     94 54 58 60 60 55 53 60 58    93
  10 5     96 60 57 60 60 60 60 60  0    95
  11 6     95 59 58 59 58 59 55 54 55    95
  12 7     94 60 58 60 58 59 60 59 59    93
  13 8     90 60 54 55 60 60 60 59 60    96
  14 9     96 60 58 59 57  0  0 58 56    92
  15 10    95 60 58 58 60 60 58  0  0    94
  16 11    91 59 59 60 60 57 58 60 60    95

Summary Graphics

   1 > ggplot(data,aes(x=factor(Subject),fill=..count..)) +
   2 +     geom_bar() +
   3 +     facet_wrap(~testid)

Summary Graphics

And another one.

   1 > ggplot(data,aes(x=testid,fill=Stim.Type)) +
   2 +     geom_bar(position=position_fill()) +
   3 +     facet_wrap(~Subject)

Quick Links

Search Wiki

Page Tools

Seiteninhalt hochladen

Introduction

Implicit Loops

lapply()

apply()

tapply()

Functions

Functions

Function Arguments

Function Exercises (Verzani)

Function Exercises (Verzani)

Function Exercises (Verzani)

Function Exercises (Verzani)

Function Exercises (Verzani)

Read in the file

Remove empty line

Remove spaces

Find/Remove breaks

Find/Remove first/last rows

Extract Responses

Extract Responses

Moving Information

Moving Information

Keep response lines

The Function

The Function (continued)

The Function (continued)

The Function (continued)

The Function (continued)

The Function (continued)

The Function (continued)

The Function Exercise

The Function (continued)

rbind()

cbind()

merge()

inner join

left outer join

right outer join

full outer join

merge()

merge() Exercise

merge() Exercise

Reduce()

Reduce()

Reduce()

Reduce()

Reduce()

Reduce()

A second function

dir()

dir()

dir()

dir()

dir()

dir() Exercise

dir() Exercise

Read all files

Reading all files

The Function 2

The Function no 2

The Subject column

The Subject column

The Subject column

The Subject column Exercises

The Subject column Exercises

Merging

The Subject column Exercises

Summary Graphics

Summary Graphics

Summary Graphics

Summary Graphics