RstatisTik/RstatisTikPortal/RcourSe/CourseOutline/FunctionsInR/ApplyR (revision 3 of 2015-05-01 08:27:34)

Introduction

Implicit Loops

A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure. In R this is done by the functions:

  • lapply()
  • sapply()
  • apply()
  • tapply()

lapply()

  • The functions lapply() and sapply() are similar: their first argument can be a list, data frame, matrix or vector, and the second argument is the function to "apply". The former returns a list (hence the "l") and the latter tries to simplify the result (hence the "s"). For example:

    > lapply(dat,mean)
    [1] 6753.636
    [1] 5433.182
    > sapply(dat,mean)
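
To make the difference between the two return types concrete, here is a minimal sketch on a made-up data frame. The object `toy` and its columns `a` and `b` are invented for illustration and are not the `dat` object used above.

```r
# Hypothetical data frame (not the course data)
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

res_list <- lapply(toy, mean)  # a list with one element per column
res_vec  <- sapply(toy, mean)  # simplified to a named numeric vector

str(res_list)
res_vec
```

lapply() keeps one list element per column, whereas sapply() collapses the same results into a vector because every result has length one.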

apply()

  • apply() can be applied to an array. Its first argument is the array, the second the dimension(s) over which we want to apply a function, and the third the function itself. For example:

    > x<-1:12
    > dim(x)<-c(2,2,3)
    > apply(x,3,quantile) ## calculate the quantiles
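
The role of the second argument may be easier to see with a function that returns a single number. The following sketch builds the same kind of 2 x 2 x 3 array and uses sum() instead of quantile(); the names `slice_sums` and `cell_sums` are chosen here for illustration.

```r
# 2 x 2 x 3 array holding the numbers 1 to 12
x <- array(1:12, dim = c(2, 2, 3))

# Margin 3: one result per 2 x 2 slice along the third dimension
slice_sums <- apply(x, 3, sum)

# Margins 1 and 2: one result per (row, column) cell, summing over the third dimension
cell_sums <- apply(x, c(1, 2), sum)

slice_sums
cell_sums
```

With a single margin the result is a vector of length 3; with two margins it is a 2 x 2 matrix.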

tapply()

  • The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.

For example, in the quine data frame we can summarise Days classified by Eth and Lrn as follows:

    > tapply(Days,list(Eth,Lrn),mean)
            AL       SL
    A 18.57500 24.89655
    N 13.25581 10.82353
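
The same pattern can be tried without the quine data. The sketch below uses three short made-up vectors (`days`, `eth` and `learn` are invented stand-ins, not the real columns) to show the result for one grouping factor and for a list of two.

```r
# Hypothetical values, grouped by two made-up factors
days  <- c(2, 4, 6, 8, 10, 12)
eth   <- factor(c("A", "A", "A", "N", "N", "N"))
learn <- factor(c("AL", "AL", "SL", "AL", "SL", "SL"))

by_eth  <- tapply(days, eth, mean)               # one mean per level of eth
by_both <- tapply(days, list(eth, learn), mean)  # table: eth in rows, learn in columns

by_eth
by_both
```

One factor gives a one-dimensional table; a list of two factors gives a two-way table with one cell per combination.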
  • the class() function shows the class of an object; use it in combination with lapply() to get the classes of the columns of the quine data frame
  • do the same with sapply(); what is the difference?
  • try to combine this with what you learned about indexing and create a new data frame quine2 containing only the columns which are factors
  • calculate the row and column means of the matrix m defined below using the apply() function (PS: in real-life applications use the rowMeans() and colMeans() functions)

    m <- matrix(rnorm(100),nrow=10)

  • use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines)
  • sometimes the aggregate() function is more convenient; note the use of ~: it is read as 'is dependent on' and it is used extensively in modelling

    > aggregate(Days ~ Sex + Eth, data=quine,mean)
      Sex Eth     Days
    1   F   A 20.92105
    2   M   A 21.61290
    3   F   N 10.07143
    4   M   N 14.71429
    > aggregate(Days ~ Sex + Eth, data=quine,summary)
      Sex Eth Days.Min. Days.1st Qu. Days.Median Days.Mean Days.3rd Qu. Days.Max.
    1   F   A      0.00         5.25       13.50     20.92        30.25     81.00
    2   M   A      2.00         9.50       16.00     21.61        33.00     57.00
    3   F   N      0.00         5.00        7.00     10.07        14.00     37.00
    4   M   N      0.00         3.50        8.00     14.71        19.50     69.00
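
As a small self-contained illustration of the formula interface, the following sketch runs aggregate() on a made-up data frame. The object `toy` and its columns are hypothetical and only mimic the structure of quine.

```r
# Hypothetical data: days absent, by sex and ethnicity
toy <- data.frame(
  days = c(2, 4, 6, 8, 10, 12),
  sex  = c("F", "M", "F", "M", "F", "M"),
  eth  = c("A", "A", "A", "N", "N", "N")
)

# "days depends on sex and eth": one mean per sex/eth combination
agg <- aggregate(days ~ sex + eth, data = toy, FUN = mean)
agg
```

Unlike tapply(), the result is a data frame with one row per group, which is often easier to merge or plot afterwards.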

Functions

Every function in R has three important characteristics:

  • a body (the code inside the function) - body()
  • arguments (the list of arguments which controls how you can call the function) - formals()
  • an environment (the “map” of the location of the function’s variables) - environment()

You can see all three parts if you type the name of the function without brackets. Exceptions are primitives: primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.

    > chisq.test
    function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)),
    DNAME <- deparse(substitute(x))
    if (is.data.frame(x))
    expected = E, residuals = (x - E)/sqrt(E), stdres = (x -
    > sum
    function (..., na.rm = FALSE)  .Primitive("sum")
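
The three accessor functions can also be called directly. The sketch below applies them to a small user-defined function (`add` is a made-up name) and then to the primitive sum() for comparison.

```r
# A small closure written for this example
add <- function(x, y = 2) {
  x + y
}

formals(add)      # the argument list: x (no default) and y = 2
body(add)         # the code inside the function
environment(add)  # where the function looks up its free variables

# For a primitive all three are NULL
is.primitive(sum)
formals(sum)
body(sum)
environment(sum)
```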

Function Arguments

Arguments are matched

  • first by exact name (perfect matching)
  • then by prefix matching
  • and finally by position.
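
The three matching rules can be seen at work in a single call. In this sketch `g` and its argument names are invented for illustration.

```r
# Hypothetical function that simply returns its arguments by name
g <- function(first, second, third) {
  c(first = first, second = second, third = third)
}

# first = 1 : exact name match
# sec   = 2 : prefix match, "sec" uniquely identifies "second"
# 3         : the remaining unnamed value is matched by position to "third"
res <- g(3, first = 1, sec = 2)
res
```

Prefix matching is convenient at the console, but writing argument names out in full tends to be safer in scripts.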

R function arguments are lazy by default, that is, they are only evaluated if they are actually used:

    > f <- function(x) {
    +   10
    + }
    > f(stop("This is an error!"))
    [1] 10
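
If an argument should be evaluated even though the body never uses it, base R provides force(). The following sketch contrasts the two behaviours; the function names `lazy` and `eager` are made up for this example.

```r
# The argument is never touched, so it is never evaluated
lazy <- function(x) {
  10
}

# force() evaluates the argument immediately
eager <- function(x) {
  force(x)
  10
}

lazy_result <- lazy(stop("never evaluated"))

# The error is raised here, so we catch it and keep its message
eager_result <- tryCatch(
  eager(stop("evaluated")),
  error = function(e) conditionMessage(e)
)

lazy_result
eager_result
```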

Function Exercises (Verzani)

  • Write a function to compute the average distance from the mean for some data vector.
  • Write a function f() which finds the average of the x values after squaring and subtracts the square of the average of the numbers. Verify that this output will always be non-negative by computing f(1:10).
  • An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x %% 2. Use this to write a function iseven(). How would you write isodd()?
  • Write a function isprime() that checks if a number x is prime by dividing x by all values 2, ..., x-1 and then checking to see if there is a remainder of 0.

Function Exercises (Verzani)

  • Write a function to compute the average distance from the mean for some data vector.

    > avg.dist <- function(x){
    +     xbar <- mean(x)
    +     mean(abs(x-xbar))
    + }

Function Exercises (Verzani)

  • Write a function f() which finds the average of the x values after squaring and subtracts the square of the average of the numbers. Verify that this output will always be non-negative by computing f(1:10).

    > f <- function(x){
    +     mean(x^2) - mean(x)^2
    + }
    > f(1:10)
    [1] 8.25

Function Exercises (Verzani)

  • An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x %% 2. Use this to write a function iseven(). How would you write isodd()?

    > iseven <- function(x){
    +     x %% 2 == 0
    + }
    > iseven(1:10)
    [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
    > isodd <- function(x){
    +     !iseven(x)
    + }
    > isodd(1:10)
    [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

Function Exercises (Verzani)

  • Write a function isprime() that checks if a number x is prime by dividing x by all values 2, ..., x-1 and then checking to see if there is a remainder of 0.

    > isprime <- function(x){
    +     if(x == 2) return(TRUE)
    +     !(0 %in% (x %% (2:(x-1))))
    + }
    > isprime(2)
    [1] TRUE
    > isprime(5)
    [1] TRUE
    > isprime(15)
    [1] FALSE

Read in the file

    > file <- "../session1/session1data/pre001.txt"
    > skip <- 3
    > tmp <- read.table(file,skip = skip,sep = "\t",
    +                   header=TRUE,na.strings = c(" +",""),
    +                   fill=TRUE)
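
The effect of skip, header and na.strings can be tried without the log file by passing the text directly to read.table(). The content below is invented for illustration (three leading lines to skip, then a tab-separated header and two rows) and is not the real layout of pre001.txt.

```r
# Hypothetical stand-in for a log file: 3 lines to skip, then a tab-separated table
raw <- paste(
  "Scenario - demo",
  "Logfile written - now",
  "",
  "Subject\tTrial\tCode",
  "PRE001\t1\tA",
  "PRE001\t2\t",
  sep = "\n"
)

demo <- read.table(text = raw, skip = 3, sep = "\t",
                   header = TRUE, na.strings = c(" +", ""),
                   fill = TRUE)
demo
```

The empty Code field in the last row is read as NA because the empty string is listed in na.strings.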

Remove empty line

    > tmp <- tmp[!is.na(tmp$Subject),]

Remove spaces

Remove unnecessary spaces from character vectors/factors

    > tmp <- lapply(tmp,function(x) {
    +         ## is.character()/is.factor() also work for columns with more than one class
    +         if( is.character(x) || is.factor(x) ){
    +             x <- factor(gsub(" ","",as.character(x)))
    +             return(x)}else{ return(x) }})
    > tmp <- as.data.frame(tmp)
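
The same clean-up step can be checked on a tiny made-up data frame. The object `toy` and its values are hypothetical; the sketch uses is.character() and is.factor() for the type test.

```r
# Hypothetical data frame with stray spaces in a text column
toy <- data.frame(Subject = c(" PRE001", "PRE001 "),
                  Trial   = c(1, 2),
                  stringsAsFactors = FALSE)

# Strip spaces from text columns and turn them into factors; leave other columns alone
toy <- as.data.frame(lapply(toy, function(x) {
  if (is.character(x) || is.factor(x)) {
    factor(gsub(" ", "", as.character(x)))
  } else {
    x
  }
}))

levels(toy$Subject)
```

After the clean-up both rows share a single factor level, so the two spellings are no longer treated as different subjects.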

Find/Remove breaks

    > ## pause is assumed to hold the row index of a "Pause" event (its definition is not shown here)
    > if(length(pause)>0){
    +     drei <- which(tmp$Code==3 & !is.na(tmp$Code))
    +     drei <- drei[drei > pause][1:2]
    +     if(pause + 1 < drei[1]){
    +         tmp <- tmp[-(pause:drei[2]),]
    +     }}
    > tmp <- tmp[!(tmp$Event.Type %in% c("Pause","Resume")), ]

Find/Remove first/last rows

    > first.pic <- min(which(tmp$Event.Type=="Picture" &
    +                           !is.na(tmp$Event.Type) )) - 1
    > tmp <- tmp[-(1:first.pic),]
    > last.pic <- min(which(tmp$Event.Type=="Picture" &
    +                           !is.na(tmp$Event.Type) &
    +                           tmp$Code=="Fertig!" &
    +                           !is.na(tmp$Code)))
    > tmp <- tmp[-(last.pic:nrow(tmp)),]

Extract Responses

    > zeilen <- which(tmp$Event.Type %in% c("Response"))
    > zeilen <- sort(unique(c(zeilen,zeilen-1)))
    > zeilen <- zeilen[zeilen>0]
    > tmp <- tmp[zeilen,]

Extract Responses

    > responses <- which(tmp$Code %in% c(1,2))
    > events <- responses-1
    > tmp$Type <- NA
    > tmp$Type[responses] <- as.character(tmp$Event.Type[events])
    > head(tmp)
       Subject Trial Event.Type     Code   Time TTime Uncertainty Duration
    6   PRE001     7    Picture RO09.jpg 168954     0           1    10197
    7   PRE001     7   Response        2 178963 10009           1       NA
    11  PRE001    12    Picture RO20.jpg 230338     0           1     8398
    12  PRE001    12   Response        1 238680  8342           1       NA
    16  PRE001    17    Picture RS28.jpg 289723     0           1     8198
    17  PRE001    17   Response        2 297789  8066           1       NA
       Uncertainty.1 ReqTime ReqDur Stim.Type Pair.Index    Type
    6              2       0   next incorrect          7    <NA>
    7             NA      NA   <NA>      <NA>         NA Picture
    11             2       0   next incorrect         12    <NA>
    12            NA      NA   <NA>      <NA>         NA Picture
    16             2       0   next       hit         17    <NA>
    17            NA      NA   <NA>      <NA>         NA Picture

Moving Information

Moving all (necessary) information to the response lines.

    > tmp$Event.Code <- NA
    > tmp$Event.Code[responses] <- as.character(tmp$Code[events])
    > tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events])
    > tmp$Duration[responses] <- as.character(tmp$Duration[events])
    > tmp$Uncertainty.1[responses] <- as.character(tmp$Uncertainty.1[events])
    > tmp$ReqTime[responses] <- as.character(tmp$ReqTime[events])
    > tmp$ReqDur[responses] <- as.character(tmp$ReqDur[events])
    > tmp$Pair.Index[responses] <- as.character(tmp$Pair.Index[events])

Moving Information

    > head(tmp)
       Subject Trial Event.Type     Code   Time TTime Uncertainty Duration
    6   PRE001     7    Picture RO09.jpg 168954     0           1    10197
    7   PRE001     7   Response        2 178963 10009           1    10197
    11  PRE001    12    Picture RO20.jpg 230338     0           1     8398
    12  PRE001    12   Response        1 238680  8342           1     8398
    16  PRE001    17    Picture RS28.jpg 289723     0           1     8198
    17  PRE001    17   Response        2 297789  8066           1     8198
       Uncertainty.1 ReqTime ReqDur Stim.Type Pair.Index    Type Event.Code
    6              2       0   next incorrect          7    <NA>       <NA>
    7              2       0   next incorrect          7 Picture   RO09.jpg
    11             2       0   next incorrect         12    <NA>       <NA>
    12             2       0   next incorrect         12 Picture   RO20.jpg
    16             2       0   next       hit         17    <NA>       <NA>
    17             2       0   next       hit         17 Picture   RS28.jpg

Keep response lines

    > tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),]
    > tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),]
    > head(tmp)
       Subject Trial Event.Type Code   Time TTime Uncertainty Duration
    7   PRE001     7   Response    2 178963 10009           1    10197
    12  PRE001    12   Response    1 238680  8342           1     8398
    17  PRE001    17   Response    2 297789  8066           1     8198
    22  PRE001    22   Response    1 351321 10811           1    10997
    27  PRE001    27   Response    2 403607   713           1      800
    32  PRE001    32   Response    1 467793 23709           1    23794
       Uncertainty.1 ReqTime ReqDur Stim.Type Pair.Index    Type Event.Code
    7              2       0   next incorrect          7 Picture   RO09.jpg
    12             2       0   next incorrect         12 Picture   RO20.jpg
    17             2       0   next       hit         17 Picture   RS28.jpg
    22             2       0   next       hit         22 Picture   AT26.jpg
    27             2       0   next       hit         27 Picture   RS23.jpg
    32             2       0   next       hit         32 Picture   OF04.jpg

The Function

  • it would be tedious to repeat every step for each of the files
  • if we look through the steps, the only important thing that we have to change is the file name
  • so we rather use a canned version of our procedure that depends on the file name and the number of lines to skip: we create a function read.file(file, skip)

    tmp <- read.table(file,skip = skip,sep = "\t",
                      header=T,na.strings = c(" +",""),fill=T)

The Function (continued)

    tmp <- tmp[!is.na(tmp$Subject),]
    tmp <- lapply(tmp,function(x) {
        if( is.character(x) || is.factor(x) ){
            x <- factor(gsub(" ","",as.character(x)))
            return(x)}else{ return(x) }})
    tmp <- as.data.frame(tmp)

The Function (continued)

    pause <- which(tmp$Event.Type=="Picture" & tmp$Code=="Pause")
    if(length(pause)>0){
        drei <- which(tmp$Code==3 & !is.na(tmp$Code))
        drei <- drei[drei > pause][1:2]
        if(pause + 1 < drei[1]){
            tmp <- tmp[-(pause:drei[2]),]
        }}
    tmp <- tmp[!(tmp$Event.Type %in% c("Pause","Resume")), ]

The Function (continued)

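    first.pic <- min(which(tmp$Event.Type=="Picture" & !is.na(tmp$Event.Type))) - 1
    tmp <- tmp[-(1:first.pic),]
    last.pic <- min(which(tmp$Event.Type=="Picture" & !is.na(tmp$Event.Type) & tmp$Code=="Fertig!" & !is.na(tmp$Code)))
    tmp <- tmp[-(last.pic:nrow(tmp)),]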

The Function (continued)

    zeilen <- which(tmp$Event.Type %in% c("Response"))
    zeilen <- sort(unique(c(zeilen,zeilen-1)))
    zeilen <- zeilen[zeilen>0]
    tmp <- tmp[zeilen,]
    responses <- which(tmp$Code %in% c(1,2))
    events <- responses-1
    tmp$Type <- NA
    tmp$Type[responses] <- as.character(tmp$Event.Type[events])

The Function (continued)

    tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),]
    tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),]

The Function (continued)

  • we can now use this function to read in the file
  • and get the processed data frame in one step
  • by setting the parameter skip we can read both versions of the file (and should get the same result)
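
The idea of wrapping the steps in a function with file and skip as parameters can be sketched on a much smaller scale. The function read.simple() below is a hypothetical, heavily simplified stand-in for read.file(): it covers only the first two steps (reading the file and dropping rows without a Subject), and the two temporary files are made up for the illustration.

```r
## hypothetical, simplified stand-in for read.file():
## read the file, report its name, drop rows without a Subject
read.simple <- function(file, skip = 0) {
    print(paste("read", file))
    tmp <- read.table(file, skip = skip, sep = "\t", header = TRUE,
                      na.strings = c(" +", ""), fill = TRUE)
    tmp[!is.na(tmp$Subject), ]
}

## two made-up versions of the same file:
## the second one has three extra lines on top
body <- c("Subject\tCode", "PRE001\t1", "\t5", "PRE001\t2")
f0 <- tempfile(fileext = ".txt")
f3 <- tempfile(fileext = ".txt")
writeLines(body, f0)
writeLines(c("junk", "junk", "junk", body), f3)

a <- read.simple(f0, skip = 0)
b <- read.simple(f3, skip = 3)
identical(a, b)   # both versions should give the same data frame
```

The value of the last expression in the function body is what the function returns, so no explicit return() is needed here.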

The Function Exercise

  • run the function using source()
  • use the function to read in ../session1/session1data/pre001.txt and data/pretest/pre_001.txt
  • use some summary functions like table() or summary() to check if they contain the same information

We will soon learn about a function that compares data frames more exactly.

The Function (continued)

    > file <- "../session1/session1data/pre001.txt"
    > pre1 <- read.file(file,skip=3)
    [1] "read ../session1/session1data/pre001.txt"
    > file <- "data/pretest/pre_001.txt"
    > pre1v2 <- read.file(file,skip=0)
    [1] "read ../session2/data/pretest/pre_001.txt"

rbind()

  • rbind() combines two data frames (or matrices) by adding rows; the column names and types must be the same in both objects

    > x <- data.frame(id=1:3,score=rnorm(3))
    > y <- data.frame(id=13:15,score=rnorm(3))
    > rbind(x,y)
      id       score
    1  1  0.71121163
    2  2 -0.62973249
    3  3  1.17737595
    4 13 -0.45074940
    5 14 -0.01044197
    6 15 -1.05217176

cbind()

  • cbind() combines two data frames (or matrices) by adding columns; the number of rows must be the same in both objects

    > cbind(x,y)
      id      score1      score2     score3
    1  1  0.11440705  0.14536778 -1.1773241
    2  2 -1.62862651  0.02020604  0.5686415
    3  3  0.05335811  0.25462270  0.8844987
    4  4 -0.19931734  0.15625511  0.9287316
    5  5 -1.15217836 -1.79804503 -0.7550234
  • it is not recommended to use cbind() to combine data frames
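
One reason for this advice can be seen in a small sketch with made-up data (the values below are assumptions chosen for illustration): cbind() pairs rows purely by position and keeps duplicated column names, whereas merge() matches rows by a key.

```r
## two made-up data frames; note the different row order of id
x <- data.frame(id = 1:3, score = c(0.5, 1.2, -0.3))
y <- data.frame(id = c(3, 1, 2), score = c(9, 7, 8))

both <- cbind(x, y)
names(both)   # "id" "score" "id" "score": names are simply concatenated
both          # row 1 pairs id 1 from x with id 3 from y

## merge() matches on id instead of on row position
joined <- merge(x, y, by = "id", suffixes = c(".x", ".y"))
joined
```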

merge()

  • merge() is the command of choice for merging or joining data frames
  • it is the equivalent of JOIN in SQL
  • there are four cases
    • inner join
    • left outer join
    • right outer join
    • full outer join

    > (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3)))
      id day1
    1  A    3
    2  B    4
    3  C    5
    > (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4)))
      id day2
    1  A    7
    2  C   10
    3  E    3
    4  F    6

inner join

  • inner join means: keep only the cases present in both of the data frames

    > merge(d1,d2)
      id day1 day2
    1  A    3    7
    2  C    5   10

left outer join

  • left outer join means: keep all cases of the left data frame no matter if they are present in the right data frame (all.x=T)

    > merge(d1,d2,all.x = T)
      id day1 day2
    1  A    3    7
    2  B    4   NA
    3  C    5   10

right outer join

  • right outer join means: keep all cases of the right data frame no matter if they are present in the left data frame (all.y=T)

    > merge(d1,d2,all.y = T)
      id day1 day2
    1  A    3    7
    2  C    5   10
    3  E   NA    3
    4  F   NA    6

full outer join

  • full outer join means: keep all cases of both data frames (all=T)

    > merge(d1,d2,all = T)
      id day1 day2
    1  A    3    7
    2  B    4   NA
    3  C    5   10
    4  E   NA    3
    5  F   NA    6

merge()

  • if not stated otherwise, R merges on the intersection of the column names of both data frames, in our case only id
  • you can specify these columns directly with by=c("colname1","colname2") if the columns are named identically, or
  • with by.x=c("colname1.x","colname2.x"), by.y=c("colname1.y","colname2.y") if they are named differently
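
A minimal sketch of the by.x/by.y case, assuming two made-up data frames whose key columns carry different names (id and code are chosen for illustration only):

```r
## the key column is called "id" in d1 but "code" in d2
d1 <- data.frame(id = c("A", "B", "C"), day1 = c(3, 4, 5))
d2 <- data.frame(code = c("A", "C", "E"), day2 = c(7, 10, 3))

## inner join on d1$id == d2$code; the key keeps the name from the left side
m <- merge(d1, d2, by.x = "id", by.y = "code")
m
```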

merge() Exercise

  • now read in the file personendaten.txt using the appropriate command
  • join the demographics with our pre1 data frame (even though it does not make sense now)

merge() Exercise

    > persdat <- read.table("../session1/session1data/personendaten.txt",
    +                       sep="\t",
    +                       header=T)
    > pre1 <- merge(persdat,pre1,all.y = T)
    > head(pre1)
      Subject Sex Age_PRETEST Trial Event.Type Code   Time TTime Uncertainty
    1  PRE001   f        3.11     7   Response    2 178963 10009           1
    2  PRE001   f        3.11    12   Response    1 238680  8342           1
    3  PRE001   f        3.11    17   Response    2 297789  8066           1
    4  PRE001   f        3.11    22   Response    1 351321 10811           1
    5  PRE001   f        3.11    27   Response    2 403607   713           1
    6  PRE001   f        3.11    32   Response    1 467793 23709           1
      Duration Uncertainty.1 ReqTime ReqDur Stim.Type Pair.Index    Type Event.Code
    1    10197             2       0   next incorrect          7 Picture   RO09.jpg
    2     8398             2       0   next incorrect         12 Picture   RO20.jpg
    3     8198             2       0   next       hit         17 Picture   RS28.jpg
    4    10997             2       0   next       hit         22 Picture   AT26.jpg
    5      800             2       0   next       hit         27 Picture   RS23.jpg
    6    23794             2       0   next       hit         32 Picture   OF04.jpg

Reduce()

  • is a higher order function (functional)
  • Reduce() uses a binary function (like rbind() or merge()) to combine successively the elements of a given list
  • it can be used if you have not only two but many data frames
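
Before combining data frames, it may help to see what Reduce() does on simple values. The following sketch uses only base R; the list pieces is made up for the illustration.

```r
## Reduce() folds a binary function over a vector or list from left to right
total <- Reduce(`+`, 1:4)                      # ((1 + 2) + 3) + 4
steps <- Reduce(`+`, 1:4, accumulate = TRUE)   # keeps the intermediate results

## the same idea with rbind() and a list of small data frames
pieces <- list(data.frame(id = 1:2), data.frame(id = 3:4), data.frame(id = 5L))
stacked <- Reduce(rbind, pieces)               # rbind(rbind(p1, p2), p3)
stacked
```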

Reduce()

  • first we make up 4 artificial data frames

Reduce()

    > (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3)))
      id day1
    1  A    3
    2  B    1
    3  C    7
    > (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4)))
      id day2
    1  A    8
    2  C    2
    3  E    5
    4  F    3
    > (d3 <- data.frame(id=LETTERS[c(2,4:6)],day3=sample(10,4)))
      id day3
    1  B    8
    2  D    3
    3  E    4
    4  F   10
    > (d4 <- data.frame(id=LETTERS[c(1:5)],day4=sample(10,5)))
      id day4
    1  A    2
    2  B    7
    3  C    8
    4  D    9
    5  E    1

Reduce()

  • now we use Reduce() in combination with merge()

    > Reduce(merge,list(d1,d2,d3,d4))
    [1] id   day1 day2 day3 day4
  • what we get is an empty data frame
  • this isn't what we wanted, so why does it happen?
  • merge() defaults to all=F (an inner join), so only ids present in all four data frames are kept, and in this case there are none
  • so we have to define a wrapper function that changes only this argument to all=T

Reduce()

  • now we use Reduce() in combination with our wrapper around merge()

    > Reduce(function(x,y) { merge(x,y, all=T) },
    +        list(d1,d2,d3,d4))
      id day1 day2 day3 day4
    1  A    3    8   NA    2
    2  B    1   NA    8    7
    3  C    7    2   NA    8
    4  E   NA    5    4    1
    5  F   NA    3   10   NA
    6  D   NA   NA    3    9
  • which is exactly what we want

Reduce()

  • a second example in combination with rbind()

    > d4$day <- names(d4)[2]
    > names(d4)[2] <- "score"
    > Reduce(function(x,y) { y$day <- names(y)[2]
    +                        names(y)[2] <- "score"
    +                        rbind(x,y) } ,
    +        list(d1,d2,d3), init = d4)
       id score  day
    1   A     2 day4
    2   B     7 day4
    3   C     8 day4
    4   D     9 day4
  • which is exactly what we want

A second function

  • well that's better, but it is still boring to do this for every single file
  • so let us use what we have learned: the combination of lapply() and Reduce() can do the work
  • using dir() we get all the files contained in a given directory
  • then we use lapply() together with our new function read.file()

dir()

  • dir() without additional argument shows all files/directories in the working directory

    > dir()
     [1] "data"                  "function.r"            "function.r~"
     [4] "ggp1.pdf"              "graphics.r"            "linkimage.aux"
     [7] "session2apply.aux"     "session2apply.log"     "session2apply.nav"
    [10] "session2apply.out"     "session2apply.pdf"     "session2apply.snm"
    [13] "session2apply.tex"     "session2apply.tex~"    "#session2apply.tex#"
    [16] "session2apply.toc"     "session2apply.vrb"     "session2hadley.aux"
    [19] "session2hadley.log"    "session2hadley.nav"    "session2hadley.out"
    [22] "session2hadley.pdf"    "session2hadley.snm"    "session2hadley.tex"
    [25] "session2hadley.tex~"   "session2hadley.toc"    "session2hadley.vrb"
    [28] "solutionssession1.r"   "solutionssession1.r~"  "solutionssession2.r"
    [31] "solutionssession2.r~"  "#solutionssession2.r#"

dir()

  • given a path, dir() will show the content of the respective folder

    > dir("data")
     [1] "posttest"   "pretest"    "training_1" "training_2" "training_3"
     [6] "training_4" "training_5" "training_6" "training_7" "training_8"

dir()

  • setting recursive to TRUE makes R descend into all subdirectories

    > dir("data",recursive = T)
    [1] "posttest/post_001.txt"      "posttest/post_002.txt"
    [3] "posttest/post_003.txt"      "posttest/post_004.txt"
    [5] "posttest/post_005.txt"      "posttest/post_006.txt"
    [7] "posttest/post_007.txt"      "posttest/post_008.txt"

dir()

  • setting full.names to TRUE R will give the full path

    > dir("data",recursive = T, full.names = T)
    [1] "data/posttest/post_001.txt"      "data/posttest/post_002.txt"
    [3] "data/posttest/post_003.txt"      "data/posttest/post_004.txt"
    [5] "data/posttest/post_005.txt"      "data/posttest/post_006.txt"

dir()

  • with pattern we can specify which files to show (as a regular expression), e.g. all .r files

    > dir(pattern = "\\.r$")
    [1] "function.r"          "graphics.r"          "solutionssession1.r"
    [4] "solutionssession2.r"

dir() Exercise

  • create a variable files containing the names of all text files in the data directory. My editor creates temporary files whose names begin and end with a hash sign; make sure they are not contained in the list

dir() Exercise

    > dir("data",full.names = T, recursive = T,pattern = "txt$"
    + )
    [1] "data/posttest/post_001.txt"    "data/posttest/post_002.txt"
    [3] "data/posttest/post_003.txt"    "data/posttest/post_004.txt"
    [5] "data/posttest/post_005.txt"    "data/posttest/post_006.txt"
    > files <- dir("data",full.names = T, recursive = T,pattern = "txt$")
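
Why the anchored pattern excludes the editor's temporary files can be checked in a throwaway directory. The sketch below creates a few empty, made-up files under tempdir(); the folder name dirdemo is an assumption for the illustration.

```r
## scratch directory with one real text file, two editor leftovers, one R file
d <- file.path(tempdir(), "dirdemo")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, c("a.txt", "#a.txt#", "a.txt~", "b.r")))

## "$" anchors the match at the end of the name, so "#a.txt#" and "a.txt~"
## (which end in "#" and "~") do not match
txt.files <- dir(d, pattern = "\\.txt$", full.names = TRUE)
basename(txt.files)
```

Escaping the dot as "\\." additionally makes sure that only a literal ".txt" ending matches.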

Read all files

Now we use lapply() and our function read.file() to read all files in files

    > df.list <- lapply(files,read.file,skip=0)
    [1] "read data/posttest/post_001.txt"
    [1] "read data/posttest/post_002.txt"
    [1] "read data/posttest/post_003.txt"
    [1] "read data/posttest/post_004.txt"
    [1] "read data/posttest/post_005.txt"

Reading all files

  • the object df.list is a list containing 192 data frames

    > sapply(df.list,class)
     [1] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
     [6] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [11] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [16] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [21] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [26] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [31] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [36] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [41] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [46] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [51] "data.frame" "NULL"       "data.frame" "data.frame" "data.frame"
    [56] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [61] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [66] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
    [71] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
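
The output above contains one "NULL" entry, so one call did not return a data frame. A small sketch of how such an entry could be traced back to its file, using made-up stand-ins (df.list.toy and files.toy are hypothetical names) instead of the real df.list and files:

```r
## toy stand-ins: the second "file" produced no data frame
df.list.toy <- list(data.frame(a = 1), NULL, data.frame(a = 2))
files.toy <- c("f1.txt", "f2.txt", "f3.txt")

## is.null() is applied to every list element; the logical vector
## then indexes the file names
failed <- sapply(df.list.toy, is.null)
files.toy[failed]
```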

The Function 2

  • in a last step we use Reduce() to combine these 192 data frames

    > data <- Reduce(rbind,df.list)
    > nrow(data)
    [1] 12704
    > table(data$Subject)
    001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2
    93        91        96        93        95        95        93        96
    009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2
    92        94        95        96        96        95        96        94
    017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1
    95        94        96        95        95        95        96        94
    005_test1 006_test1 007_test1 008_test1 009_test1 010_test1 011_test1 012_test1
    96        95        94        90        96        95        91        96
    013_test1 014_test1 015_test1 016_test1 017_test1 018_test1 019_test1 020_test1
    95        96        95        91        96        96        96        96
    001_1     002_1     003_1     004_1     005_1     006_1    CHGU_1    008_1a
    60        59        60        54        60        59        60        60
    009_1     010_1     RMK_1     013_1     014_1     015_1     016_1    IJ2K_1
    60        60        59        59        60        58        59        58
    018_1     019_1     020_1     001_2     002_2     003_2     004_2     005_2
    60        59        60        59        59        57        58        57
    006_2     007_2     008_2     009_2     010_2     011_2     012_2     013_2
    58        58        54        58        58        59        59        56
    014_2     015_2     016_2     017_2     018_2     019_2     020_2     001_3

The Function no 2

  • so it is advisable to wrap these steps in a function again

    > read.files <- function(filesdir,skip=3,recursive=F,pattern="."){
    +     files <- dir(filesdir,
    +                  full.names = T,
    +                  recursive = recursive,
    +                  pattern = pattern)
    +     Reduce(rbind,lapply(files,read.file,skip=skip))}
    > data <- read.files("data",recursive = T,skip=0,pattern = "\\.txt$")
    [1] "read data/posttest/post_001.txt"
    [1] "read data/posttest/post_002.txt"
    [1] "read data/posttest/post_003.txt"
    [1] "read data/posttest/post_004.txt"
    [1] "read data/posttest/post_005.txt"

The Subject column

  • table the Subject column again. What is the problem?

The Subject column

    > table(data$Subject)
    001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2
    93        91        96        93        95        95        93        96
    009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2
    92        94        95        96        96        95        96        94
    017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1
    95        94        96        95        95        95        96        94
  • subject and time are coded in one variable

The Subject column

  • we create two new variables using the str_split() function (stringr package)
  • because str_split() returns a list containing a vector, we have to use it in combination with sapply()
  • then we correct some of the person ids

    > library(stringr)
    > data$persid <- sapply(data$Subject,function(x)
    +     str_split(x,pattern = "_")[[1]][1])
    > data$testid <- sapply(data$Subject,function(x)
    +     str_split(x,pattern = "_")[[1]][2])
    > data$persid[data$persid=="CHGU"] <- "007"
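
The list-of-vectors result can be inspected on a few made-up subject codes. This sketch swaps in base R's strsplit() for stringr's str_split() so that it runs without extra packages; like str_split(), it returns a list with one character vector per input element.

```r
## made-up subject codes in the same "person_test" format
subject <- c("001_test2", "CHGU_1", "008_1a")

parts <- strsplit(subject, "_")    # list of character vectors
parts[[2]]                         # pieces of the second code: "CHGU" "1"

## pick the first and second piece of every list element
persid <- sapply(parts, `[`, 1)
testid <- sapply(parts, `[`, 2)
persid
testid
```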

The Subject column Exercises

  • there are some more wrong person ids: RMK - 011, IJ2K - 017, GA3K - 004, Kj6K - 006. Correct them!

The Subject column Exercises

    > data$persid[data$persid=="RMK"] <- "011"
    > data$persid[data$persid=="IJ2K"] <- "017"
    > data$persid[data$persid=="GA3K"] <- "004"
    > data$persid[data$persid=="Kj6K"] <- "006"

Merging

  • now read in the file subjectdemographics.txt using the appropriate command
  • join the demographics with our data frame data (there is a little problem left: compare the persid and Subject columns)

The Subject column Exercises

    > persdat <- read.table("data/subjectdemographics.txt",
    +                       sep="\t",
    +                       header=T)
    > persdat$Subject
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    > unique(data$persid)
     [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011" "012"
    [13] "013" "014" "015" "016" "017" "018" "019" "020"
    > data$persid <- as.numeric(data$persid)
    > unique(data$persid)
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    > data <- merge(persdat,data,by.x = "Subject",by.y = "persid",all=T)
    Warning message:
    In merge.data.frame(persdat, data, by.x = "Subject", by.y = "persid",  :
      column name 'Subject' is duplicated in the result
    > head(data)
      Subject Sex Age_PRETEST   Subject Trial Event.Type Code   Time TTime
    1       1   f        3.11 001_test2     7   Response    2 103745  2575
    2       1   f        3.11 001_test2    12   Response    2 156493  2737
    3       1   f        3.11 001_test2    17   Response    2 214772  6630
    4       1   f        3.11 001_test2    22   Response    1 262086  5957
    5       1   f        3.11 001_test2    27   Response    2 302589   272
    6       1   f        3.11 001_test2    32   Response    1 352703  7197

Summary Graphics

Just run the code and try to understand it. We will cover the ggplot graphics in the next session.

    > ggplot(data,aes(x=factor(Subject),fill=..count..)) +
    +     geom_bar() +
    +     facet_wrap(~testid)

[Figure: sesssion2/graph1.png]

Summary Graphics

  • so there are problems in the coding of the test id
  • we remove the letters at the end using str_replace()

    > data$testid <- str_replace(data$testid,"[a-z]$","")
    > data$testid <- factor(data$testid,
    +                       levels=c("test1","1","2","3","4","5","6","7","8","test2"))
    > table(data$Subject,data$testid)
       test1  1  2  3  4  5  6  7  8 test2
    1     95 60 59 60 59 59 60 60 60    93
    2     95 59 59 58 60 60 60 60 60    91
    3     96 60 57 60 60 60 60 59 58    96
    4     94 54 58 60 60 55 53 60 58    93
    5     96 60 57 60 60 60 60 60  0    95
    6     95 59 58 59 58 59 55 54 55    95
    7     94 60 58 60 58 59 60 59 59    93
    8     90 60 54 55 60 60 60 59 60    96
    9     96 60 58 59 57  0  0 58 56    92
    10    95 60 58 58 60 60 58  0  0    94
    11    91 59 59 60 60 57 58 60 60    95
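
What the pattern "[a-z]$" does can be tried on a few made-up test ids. The sketch uses base R's sub() in place of str_replace(); both replace the first match of the pattern, here a single lower-case letter at the end of the string.

```r
## made-up test ids, one of them with a stray trailing letter
testid <- c("test1", "1a", "2", "test2")

## "test1" and "test2" end in a digit, so only "1a" is changed
cleaned <- sub("[a-z]$", "", testid)
cleaned

## fixing the level order puts the pretest first and the posttest last
factor(cleaned, levels = c("test1", "1", "2", "test2"))
```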

Summary Graphics

    > ggplot(data,aes(x=factor(Subject),fill=..count..)) +
    +     geom_bar() +
    +     facet_wrap(~testid)

[Figure: sesssion2/graph2.png]

Summary Graphics

And another one.

    > ggplot(data,aes(x=testid,fill=Stim.Type)) +
    +     geom_bar(position=position_fill()) +
    +     facet_wrap(~Subject)

[Figure: sesssion2/graph3.png]

RstatisTik/RstatisTikPortal/RcourSe/CourseOutline/FunctionsInR/ApplyR (last edited on 2015-05-01 10:48:36 by mandy.vogel@googlemail.com)