Introduction
Implicit Loops
A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure. In R this is done by the functions:
- lapply()
- sapply()
- apply()
- tapply()
lapply()
- The functions lapply() and sapply() are similar: their first argument can be a list, data frame, matrix or vector, and the second argument is the function to "apply". The former returns a list (hence the "l") and the latter tries to simplify the result (hence the "s"). For example:
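A minimal sketch of the difference, using a made-up list (the names scores, means_list and means_vec are only for illustration):

```r
# Toy list of numeric vectors (made up for illustration)
scores <- list(a = 1:5, b = c(2, 4, 6), c = 10)

# lapply() always returns a list with one element per input element
means_list <- lapply(scores, mean)
str(means_list)

# sapply() simplifies the same result to a named numeric vector
means_vec <- sapply(scores, mean)
means_vec
```

Because every call to mean() returns a single number here, sapply() can collapse the result to a vector; if the results had different lengths it would fall back to a list.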
apply()
- apply() can be applied to an array. Its first argument is the array, the second is the dimension(s) over which we want to apply a function, and the third is the function. For example:
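As a rough illustration, assuming a small three-dimensional array x built only for this sketch:

```r
# A toy 3-dimensional array: 2 rows, 3 columns, 4 layers
x <- array(1:24, dim = c(2, 3, 4))

layer_sums <- apply(x, 3, sum)        # one sum per layer (third dimension)
cell_sums  <- apply(x, c(1, 2), sum)  # sum over the layers for each row/column cell
layer_q    <- apply(x, 3, quantile)   # quantiles per layer, one column per layer

layer_sums
dim(cell_sums)
dim(layer_q)
```

The second argument picks the dimension(s) that are kept; everything else is handed to the function in one piece.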
tapply()
- The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.
For example, in the quine data frame we can summarize Days classified by Eth and Lrn as follows:
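One possible call, assuming the quine data frame is taken from the MASS package and that the mean is the summary of interest:

```r
library(MASS)  # provides the quine data frame

# Mean number of days absent for each combination of Eth and Lrn
days_by_group <- with(quine, tapply(Days, list(Eth, Lrn), mean))
days_by_group
```

The result is a table with one row per level of Eth and one column per level of Lrn.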
- the class() function shows the class of an object; use it in combination with lapply() to get the classes of the columns of the quine data frame
- do the same with sapply(); what is the difference?
- try to combine this with what you learned about indexing and create a new data frame quine2 containing only the columns which are factors
- calculate the row and column means of the matrix m defined below using the apply() function (PS: in real-life applications use the rowMeans() and colMeans() functions)
m <- matrix(rnorm(100),nrow=10)
- use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines)
- sometimes the aggregate() function is more convenient; note the use of ~, which is read as 'is dependent on' and is used extensively in modelling
> aggregate(Days ~ Sex + Eth, data=quine,mean)
  Sex Eth     Days
1   F   A 20.92105
2   M   A 21.61290
3   F   N 10.07143
4   M   N 14.71429
> aggregate(Days ~ Sex + Eth, data=quine,summary)
  Sex Eth Days.Min. Days.1st Qu. Days.Median Days.Mean Days.3rd Qu. Days.Max.
1   F   A      0.00         5.25       13.50     20.92        30.25     81.00
2   M   A      2.00         9.50       16.00     21.61        33.00     57.00
3   F   N      0.00         5.00        7.00     10.07        14.00     37.00
4   M   N      0.00         3.50        8.00     14.71        19.50     69.00
Functions
Every function in R has three important characteristics:
- a body (the code inside the function) - body()
- arguments (the list of arguments which controls how you can call the function) - formals()
- an environment (the “map” of the location of the function’s variables) - environment()
You can see all three parts if you type the name of the function without brackets. Exceptions are primitives. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.
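A short sketch of the three accessors, using a hypothetical function add_one() defined only for this illustration:

```r
# A small user-defined function (hypothetical, for illustration)
add_one <- function(x, step = 1) {
  x + step
}

body(add_one)         # the code inside the function
formals(add_one)      # the argument list, including the default for step
environment(add_one)  # the environment in which the function was defined

# Primitive functions have none of the three parts
is.primitive(sum)
is.null(formals(sum))
is.null(body(sum))
is.null(environment(sum))
```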
Function Arguments
Arguments are matched
- first by exact name (perfect matching)
- then by prefix matching
- and finally by position.
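The three matching rules can be tried out on a toy function; show_args() and its argument names are made up for this sketch:

```r
# Hypothetical function that simply reports how its arguments were matched
show_args <- function(alpha, beta, gamma = 0) {
  c(alpha = alpha, beta = beta, gamma = gamma)
}

by_position <- show_args(1, 2, 3)         # matched purely by position
by_name     <- show_args(beta = 2, 1, 3)  # beta by exact name, the rest by position
by_prefix   <- show_args(ga = 3, 1, 2)    # "ga" is a unique prefix of gamma

by_position
by_name
by_prefix
```

All three calls give the same result; relying on prefix matching makes code harder to read, so full names are usually the safer choice.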
By default, R function arguments are lazy: they are only evaluated if they are actually used:
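A small sketch of lazy evaluation; the functions lazy() and strict() are hypothetical and exist only to show the effect:

```r
# y is never used, so the stop() call passed as y is never evaluated
lazy <- function(x, y) {
  x * 2
}
lazy_result <- lazy(21, stop("never evaluated"))
lazy_result

# force() evaluates the argument immediately, so the error is raised
strict <- function(x, y) {
  force(y)
  x * 2
}
strict_result <- tryCatch(strict(21, stop("now evaluated")),
                          error = function(e) conditionMessage(e))
strict_result
```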
Function Exercises (Verzani)
- Write a function to compute the average distance from the mean for some data vector.
- Write a function f() which finds the average of the x values after squaring and subtracts the square of the average of the numbers. Verify this output will always be non-negative by computing f(1:10)
- An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x %% 2. Use this to write a function iseven(). How would you write isodd()?
- Write a function isprime() that checks if a number x is prime by dividing x by all values 2, ..., x-1, then checking to see if there is a remainder of 0.
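One possible set of solutions, given as a sketch rather than as the reference answers; the name avg_dist() and the handling of x < 4 in isprime() are choices made for this illustration:

```r
# Average absolute distance from the mean
avg_dist <- function(x) {
  mean(abs(x - mean(x)))
}

# Mean of the squares minus the square of the mean
f <- function(x) {
  mean(x^2) - mean(x)^2
}

iseven <- function(x) x %% 2 == 0
isodd  <- function(x) x %% 2 == 1

# Trial division by 2, ..., x-1; small values are handled separately
# because 2:(x - 1) would count downwards for x = 2
isprime <- function(x) {
  if (x < 2) return(FALSE)
  if (x < 4) return(TRUE)
  all(x %% 2:(x - 1) != 0)
}

f(1:10)
isprime(7)
isprime(9)
```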
Read in the file
Remove empty line
> tmp <- tmp[!is.na(tmp$Subject),]
Remove spaces
Remove unnecessary spaces from character vectors/factors
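The code for this step is not shown here. One way it could be done, assuming a toy data frame and a hypothetical helper trim_col(), is to trim every character or factor column with lapply():

```r
# Toy data with stray spaces (made up for illustration)
tmp <- data.frame(Subject = c(" PRE001", "PRE001 "),
                  Trial = c(7, 12),
                  Event.Type = factor(c("Picture ", " Response")),
                  stringsAsFactors = FALSE)

# Hypothetical helper: trim character vectors and factor levels,
# leave all other column types untouched
trim_col <- function(col) {
  if (is.factor(col)) {
    levels(col) <- trimws(levels(col))
    col
  } else if (is.character(col)) {
    trimws(col)
  } else {
    col
  }
}

# tmp[] keeps the data frame structure while replacing every column
tmp[] <- lapply(tmp, trim_col)
str(tmp)
```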
Find/Remove breaks
Find/Remove first/last rows
Extract Responses
> responses <- which(tmp$Code %in% c(1,2))
> events <- responses-1
> tmp$Type <- NA
> tmp$Type[responses] <- as.character(tmp$Event.Type[events])
> head(tmp)
   Subject Trial Event.Type     Code   Time TTime Uncertainty Duration
6   PRE001     7    Picture RO09.jpg 168954     0           1    10197
7   PRE001     7   Response        2 178963 10009           1       NA
11  PRE001    12    Picture RO20.jpg 230338     0           1     8398
12  PRE001    12   Response        1 238680  8342           1       NA
16  PRE001    17    Picture RS28.jpg 289723     0           1     8198
17  PRE001    17   Response        2 297789  8066           1       NA
6   2  0 next incorrect  7    <NA>
7  NA NA <NA>      <NA> NA Picture
11  2  0 next incorrect 12    <NA>
12 NA NA <NA>      <NA> NA Picture
16  2  0 next       hit 17    <NA>
17 NA NA <NA>      <NA> NA Picture
Moving Information
Moving all (necessary) information to the response lines.
> tmp$Event.Code <- NA
> tmp$Event.Code[responses] <- as.character(tmp$Code[events])
> tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events])
> tmp$Duration[responses] <- as.character(tmp$Duration[events])
> tmp$Uncertainty.1[responses] <- as.character(tmp$Uncertainty.1[events])
> tmp$ReqTime[responses] <- as.character(tmp$ReqTime[events])
> tmp$ReqDur[responses] <- as.character(tmp$ReqDur[events])
> tmp$Pair.Index[responses] <- as.character(tmp$Pair.Index[events])
> head(tmp)
   Subject Trial Event.Type     Code   Time TTime Uncertainty Duration
6   PRE001     7    Picture RO09.jpg 168954     0           1    10197
7   PRE001     7   Response        2 178963 10009           1    10197
11  PRE001    12    Picture RO20.jpg 230338     0           1     8398
12  PRE001    12   Response        1 238680  8342           1     8398
16  PRE001    17    Picture RS28.jpg 289723     0           1     8198
17  PRE001    17   Response        2 297789  8066           1     8198
6  2 0 next incorrect  7    <NA>     <NA>
7  2 0 next incorrect  7 Picture RO09.jpg
11 2 0 next incorrect 12    <NA>     <NA>
12 2 0 next incorrect 12 Picture RO20.jpg
16 2 0 next       hit 17    <NA>     <NA>
17 2 0 next       hit 17 Picture RS28.jpg
Keep response lines
> tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),]
> tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),]
> head(tmp)
   Subject Trial Event.Type Code   Time TTime Uncertainty Duration
7   PRE001     7   Response    2 178963 10009           1    10197
12  PRE001    12   Response    1 238680  8342           1     8398
17  PRE001    17   Response    2 297789  8066           1     8198
22  PRE001    22   Response    1 351321 10811           1    10997
27  PRE001    27   Response    2 403607   713           1      800
32  PRE001    32   Response    1 467793 23709           1    23794
7  2 0 next incorrect  7 Picture RO09.jpg
12 2 0 next incorrect 12 Picture RO20.jpg
17 2 0 next       hit 17 Picture RS28.jpg
22 2 0 next       hit 22 Picture AT26.jpg
27 2 0 next       hit 27 Picture RS23.jpg
32 2 0 next       hit 32 Picture OF04.jpg
The Function
- it would be tedious to repeat every step for each of the files
- if we look through the steps, the only important thing that we have to change is the file name
- so we rather use a canned version of our procedure that depends on the file name and the number of lines to skip: we create a function read.file(file):
tmp <- read.table(file,skip = skip,sep = "\t",
The Function (continued)
- we can now use this function to read in the file
- and get the processed data frame in one step
- by setting the parameter skip we can read both versions of the file (and should get the same result)
The Function Exercise
- load the function using source()
- use the function to read in ../session1/session1data/pre001.txt and data/pretest/pre_001.txt
- use summary functions such as table() or summary() to check whether they contain the same information

We will soon learn about a function that compares data frames more exactly.
rbind()
- rbind() combines two data frames (or matrices) by adding rows; the column names and types must be the same in both objects
cbind()
- cbind() combines two data frames (or matrices) by adding columns; the number of rows must be the same in both objects
- using cbind() to combine data frames is not recommended, because rows are paired by position only, not by a key
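A minimal sketch of both functions, using small made-up data frames (the names a, b, extra and their contents are assumptions for illustration only):

```r
# Two small made-up data frames with identical column names and types
a <- data.frame(id = c("A", "B"), score = c(1.5, 2.0))
b <- data.frame(id = c("C", "D"), score = c(3.5, 4.0))

# rbind() stacks the rows of b underneath the rows of a
ab <- rbind(a, b)
dim(ab)

# cbind() puts columns side by side; rows are paired by position only,
# so both objects need the same number of rows
extra <- data.frame(group = c("x", "y", "x", "y"))
abx <- cbind(ab, extra)
names(abx)
```

Because cbind() never looks at an id column, a different row order in one of the objects would go unnoticed.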
merge()
- merge() is the command of choice for merging or joining data frames
- it is the equivalent of JOIN in SQL
- there are four cases
- inner join
- left outer join
- right outer join
- full outer join
inner join
- inner join means: keep only the cases present in both of the data frames
left outer join
- left outer join means: keep all cases of the left data frame no matter if they are present in the right data frame (all.x=T)
right outer join
- right outer join means: keep all cases of the right data frame no matter if they are present in the left data frame (all.y=T)
full outer join
- full outer join means: keep all cases of both data frames (all=T)
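The four cases can be sketched with two small made-up data frames that share the key column id (left, right and their values are assumptions for illustration):

```r
# Made-up example data: ids B and C occur in both data frames
left  <- data.frame(id = c("A", "B", "C"), x = 1:3)
right <- data.frame(id = c("B", "C", "D"), y = c(10, 20, 30))

inner   <- merge(left, right)                # inner join: B, C
left_j  <- merge(left, right, all.x = TRUE)  # left outer join: A, B, C
right_j <- merge(left, right, all.y = TRUE)  # right outer join: B, C, D
full    <- merge(left, right, all = TRUE)    # full outer join: A, B, C, D

full  # cells without a partner row are filled with NA
```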
merge()
- unless stated otherwise, R merges on the intersection of the column names of both data frames, in our case only id
- you can specify these columns directly with by=c("colname1","colname2") if the columns are named identically, or
- using by.x=c("colname1.x","colname2.x"),
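A short sketch of merging on key columns that are named differently in the two data frames, using the by.x and by.y arguments of merge(). The data frames persons and trials and their contents are made up for illustration:

```r
# Made-up data: the key is called Subject in one data frame, persid in the other
persons <- data.frame(Subject = c(1, 2, 3), Sex = c("f", "m", "f"))
trials  <- data.frame(persid = c(1, 1, 2), Time = c(100, 200, 150))

# by.x names the key column(s) of the first, by.y those of the second data frame
joined <- merge(persons, trials, by.x = "Subject", by.y = "persid")
names(joined)  # the key column keeps its name from the first data frame
```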
merge() Exercise
- now read in the file personendaten.txt using the appropriate command
- join the demographics with our pre1 data frame (even though it does not make sense now)
merge() Exercise
> persdat <- read.table("../session1/session1data/personendaten.txt",
+                       sep="\t",
+                       header=T)
> pre1 <- merge(persdat,pre1,all.y = T)
> head(pre1)
  Subject Sex Age_PRETEST Trial Event.Type Code   Time TTime Uncertainty
1  PRE001   f        3.11     7   Response    2 178963 10009           1
2  PRE001   f        3.11    12   Response    1 238680  8342           1
3  PRE001   f        3.11    17   Response    2 297789  8066           1
4  PRE001   f        3.11    22   Response    1 351321 10811           1
5  PRE001   f        3.11    27   Response    2 403607   713           1
6  PRE001   f        3.11    32   Response    1 467793 23709           1
  Duration Uncertainty.1 ReqTime ReqDur Stim.Type Pair.Index    Type Event.Code
1    10197             2       0   next incorrect          7 Picture   RO09.jpg
2     8398             2       0   next incorrect         12 Picture   RO20.jpg
3     8198             2       0   next       hit         17 Picture   RS28.jpg
4    10997             2       0   next       hit         22 Picture   AT26.jpg
5      800             2       0   next       hit         27 Picture   RS23.jpg
6    23794             2       0   next       hit         32 Picture   OF04.jpg
Reduce()
- Reduce() is a higher-order function (a functional)
- Reduce() uses a binary function (like rbind() or merge()) to successively combine the elements of a given list
- it can be used if you have not just two but many data frames
Reduce()
- first we make up four artificial data frames
Reduce()
> (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3)))
  id day1
1  A    3
2  B    1
3  C    7
> (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4)))
  id day2
1  A    8
2  C    2
3  E    5
4  F    3
> (d3 <- data.frame(id=LETTERS[c(2,4:6)],day3=sample(10,4)))
  id day3
1  B    8
2  D    3
3  E    4
4  F   10
> (d4 <- data.frame(id=LETTERS[c(1:5)],day4=sample(10,5)))
  id day4
1  A    2
2  B    7
3  C    8
4  D    9
5  E    1
Reduce()
- now we use Reduce() in combination with merge()
- and what we get is an empty data frame
- well, this isn't exactly what we wanted, so why does it happen?
- this is because merge() defaults to all=F, so only ids present in every data frame are kept, and in this case there are none
- so we have to define a wrapper function that only changes this argument to all=T
Reduce()
- now we use Reduce() in combination with our wrapper around merge()
- which is exactly what we want
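The two attempts above can be sketched as follows. The data frames reuse the ids and values printed earlier (fixed here instead of drawn with sample(), so the example is reproducible); the wrapper name merge_all is an assumption, not necessarily the name used in the course:

```r
# The four data frames from above, with fixed values instead of sample()
d1 <- data.frame(id = LETTERS[c(1, 2, 3)],    day1 = c(3, 1, 7))
d2 <- data.frame(id = LETTERS[c(1, 3, 5, 6)], day2 = c(8, 2, 5, 3))
d3 <- data.frame(id = LETTERS[c(2, 4:6)],     day3 = c(8, 3, 4, 10))
d4 <- data.frame(id = LETTERS[1:5],           day4 = c(2, 7, 8, 9, 1))
dfs <- list(d1, d2, d3, d4)

# Plain merge() performs inner joins: no id occurs in all four data frames
inner <- Reduce(merge, dfs)
nrow(inner)

# Hypothetical wrapper that only switches merge() to a full outer join
merge_all <- function(x, y) merge(x, y, all = TRUE)
full <- Reduce(merge_all, dfs)
full  # one row per id A to F, NA where a day is missing
```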
Reduce()
- a second example in combination with rbind()
- which is exactly what we want
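A minimal sketch of Reduce() with rbind(), assuming a made-up list of data frames that all share the same columns:

```r
# Made-up list of data frames with identical column names and types
parts <- list(
  data.frame(id = "A", value = 1),
  data.frame(id = "B", value = 2),
  data.frame(id = "C", value = 3)
)

# Reduce() computes rbind(rbind(parts[[1]], parts[[2]]), parts[[3]])
stacked <- Reduce(rbind, parts)
stacked
```

For long lists, do.call(rbind, parts) gives the same result with a single call to rbind().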
A second function
- well, that's better, but it is still boring to do this for every single file
- so let us apply what we have learned: the combination of lapply() and Reduce() can do the work
- using dir() we get all the files contained in a given directory
- then we use lapply() together with our new function read.file()
dir()
- dir() without additional arguments shows all files/directories in the working directory
> dir()
 [1] "data" "function.r" "function.r~"
 [4] "ggp1.pdf" "graphics.r" "linkimage.aux"
 [7] "session2apply.aux" "session2apply.log" "session2apply.nav"
[10] "session2apply.out" "session2apply.pdf" "session2apply.snm"
[13] "session2apply.tex" "session2apply.tex~" "#session2apply.tex#"
[16] "session2apply.toc" "session2apply.vrb" "session2hadley.aux"
[19] "session2hadley.log" "session2hadley.nav" "session2hadley.out"
[22] "session2hadley.pdf" "session2hadley.snm" "session2hadley.tex"
[25] "session2hadley.tex~" "session2hadley.toc" "session2hadley.vrb"
[28] "solutionssession1.r" "solutionssession1.r~" "solutionssession2.r"
[31] "solutionssession2.r~" "#solutionssession2.r#"
dir()
- given a path, dir() will show the content of the respective folder
dir()
- with recursive set to TRUE, R will descend into subdirectories
dir()
- with full.names set to TRUE, R will return the full path
dir()
- with pattern we can specify which files to show (as a regular expression), e.g. all r files
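The arguments described above can be tried out on a small throw-away directory. The directory name dir_demo and the file names are assumptions made up for this sketch:

```r
# Build a small temporary directory tree so the calls have something to list
root <- file.path(tempdir(), "dir_demo")
dir.create(file.path(root, "sub"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, c("a.txt", "b.r")))
file.create(file.path(root, "sub", "d.txt"))

dir(root)                                        # entries directly inside root
all_files  <- dir(root, recursive = TRUE)        # also files in subdirectories
full_paths <- dir(root, full.names = TRUE)       # each entry prefixed with the path
r_files    <- dir(root, pattern = "\\.r$")       # only names ending in ".r"
```

In the pattern the dot is escaped because an unescaped dot matches any character in a regular expression.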
dir() Exercise
- create a variable files containing the names of all text files in the data directory; my editor creates temporary files whose names begin and end with a hash sign, so make sure they are not included in the list
dir() Exercise
> dir("data",full.names = T, recursive = T,pattern = "txt$"
+ )
[1] "data/posttest/post_001.txt" "data/posttest/post_002.txt"
[3] "data/posttest/post_003.txt" "data/posttest/post_004.txt"
[5] "data/posttest/post_005.txt" "data/posttest/post_006.txt"
> files <- dir("data",full.names = T, recursive = T,pattern = "txt$")
Read all files
Now we use lapply() and our function read.file() to read all files in files
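A self-contained sketch of this step. Since the body of read.file() is not reproduced here, the sketch uses a simplified, hypothetical stand-in called read_one() and two made-up files written to a temporary directory:

```r
# Write two tiny tab-separated files to a temporary directory (made-up data)
root <- file.path(tempdir(), "lapply_demo")
dir.create(root, showWarnings = FALSE)
for (s in c("001", "002")) {
  write.table(data.frame(Subject = s, Time = c(10, 20)),
              file.path(root, paste0("pre_", s, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Hypothetical stand-in for read.file(): reads one file, nothing more
read_one <- function(file, skip = 0) {
  read.table(file, skip = skip, sep = "\t", header = TRUE,
             colClasses = c("character", "numeric"))
}

files <- dir(root, full.names = TRUE, pattern = "\\.txt$")

# lapply() calls read_one() once per file; skip = 0 is passed on to every call
df.list <- lapply(files, read_one, skip = 0)
length(df.list)
```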
Reading all files
- the object df.list is a list containing 192 data frames
> sapply(df.list,class)
 [1] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
 [6] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[11] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[16] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[21] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[26] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[31] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[36] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[41] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[46] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[51] "data.frame" "NULL"       "data.frame" "data.frame" "data.frame"
[56] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[61] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[66] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[71] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
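The output above contains one "NULL" entry, which suggests that one file did not yield a data frame. A small sketch of how such entries could be located and dropped, using a made-up three-element list in place of the real df.list:

```r
# Made-up stand-in for df.list with one NULL entry
df.list <- list(data.frame(x = 1), NULL, data.frame(x = 2))

# Positions whose class is "NULL" did not produce a data frame
classes <- sapply(df.list, class)
bad <- which(classes == "NULL")
bad

# Keep only the non-NULL elements
df.clean <- Filter(Negate(is.null), df.list)
length(df.clean)
```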
The Function 2
- in a last step we use Reduce() to combine these 192 data frames
> data <- Reduce(rbind,df.list)
> nrow(data)
[1] 12704
> table(data$Subject)
001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2
       93        91        96        93        95        95        93        96
009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2
       92        94        95        96        96        95        96        94
017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1
       95        94        96        95        95        95        96        94
005_test1 006_test1 007_test1 008_test1 009_test1 010_test1 011_test1 012_test1
       96        95        94        90        96        95        91        96
013_test1 014_test1 015_test1 016_test1 017_test1 018_test1 019_test1 020_test1
       95        96        95        91        96        96        96        96
    001_1     002_1     003_1     004_1     005_1     006_1    CHGU_1    008_1a
       60        59        60        54        60        59        60        60
    009_1     010_1     RMK_1     013_1     014_1     015_1     016_1    IJ2K_1
       60        60        59        59        60        58        59        58
    018_1     019_1     020_1     001_2     002_2     003_2     004_2     005_2
       60        59        60        59        59        57        58        57
    006_2     007_2     008_2     009_2     010_2     011_2     012_2     013_2
       58        58        54        58        58        59        59        56
    014_2     015_2     016_2     017_2     018_2     019_2     020_2     001_3
The Function no 2
- so it is advisable to wrap these steps in a function again
> read.files <- function(filesdir,skip=3,recursive=F,pattern="."){
+     files <- dir(filesdir,
+                  full.names = T,
+                  recursive = recursive,
+                  pattern = pattern)
+     Reduce(rbind,lapply(files,read.file,skip=skip))}
> data <- read.files("data",recursive = T,skip=0,pattern = "\\.txt$")
[1] "read data/posttest/post_001.txt"
[1] "read data/posttest/post_002.txt"
[1] "read data/posttest/post_003.txt"
[1] "read data/posttest/post_004.txt"
[1] "read data/posttest/post_005.txt"
The Subject column
- tabulate the Subject column again. What is the problem?
The Subject column
> table(data$Subject)
001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2
       93        91        96        93        95        95        93        96
009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2
       92        94        95        96        96        95        96        94
017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1
       95        94        96        95        95        95        96        94
- subject and test time are coded in one variable
The Subject column
- we create two new variables using the str_split() function (stringr package)
- because str_split() returns a list containing one vector per input string, we have to use it in combination with sapply()
- then we correct some of the person ids
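A sketch of these steps under some assumptions: it uses base R's strsplit() as a stand-in for str_split() so that it runs without extra packages (both return a list with one character vector per input string), and the subject codes, the wrong id "XYZ" and its correction "099" are made up for illustration:

```r
# Made-up subject codes of the form <person id>_<test id>
subject <- c("001_test2", "002_1", "XYZ_1")

# Splitting returns a list: one character vector per input string
parts <- strsplit(subject, "_", fixed = TRUE)

# sapply() picks the first and the second piece out of every list element
persid <- sapply(parts, `[`, 1)
testid <- sapply(parts, `[`, 2)

# Correct a wrong person id (hypothetical mapping, not from the real data)
persid[persid == "XYZ"] <- "099"
data.frame(subject, persid, testid)
```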
The Subject column Exercises
- there are some more wrong person ids: RMK - 011, IJ2K - 017, GA3K - 004, Kj6K - 006. Correct them!
Merging
- now read in the file subjectdemographics.txt using the appropriate command
- join the demographics with our data data frame (there is a little problem left: compare the persid and Subject columns)
The Subject column Exercises
> persdat <- read.table("data/subjectdemographics.txt",
+                       sep="\t",
+                       header=T)
> persdat$Subject
 [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> unique(data$persid)
 [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011" "012"
[13] "013" "014" "015" "016" "017" "018" "019" "020"
> data$persid <- as.numeric(data$persid)
> unique(data$persid)
 [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> data <- merge(persdat,data,by.x = "Subject",by.y = "persid",all=T)
In merge.data.frame(persdat, data, by.x = "Subject", by.y = "persid", :
  column name ‘Subject’ is duplicated in the result
> head(data)
  Subject Sex Age_PRETEST   Subject Trial Event.Type Code   Time TTime
1       1   f        3.11 001_test2     7   Response    2 103745  2575
2       1   f        3.11 001_test2    12   Response    2 156493  2737
3       1   f        3.11 001_test2    17   Response    2 214772  6630
4       1   f        3.11 001_test2    22   Response    1 262086  5957
5       1   f        3.11 001_test2    27   Response    2 302589   272
6       1   f        3.11 001_test2    32   Response    1 352703  7197
Summary Graphics
Just run the code and try to understand it. We will cover the ggplot graphics in the next session.
[Figure: sesssion2/graph1.png]
Summary Graphics
- so there are problems in the coding of the test id
- we remove the letters at the end using str_replace()
> data$testid <- str_replace(data$testid,"[a-z]$","")
> data$testid <- factor(data$testid,
+                       levels=c("test1","1","2","3","4","5","6","7","8","test2"))
> table(data$Subject,data$testid)
     test1  1  2  3  4  5  6  7  8 test2
  1     95 60 59 60 59 59 60 60 60    93
  2     95 59 59 58 60 60 60 60 60    91
  3     96 60 57 60 60 60 60 59 58    96
  4     94 54 58 60 60 55 53 60 58    93
  5     96 60 57 60 60 60 60 60  0    95
  6     95 59 58 59 58 59 55 54 55    95
  7     94 60 58 60 58 59 60 59 59    93
  8     90 60 54 55 60 60 60 59 60    96
  9     96 60 58 59 57  0  0 58 56    92
  10    95 60 58 58 60 60 58  0  0    94
  11    91 59 59 60 60 57 58 60 60    95
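The cleaning step can be sketched on a small made-up vector. Here base R's sub() stands in for str_replace() (both replace the first match of the pattern), and the example values and shortened level list are assumptions for illustration:

```r
# Made-up test ids; "1a" carries a stray trailing letter
testid <- c("1", "1a", "test1", "2", "test2")

# "[a-z]$" matches a single lower-case letter at the very end of the string,
# so "1a" becomes "1" while "test1" and "test2" stay untouched
testid <- sub("[a-z]$", "", testid)

# An explicit level order keeps the tests in chronological order in tables and plots
testid <- factor(testid, levels = c("test1", "1", "2", "test2"))
table(testid)
```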
Summary Graphics
[Figure: sesssion2/graph2.png]
Summary Graphics
And another one.
[Figure: sesssion2/graph3.png]