Größe: 32164
Kommentar:
|
Größe: 6351
Kommentar:
|
Gelöschter Text ist auf diese Art markiert. | Hinzugefügter Text ist auf diese Art markiert. |
Zeile 2: | Zeile 2: |
== Implicit Loops == A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure. In R this is done by the functions: * lapply() * sapply() * apply() * tapply() |
Every function in R has three important characteristics: * a body (the code inside the function) - body() * arguments (the list of arguments which controls how you can call the function) - formals() * an environment (the “map” of the location of the function’s variables) - environment() You can see all three parts if you type the name of the function without primitives. Exceptions are brackets. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL. == Functions == {{{#!highlight r > chisq.test function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), DNAME <- deparse(substitute(x)) if (is.data.frame(x)) expected = E, residuals = (x - E)/sqrt(E), stdres = (x - > sum function (..., na.rm = FALSE) .Primitive("sum") }}} == Function Arguments == Arguments are matched * first by exact name (perfect matching) * then by prefix matching * and finally by position. By default, R function arguments are lazy, they are only evaluated if they are actually used: {{{#!highlight r > f <- function(x) { f <- function(x) { + 10 + } > f(stop("This is an error!")) [1] 10 > }}} = Implicit Loops = == Introduction == A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure. In R this is done by the functions: * lapply() * sapply() * apply() * tapply() |
Zeile 10: | Zeile 52: |
* The functions lapply and sapply are similar, their first argument can be a list, data frame, matrix or vector, the second argument the function to "apply". The former return a list (hence "l") and the latter tries to simplify the results (hence the "s"). For example: | * The functions lapply and sapply are similar, their first argument can be a list, data frame, matrix or vector, the second argument the function to "apply". The former return a list (hence "l") and the latter tries to simplify the results (hence the "s"). For example: |
Zeile 18: | Zeile 61: |
* apply() this function can be applied to an array. Its argument is the array, the second the dimension/s where we want to apply a function and the third is the function. For example | * apply() this function can be applied to an array. Its argument is the array, the second the dimension/s where we want to apply a function and the third is the function. For example |
Zeile 22: | Zeile 66: |
> apply(x,3,quantile) ## calculate the quantiles | > apply(x,3,quantile) ## calculate the quantiles |
Zeile 25: | Zeile 69: |
* The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors. | * The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors. |
Zeile 27: | Zeile 72: |
Zeile 33: | Zeile 79: |
== == * the class() function shows the class of an object use it in combination with lapply() to get the classes of the columns of the quine data frame * do the same with sapply() what is the difference * try to combine this with what you learned about indexing and create a new data frame quine2 only containing the columns which are factors * calculate the row and column means of the below defined matrix m using the apply function PS: in real life application use the rowMeans() and colMeans() function |
* the class() function shows the class of an object use it in combination with lapply() to get the classes of the columns of the quine data frame * do the same with sapply() what is the difference * try to combine this with what you learned about indexing and create a new data frame quine2 only containing the columns which are factors * calculate the row and column means of the below defined matrix m using the apply function PS: in real life application use the rowMeans() and colMeans() function |
Zeile 39: | Zeile 85: |
m <- matrix(rnorm(100),nrow=10) | m <- matrix(rnorm(100),nrow=10) |
Zeile 41: | Zeile 87: |
* use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines) * sometimes the aggregate() function is more convenient; note the use of {{{ #!latex $\sim$;}}} it is read as 'is dependent on'and it is extensively used in modelling |
* use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines) * sometimes the aggregate() function is more convenient; note the use of {{{ #!latex $\sim$;}}} it is read as 'is dependent on'and it is extensively used in modelling |
Zeile 55: | Zeile 101: |
4 M N 0.00 3.50 8.00 14.71 19.50 69.00 }}} == Functions == Every function in R has three important characteristics: * a body (the code inside the function) - body() * arguments (the list of arguments which controls how you can call the function) - formals() * an environment (the “map” of the location of the function’s variables) - environment() You can see all three parts if you type the name of the function without primitives. Exceptions are brackets. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL. == Functions == {{{#!highlight r > chisq.test function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), DNAME <- deparse(substitute(x)) if (is.data.frame(x)) expected = E, residuals = (x - E)/sqrt(E), stdres = (x - > sum function (..., na.rm = FALSE) .Primitive("sum") }}} == Function Arguments == Arguments are matched * first by exact name (perfect matching) * then by prefix matching * and finally by position. By default, R function arguments are lazy, they are only evaluated if they are actually used: {{{#!highlight r > f <- function(x) { f <- function(x) { + 10 + } > f(stop("This is an error!")) [1] 10 > |
4 M N 0.00 3.50 8.00 14.71 19.50 69.00 |
Zeile 89: | Zeile 104: |
* Write a function to compute the average distance from the mean for some data vector. * Write a function f() which finds the average of the x values after squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)} * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()? * Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0. == Function Exercises (Verzani) == * Write a function to compute the average distance from the mean for some data vector. |
* Write a function to compute the average distance from the mean for some data vector. * Write a function f() which finds the average of the x values after squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing f(1:10) * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x \%\% 2. Use this to write a function iseven(). How would you write isodd()? * Write a function isprime() that checks if a number x is prime by dividing x by all values from 2,...,x-1 then checking to see if there is a remainder of 0. == Function Exercises (Verzani) Solutions == * Write a function to compute the average distance from the mean for some data vector. |
Zeile 99: | Zeile 116: |
+ } | + } |
Zeile 101: | Zeile 118: |
== Function Exercises (Verzani) == * Write a function f() which finds the average of the x values aufter squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)} |
== Function Exercises (Verzani) Solutions == * Write a function f() which finds the average of the x values aufter squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)} |
Zeile 108: | Zeile 126: |
[1] 8.25 | [1] 8.25 |
Zeile 110: | Zeile 128: |
== Function Exercises (Verzani) == * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()? |
== Function Exercises (Verzani) Solutions == * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()? |
Zeile 124: | Zeile 143: |
== Function Exercises (Verzani) == * Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0. |
== Function Exercises (Verzani) Solutions == * Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0. |
Zeile 136: | Zeile 156: |
[1] FALSE | [1] FALSE |
Zeile 138: | Zeile 158: |
== Read in the file == {{{#!highlight r > file <- "../session1/session1data/pre001.txt" > skip <- 3 > tmp <- read.table(file,skip = skip,sep = "\t", + header=T,na.strings = c(" +",""), + fill=T) }}} == Remove empty line == {{{#!highlight r > tmp <- tmp[!is.na(tmp$Subject),] }}} == Remove spaces == Remove unnecessary spaces from character vectors/factors {{{#!highlight r > tmp <- lapply(tmp,function(x) { + if( class(x) %in% c("character","factor") ){ + x <- factor(gsub(" ","",as.character(x))) + return(x)}else{ return(x) }}) > tmp <- as.data.frame(tmp) }}} == Find/Remove breaks == {{{#!highlight r > if(length(pause)>0){ + drei <- which(tmp$Code==3 & !is.na(tmp$Code)) + drei <- drei[drei > pause][1:2] + if(pause + 1 < drei[1]){ + tmp <- tmp[-(pause:drei[2]),] + }} > tmp <- tmp[!(tmp$Event.Type %in% c("Pause","Resume")), ] }}} == Find/Remove first/last rows == {{{#!highlight r > first.pic <- min(which(tmp$Event.Type=="Picture" & + !is.na(tmp$Event.Type) )) - 1 > tmp <- tmp[-(1:first.pic),] > last.pic <- min(which(tmp$Event.Type=="Picture" & + !is.na(tmp$Event.Type) & + tmp$Code=="Fertig!" & + !is.na(tmp$Code))) > tmp <- tmp[-(last.pic:nrow(tmp)),] }}} == Extract Responses == {{{#!highlight r > zeilen <- which(tmp$Event.Type %in% c("Response")) > zeilen <- sort(unique(c(zeilen,zeilen-1))) > zeilen <- zeilen[zeilen>0] > tmp <- tmp[zeilen,] }}} == Extract Responses == {{{#!highlight r > responses <- which(tmp$Code %in% c(1,2)) > events <- responses-1 > tmp$Type <- NA > tmp$Type[responses] <- as.character(tmp$Event.Type[events]) > head(tmp) Subject Trial Event.Type Code Time TTime Uncertainty Duration 6 PRE001 7 Picture RO09.jpg 168954 0 1 10197 7 PRE001 7 Response 2 178963 10009 1 NA 11 PRE001 12 Picture RO20.jpg 230338 0 1 8398 12 PRE001 12 Response 1 238680 8342 1 NA 16 PRE001 17 Picture RS28.jpg 289723 0 1 8198 17 PRE001 17 Response 2 297789 8066 1 NA 6 2 0 next incorrect 7 <NA> 7 NA NA <NA> <NA> NA Picture 11 2 0 next incorrect 12 <NA> 12 NA NA <NA> <NA> NA Picture 16 2 0 next hit 17 <NA> 17 NA NA <NA> <NA> NA Picture }}} == Moving Information == Moving all (necessary) information to the response lines. {{{#!highlight r > tmp$Event.Code <- NA > tmp$Event.Code[responses] <- as.character(tmp$Code[events]) > tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events]) > tmp$Duration[responses] <- as.character(tmp$Duration[events]) > tmp$Uncertainty.1[responses] <- as.character(tmp$Uncertainty.1[events]) > tmp$ReqTime[responses] <- as.character(tmp$ReqTime[events]) > tmp$ReqDur[responses] <- as.character(tmp$ReqDur[events]) > tmp$Pair.Index[responses] <- as.character(tmp$Pair.Index[events]) > tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events]) }}} == Moving Information == {{{#!highlight r > head(tmp) Subject Trial Event.Type Code Time TTime Uncertainty Duration 6 PRE001 7 Picture RO09.jpg 168954 0 1 10197 7 PRE001 7 Response 2 178963 10009 1 10197 11 PRE001 12 Picture RO20.jpg 230338 0 1 8398 12 PRE001 12 Response 1 238680 8342 1 8398 16 PRE001 17 Picture RS28.jpg 289723 0 1 8198 17 PRE001 17 Response 2 297789 8066 1 8198 6 2 0 next incorrect 7 <NA> <NA> 7 2 0 next incorrect 7 Picture RO09.jpg 11 2 0 next incorrect 12 <NA> <NA> 12 2 0 next incorrect 12 Picture RO20.jpg 16 2 0 next hit 17 <NA> <NA> 17 2 0 next hit 17 Picture RS28.jpg }}} == Keep response lines == {{{#!highlight r > tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),] > tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),] > head(tmp) Subject Trial Event.Type Code Time TTime Uncertainty Duration 7 PRE001 7 Response 2 178963 10009 1 10197 12 PRE001 12 Response 1 238680 8342 1 8398 17 PRE001 17 Response 2 297789 8066 1 8198 22 PRE001 22 Response 1 351321 10811 1 10997 27 PRE001 27 Response 2 403607 713 1 800 32 PRE001 32 Response 1 467793 23709 1 23794 7 2 0 next incorrect 7 Picture RO09.jpg 12 2 0 next incorrect 12 Picture RO20.jpg 17 2 0 next hit 17 Picture RS28.jpg 22 2 0 next hit 22 Picture AT26.jpg 27 2 0 next hit 27 Picture RS23.jpg 32 2 0 next hit 32 Picture OF04.jpg }}} == The Function == * it would be a tedious work to every step for all of the files * if we look through the steps the only important thing that we have to change is the file name * so we rather use a canned version of our procedure dependend the file name and the number of lines to skip - we create a function read.file(file): {{{#!highlight r tmp <- read.table(file,skip = skip,sep = "\t", }}} == The Function (continued) == {{{#!highlight r tmp <- tmp[!is.na(tmp$Subject),] tmp <- lapply(tmp,function(x) { x <- factor(gsub(" ","",as.character(x))) tmp <- as.data.frame(tmp) }}} == The Function (continued) == {{{#!highlight r pause <- which(tmp$Event.Type=="Picture" & tmp$Code=="Pause") drei <- which(tmp$Code==3 & !is.na(tmp$Code)) drei <- drei[drei > pause][1:2] tmp <- tmp[-(pause:drei[2]),] tmp <- tmp[!(tmp$Event.Type %in% c("Pause","Resume")), ] }}} == The Function (continued) == {{{#!highlight r tmp <- tmp[-(1:first.pic),] tmp <- tmp[-(last.pic:nrow(tmp)),] }}} == The Function (continued) == {{{#!highlight r zeilen <- which(tmp$Event.Type %in% c("Response")) zeilen <- sort(unique(c(zeilen,zeilen-1))) zeilen <- zeilen[zeilen>0] tmp <- tmp[zeilen,] responses <- which(tmp$Code %in% c(1,2)) events <- responses-1 }}} == The Function (continued) == {{{#!highlight r tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),] tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),] }}} == The Function (continued) == * we can use this function now to read in the file * and get the processed data frame in one step * setting the parameter skip we can read both versions of the file (and should get the same result) == The Function Exercise == * run the function using source() * use the function to read in \texttt{../session1/session1data/pre001.txt} and \texttt{data/pretest/pre\_001.txt} * use some summary functions like table() or summary to check if they contain the same information We will learn about a function to compare data frames more exact soon. == The Function (continued) == {{{#!highlight r > file <- "../session1/session1data/pre001.txt" > pre1 <- read.file(file,skip=3) [1] "read ../session1/session1data/pre001.txt" > file <- "data/pretest/pre_001.txt" > pre1v2 <- read.file(file,skip=0) [1] "read ../session2/data/pretest/pre_001.txt" }}} == rbind() == * rbind() can be used to combine two dataframes (or matrices) in the sense of adding rows, the column names and types must be the same for the two objects {{{#!highlight r > x <- data.frame(id=1:3,score=rnorm(3)) > y <- data.frame(id=13:15,score=rnorm(3)) > rbind(x,y) id score 1 1 0.71121163 2 2 -0.62973249 3 3 1.17737595 4 13 -0.45074940 5 14 -0.01044197 6 15 -1.05217176 }}} == cbind() == * cbind() can be used to combine two dataframes (or matrices) in the sense of adding columns, the number of rows must be the same for the two objects {{{#!highlight r > cbind(x,y) id score1 score2 score3 1 1 0.11440705 0.14536778 -1.1773241 2 2 -1.62862651 0.02020604 0.5686415 3 3 0.05335811 0.25462270 0.8844987 4 4 -0.19931734 0.15625511 0.9287316 5 5 -1.15217836 -1.79804503 -0.7550234 }}} * it is not recommended to use cbind() to combining data frames == merge() == * merge() is the command of choice for merging or joining data frames * it is the equivalent of join in sql * there are four cases * inner join * left outer join * right outer join * full outer join {{{#!highlight r > (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3))) id day1 1 A 3 2 B 4 3 C 5 > (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4))) id day2 1 A 7 2 C 10 3 E 3 4 F 6 }}} == inner join == * inner join means: keep only the cases present in both of the data frames {{{#!highlight r > merge(d1,d2) id day1 day2 1 A 3 7 2 C 5 10 }}} == left outer join == * left outer join means: keep all cases of the left data frame no matter if they are present in the right data frame (all.x=T) {{{#!highlight r > merge(d1,d2,all.x = T) id day1 day2 1 A 3 7 2 B 4 NA 3 C 5 10 }}} == right outer join == * right outer join means: keep all cases of the right data frame no matter if they are present in the left data frame (all.y=T) {{{#!highlight r > merge(d1,d2,all.y = T) id day1 day2 1 A 3 7 2 C 5 10 3 E NA 3 4 F NA 6 }}} == full outer join == * full outer join means: keep all cases of both data frames (all=T) {{{#!highlight r > merge(d1,d2,all = T) id day1 day2 1 A 3 7 2 B 4 NA 3 C 5 10 4 E NA 3 5 F NA 6 }}} == merge() == * if not stated otherwise R uses the intersect of the names of both data frames, in our case only \textit{id} * you can specify these columns directly by \texttt{by=c("colname1","colname2")} if the columns are named identical or * using\\ \texttt{by.x=c("colname1.x","colname2.x"), == merge() Exercise == * now read in the file personendaten.txt using the appropriate command * join the demographics with our pre1 data frame (even though it does not make sense now) == merge() Exercise == {{{#!highlight r > persdat <- read.table("../session1/session1data/personendaten.txt", + sep="\t", + header=T) > pre1 <- merge(persdat,pre1,all.y = T) > head(pre1) Subject Sex Age_PRETEST Trial Event.Type Code Time TTime Uncertainty 1 PRE001 f 3.11 7 Response 2 178963 10009 1 2 PRE001 f 3.11 12 Response 1 238680 8342 1 3 PRE001 f 3.11 17 Response 2 297789 8066 1 4 PRE001 f 3.11 22 Response 1 351321 10811 1 5 PRE001 f 3.11 27 Response 2 403607 713 1 6 PRE001 f 3.11 32 Response 1 467793 23709 1 Duration Uncertainty.1 ReqTime ReqDur Stim.Type Pair.Index Type Event.Code 1 10197 2 0 next incorrect 7 Picture RO09.jpg 2 8398 2 0 next incorrect 12 Picture RO20.jpg 3 8198 2 0 next hit 17 Picture RS28.jpg 4 10997 2 0 next hit 22 Picture AT26.jpg 5 800 2 0 next hit 27 Picture RS23.jpg 6 23794 2 0 next hit 32 Picture OF04.jpg }}} == Reduce() == * is a higher order function (functional) * Reduce() uses a binary function (like rbind() or merge()) to combine successively the elements of a given list * it can be used if you have not only two but many data frames == Reduce() == * first we make up 4 artifical data frames == Reduce() == {{{#!highlight r > (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3))) id day1 1 A 3 2 B 1 3 C 7 > (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4))) id day2 1 A 8 2 C 2 3 E 5 4 F 3 > (d3 <- data.frame(id=LETTERS[c(2,4:6)],day3=sample(10,4))) id day3 1 B 8 2 D 3 3 E 4 4 F 10 > (d4 <- data.frame(id=LETTERS[c(1:5)],day4=sample(10,5))) id day4 1 A 2 2 B 7 3 C 8 4 D 9 5 E 1 }}} == Reduce() == * now we use Reduce() in combination with merge() {{{#!highlight r > Reduce(merge,list(d1,d2,d3,d4)) [1] id day1 day2 day3 day4 }}} * and what we get is an empty data frame * well this isn't exactly what we wanted, so why? * it is because the default behavior of merge() is set all=F, so we get only complete lines which is in this case - none * so we have to define a wrapper function which only change this argument to all=T == Reduce() == * now we use Reduce() in combination with merge() {{{#!highlight r > Reduce(function(x,y) { merge(x,y, all=T) }, + list(d1,d2,d3,d4)) id day1 day2 day3 day4 1 A 3 8 NA 2 2 B 1 NA 8 7 3 C 7 2 NA 8 4 E NA 5 4 1 5 F NA 3 10 NA 6 D NA NA 3 9 }}} * which is exactly what we want == Reduce() == * a second example in combination with rbind() {{{#!highlight r > d4$day <- names(d4)[2] > names(d4)[2] <- "score" > Reduce(function(x,y) { y$day <- names(y)[2] + names(y)[2] <- "score" + rbind(x,y) } , + list(d1,d2,d3), init = d4) id score day 1 A 2 day4 2 B 7 day4 3 C 8 day4 4 D 9 day4 }}} * which is exactly what we want == A second function == * well that's better, but it is still boring to do this for every single file * so see what we have learned: the combination of lapply() and Reduce() can do the work * using dir{} we get all the files contained in a given directory * then we use lapply() together with our new function read.file() == dir() == * dir() without additional argument shows all files/directories in the working directory {{{#!highlight r > dir() [1] "data" "function.r" "function.r~" [4] "ggp1.pdf" "graphics.r" "linkimage.aux" [7] "session2apply.aux" "session2apply.log" "session2apply.nav" [10] "session2apply.out" "session2apply.pdf" "session2apply.snm" [13] "session2apply.tex" "session2apply.tex~" "#session2apply.tex#" [16] "session2apply.toc" "session2apply.vrb" "session2hadley.aux" [19] "session2hadley.log" "session2hadley.nav" "session2hadley.out" [22] "session2hadley.pdf" "session2hadley.snm" "session2hadley.tex" [25] "session2hadley.tex~" "session2hadley.toc" "session2hadley.vrb" [28] "solutionssession1.r" "solutionssession1.r~" "solutionssession2.r" [31] "solutionssession2.r~" "#solutionssession2.r#" }}} == dir() == * given a path dir() will show the content of resp folder {{{#!highlight r > dir("data") [1] "posttest" "pretest" "training_1" "training_2" "training_3" [6] "training_4" "training_5" "training_6" "training_7" "training_8" }}} == dir() == * setting recursive to TRUE R will recurse into directories recursively through {{{#!highlight r > dir("data",recursive = T) [1] "posttest/post_001.txt" "posttest/post_002.txt" [3] "posttest/post_003.txt" "posttest/post_004.txt" [5] "posttest/post_005.txt" "posttest/post_006.txt" [7] "posttest/post_007.txt" "posttest/post_008.txt" }}} == dir() == * setting full.names to TRUE R will give the full path {{{#!highlight r > dir("data",recursive = T, full.names = T) [1] "data/posttest/post_001.txt" "data/posttest/post_002.txt" [3] "data/posttest/post_003.txt" "data/posttest/post_004.txt" [5] "data/posttest/post_005.txt" "data/posttest/post_006.txt" }}} == dir() == * with pattern we can specify which files to show (regexpr), e.g. all r files {{{#!highlight r > dir(pattern = "\\.r$") [1] "function.r" "graphics.r" "solutionssession1.r" [4] "solutionssession2.r" }}} == dir() Exercise == * create a variable files containing the names of all text files in the data directory, my editor creates temporary files beginning and ending by a hash key, make sure they are not contained in the list == dir() Exercise == {{{#!highlight r > dir("data",full.names = T, recursive = T,pattern = "txt$" + ) [1] "data/posttest/post_001.txt" "data/posttest/post_002.txt" [3] "data/posttest/post_003.txt" "data/posttest/post_004.txt" [5] "data/posttest/post_005.txt" "data/posttest/post_006.txt" > files <- dir("data",full.names = T, recursive = T,pattern = "txt$") }}} == Read all files == Now we use lapply() and our function read.file() to read all files in files {{{#!highlight r > df.list <- lapply(files,read.file,skip=0) [1] "read data/posttest/post_001.txt" [1] "read data/posttest/post_002.txt" [1] "read data/posttest/post_003.txt" [1] "read data/posttest/post_004.txt" [1] "read data/posttest/post_005.txt" }}} == Reading all files == * the object df.list is a list containing 192 data frames {{{#!highlight r > sapply(df.list,class) [1] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [6] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [11] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [16] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [21] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [26] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [31] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [36] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [41] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [46] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [51] "data.frame" "NULL" "data.frame" "data.frame" "data.frame" [56] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [61] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [66] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" [71] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame" }}} == The Function 2 == * in a last step we use Reduce{} to combine these 192 data frames {{{#!highlight r > data <- Reduce(rbind,df.list) > nrow(data) [1] 12704 > table(data$Subject) 001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2 93 91 96 93 95 95 93 96 009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2 92 94 95 96 96 95 96 94 017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1 95 94 96 95 95 95 96 94 005_test1 006_test1 007_test1 008_test1 009_test1 010_test1 011_test1 012_test1 96 95 94 90 96 95 91 96 013_test1 014_test1 015_test1 016_test1 017_test1 018_test1 019_test1 020_test1 95 96 95 91 96 96 96 96 001_1 002_1 003_1 004_1 005_1 006_1 CHGU_1 008_1a 60 59 60 54 60 59 60 60 009_1 010_1 RMK_1 013_1 014_1 015_1 016_1 IJ2K_1 60 60 59 59 60 58 59 58 018_1 019_1 020_1 001_2 002_2 003_2 004_2 005_2 60 59 60 59 59 57 58 57 006_2 007_2 008_2 009_2 010_2 011_2 012_2 013_2 58 58 54 58 58 59 59 56 014_2 015_2 016_2 017_2 018_2 019_2 020_2 001_3 }}} == The Function no 2 == * so it is recommended to build again a function out of this {{{#!highlight r > read.files <- function(filesdir,skip=3,recursive=F,pattern="."){ + files <- dir(filesdir, + full.names = T, + recursive = recursive, + pattern = pattern) + Reduce(rbind,lapply(files,read.file,skip=skip))} > data <- read.files("data",recursive = T,skip=0,pattern = "\\.txt$") [1] "read data/posttest/post_001.txt" [1] "read data/posttest/post_002.txt" [1] "read data/posttest/post_003.txt" [1] "read data/posttest/post_004.txt" [1] "read data/posttest/post_005.txt" }}} == The Subject column == * table the Subject column again. What is the problem? == The Subject column == {{{#!highlight r > table(data$Subject) 001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2 93 91 96 93 95 95 93 96 009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2 92 94 95 96 96 95 96 94 017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1 95 94 96 95 95 95 96 94 }}} * subject and time coded in one variable == The Subject column == * we create two new variables using the str\_split() function (stringr package) * becaus str\_split() has a list containing a vector as result we have to use it in combination with sapply() * then correct some of the person ids {{{#!highlight r > data$persid <- sapply(data$Subject,function(x) + str_split(x,pattern = "_")[[1]][1]) > data$testid <- sapply(data$Subject,function(x) + str_split(x,pattern = "_")[[1]][2]) > data$persid[data$persid=="CHGU"] <- "007" }}} == The Subject column Exercises == * there are some more wrong person ids: RMK - 011, IJ2K - 017, GA3K - 004, Kj6K - 006. Correct them! == The Subject column Exercises == {{{#!highlight r > data$persid[data$persid=="RMK"] <- "011" > data$persid[data$persid=="IJ2K"] <- "017" > data$persid[data$persid=="GA3K"] <- "004" > data$persid[data$persid=="Kj6K"] <- "006" }}} == Merging == * now read in the file subjectsdemographics.txt using the appropriate command * join the demographics with our data data frame (there is a little problem left - compare the persid and Subject columns) == The Subject column Exercises == {{{#!highlight r > persdat <- read.table("data/subjectdemographics.txt", + sep="\t", + header=T) > persdat$Subject [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 > unique(data$persid) [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011" "012" [13] "013" "014" "015" "016" "017" "018" "019" "020" > data$persid <- as.numeric(data$persid) > unique(data$persid) [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 > data <- merge(persdat,data,by.x = "Subject",by.y = "persid",all=T) In merge.data.frame(persdat, data, by.x = "Subject", by.y = "persid", : column name ‘Subject’ is duplicated in the result > head(data) Subject Sex Age_PRETEST Subject Trial Event.Type Code Time TTime 1 1 f 3.11 001_test2 7 Response 2 103745 2575 2 1 f 3.11 001_test2 12 Response 2 156493 2737 3 1 f 3.11 001_test2 17 Response 2 214772 6630 4 1 f 3.11 001_test2 22 Response 1 262086 5957 5 1 f 3.11 001_test2 27 Response 2 302589 272 6 1 f 3.11 001_test2 32 Response 1 352703 7197 }}} == Summary Graphics == Just run the code and try to understand it. We will cover the ggplot graphics in the next session. {{{#!highlight r > ggplot(data,aes(x=factor(Subject),fill=..count..)) + + geom_bar() + + facet_wrap(~testid) }}} <img alt='sesssion2/graph1.png' src='-1' /> == Summary Graphics == * so there are problems in coding of the test id * we remove the letters at the end using str\_replace() {{{#!highlight r > data$testid <- str_replace(data$testid,"[a-z]$","") > data$testid <- factor(data$testid, + levels=c("test1","1","2","3","4","5","6","7","8","test2")) > table(data$Subject,data$testid) test1 1 2 3 4 5 6 7 8 test2 1 95 60 59 60 59 59 60 60 60 93 2 95 59 59 58 60 60 60 60 60 91 3 96 60 57 60 60 60 60 59 58 96 4 94 54 58 60 60 55 53 60 58 93 5 96 60 57 60 60 60 60 60 0 95 6 95 59 58 59 58 59 55 54 55 95 7 94 60 58 60 58 59 60 59 59 93 8 90 60 54 55 60 60 60 59 60 96 9 96 60 58 59 57 0 0 58 56 92 10 95 60 58 58 60 60 58 0 0 94 11 91 59 59 60 60 57 58 60 60 95 }}} == Summary Graphics == {{{#!highlight r > ggplot(data,aes(x=factor(Subject),fill=..count..)) + + geom_bar() + + facet_wrap(~testid) }}} <img alt='sesssion2/graph2.png' src='-1' /> == Summary Graphics == And another one. {{{#!highlight r > ggplot(data,aes(x=testid,fill=Stim.Type)) + + geom_bar(position=position_fill()) + + facet_wrap(~Subject) }}} <img alt='sesssion2/graph3.png' src='-1' /> |
Introduction
Every function in R has three important characteristics:
- a body (the code inside the function) - body()
- arguments (the list of arguments which controls how you can call the function) - formals()
- an environment (the “map” of the location of the function’s variables) - environment()
You can see all three parts if you type the name of the function without primitives. Exceptions are brackets. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.
Functions
Function Arguments
Arguments are matched
- first by exact name (perfect matching)
- then by prefix matching
- and finally by position.
By default, R function arguments are lazy, they are only evaluated if they are actually used:
Implicit Loops
Introduction
A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure. In R this is done by the functions:
- lapply()
- sapply()
- apply()
- tapply()
lapply()
- The functions lapply and sapply are similar, their first argument can be a list, data frame, matrix or vector, the second argument the function to "apply". The former return a list (hence "l") and the latter tries to simplify the results (hence the "s"). For example:
apply()
- apply() this function can be applied to an array. Its argument is the array, the second the dimension/s where we want to apply a function and the third is the function. For example
tapply()
- The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.
For example in the quine data frame, we can summarize Days classify by Eth and Lrn as follows:
- the class() function shows the class of an object use it in combination with lapply() to get the classes of the columns of the quine data frame
- do the same with sapply() what is the difference
- try to combine this with what you learned about indexing and create a new data frame quine2 only containing the columns which are factors
- calculate the row and column means of the below defined matrix m using the apply function PS: in real life application use the rowMeans() and colMeans() function
1 m <- matrix(rnorm(100),nrow=10)
use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines) * sometimes the aggregate() function is more convenient; note the use of #!latex $\sim$; it is read as 'is dependent on'and it is extensively used in modelling
1 > aggregate(Days ~ Sex + Eth, data=quine,mean)
2 Sex Eth Days
3 1 F A 20.92105
4 2 M A 21.61290
5 3 F N 10.07143
6 4 M N 14.71429
7 > aggregate(Days ~ Sex + Eth, data=quine,summary)
8 Sex Eth Days.Min. Days.1st Qu. Days.Median Days.Mean Days.3rd Qu. Days.Max.
9 1 F A 0.00 5.25 13.50 20.92 30.25 81.00
10 2 M A 2.00 9.50 16.00 21.61 33.00 57.00
11 3 F N 0.00 5.00 7.00 10.07 14.00 37.00
12 4 M N 0.00 3.50 8.00 14.71 19.50 69.00
Function Exercises (Verzani)
- Write a function to compute the average distance from the mean for some data vector.
- Write a function f() which finds the average of the x values after squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing f(1:10)
- An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x \%\% 2. Use this to write a function iseven(). How would you write isodd()?
- Write a function isprime() that checks if a number x is prime by dividing x by all values from 2,...,x-1 then checking to see if there is a remainder of 0.
Function Exercises (Verzani) Solutions
- Write a function to compute the average distance from the mean for some data vector.
Function Exercises (Verzani) Solutions
- Write a function f() which finds the average of the x values aufter squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)}
Function Exercises (Verzani) Solutions
- An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()?
Function Exercises (Verzani) Solutions
- Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0.