welcome: please sign in
location: Änderungen von "RstatisTik/RstatisTikPortal/RcourSe/CourseOutline/FunctionsInR/ApplyR"
Unterschiede zwischen den Revisionen 1 und 5 (über 4 Versionen hinweg)
Revision 1 vom 2015-05-01 08:23:34
Größe: 32159
Kommentar:
Revision 5 vom 2015-05-01 10:46:07
Größe: 6377
Kommentar:
Gelöschter Text ist auf diese Art markiert. Hinzugefügter Text ist auf diese Art markiert.
Zeile 2: Zeile 2:
== Implicit Loops ==
A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure.
In R this is done by the functions:
   * lapply()
   * sapply()
   * apply()
   * tapply()
Every function in R has three important characteristics:

 * a body (the code inside the function) - body()
 * arguments (the list of arguments which controls how you can call the function) - formals()
 * an environment (the “map” of the location of the function’s variables) - environment()

You can see all three parts if you type the name of the function without brackets. Exceptions are primitives. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.

== Functions ==
{{{#!highlight r
> chisq.test
function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)),
DNAME <- deparse(substitute(x))
if (is.data.frame(x))
expected = E, residuals = (x - E)/sqrt(E), stdres = (x - ...

> sum
function (..., na.rm = FALSE) .Primitive("sum")
}}}
== Function Arguments ==
Arguments are matched

 * first by exact name (perfect matching)
 * then by prefix matching
 * and finally by position.

By default, R function arguments are lazy, they are only evaluated if they are actually used:

{{{#!highlight r
> f <- function(x) {
f <- function(x) {
+ 10
+ }
> f(stop("This is an error!"))
[1] 10
>
}}}



= Implicit Loops =
== Introduction ==

A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure. In R this is mainly done by the higher order functions:

 * lapply()
 * sapply()
 * apply()
 * tapply()
Zeile 10: Zeile 53:
   * The functions lapply and sapply are similar, their first argument can be a list, data frame, matrix or vector, the second argument the function to "apply". The former return a list (hence "l") and the latter tries to simplify the results (hence the "s"). For example:  * The functions lapply and sapply are similar, their first argument can be a list, data frame, matrix or vector, the second argument the function to "apply". The former return a list (hence "l") and the latter tries to simplify the results (hence the "s"). For example:
Zeile 18: Zeile 62:
   * apply() this function can be applied to an array. Its argument is the array, the second the dimension/s where we want to apply a function and the third is the function. For example  * apply() this function can be applied to an array. Its argument is the array, the second the dimension/s where we want to apply a function and the third is the function. For example
Zeile 22: Zeile 67:
> apply(x,3,quantile) ## calculate the quantiles  > apply(x,3,quantile) ## calculate the quantiles
Zeile 25: Zeile 70:
   * The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.  * The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.
Zeile 27: Zeile 73:
Zeile 33: Zeile 80:
== ==
* the class() function shows the class of an object use it in combination with lapply() to get the classes of the columns of the quine data frame
* do the same with sapply() what is the difference
* try to combine this with what you learned about indexing and create a new data frame quine2 only containing the columns which are factors
* calculate the row and column means of the below defined matrix m using the apply function PS: in real life application use the rowMeans() and colMeans() function 
 * the class() function shows the class of an object use it in combination with lapply() to get the classes of the columns of the quine data frame 
 * do the same with sapply() what is the difference 
 * try to combine this with what you learned about indexing and create a new data frame quine2 only containing the columns which are factors 
 * calculate the row and column means of the below defined matrix m using the apply function PS: in real life application use the rowMeans() and colMeans() function
Zeile 39: Zeile 86:
m <- matrix(rnorm(100),nrow=10)   m <- matrix(rnorm(100),nrow=10)
Zeile 41: Zeile 88:
* use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines)
* sometimes the aggregate() function is more convenient; note the use of {{{#!latex \sim$; it is read as 'is dependent on'and it is extensively used in modelling
 * use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines) * sometimes the aggregate() function is more convenient; note the use of {{{ #!latex $\sim$;}}} it is read as 'is dependent on'and it is extensively used in modelling
Zeile 55: Zeile 102:
4 M N 0.00 3.50 8.00 14.71 19.50 69.00
}}}
== Functions ==
Every function in R has three important characteristics:
   * a body (the code inside the function) - body()
   * arguments (the list of arguments which controls how you can call the function) - formals()
   * an environment (the “map” of the location of the function’s variables) - environment()
You can see all three parts if you type the name of the function without primitives. Exceptions are brackets. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.
== Functions ==
{{{#!highlight r
> chisq.test
function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)),
DNAME <- deparse(substitute(x))
if (is.data.frame(x))
expected = E, residuals = (x - E)/sqrt(E), stdres = (x -
> sum
function (..., na.rm = FALSE) .Primitive("sum")
}}}
== Function Arguments ==
Arguments are matched
   * first by exact name (perfect matching)
   * then by prefix matching
   * and finally by position.
By default, R function arguments are lazy, they are only evaluated if they are actually used:
{{{#!highlight r
> f <- function(x) {
f <- function(x) {
+ 10
+ }
> f(stop("This is an error!"))
[1] 10
>
4 M N 0.00 3.50 8.00 14.71 19.50 69.00
Zeile 89: Zeile 105:
* Write a function to compute the average distance from the mean for some data vector.
* Write a function f() which finds the average of the x values after squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)}
* An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()?
* Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0. 
== Function Exercises (Verzani) ==
   * Write a function to compute the average distance from the mean for some data vector.
 * Write a function to compute the average distance from the mean for some data vector. 
 * Write a function f() which finds the average of the x values after squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing f(1:10) 
 * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x \%\% 2. Use this to write a function iseven(). How would you write isodd()? 
 * Write a function isprime() that checks if a number x is prime by dividing x by all values from 2,...,x-1 then checking to see if there is a remainder of 0.

== Function Exercises (Verzani) Solutions ==
 * Write a function to compute the average distance from the mean for some data vector.
Zeile 99: Zeile 117:
+ }   + }
Zeile 101: Zeile 119:
== Function Exercises (Verzani) ==
   * Write a function f() which finds the average of the x values aufter squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)}
== Function Exercises (Verzani) Solutions ==
 * Write a function f() which finds the average of the x values aufter squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)}
Zeile 108: Zeile 127:
[1] 8.25   [1] 8.25
Zeile 110: Zeile 129:
== Function Exercises (Verzani) ==
   * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()?
== Function Exercises (Verzani) Solutions ==
 * An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()?
Zeile 124: Zeile 144:
== Function Exercises (Verzani) ==
   * Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0. 
== Function Exercises (Verzani) Solutions ==
 * Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0.
Zeile 136: Zeile 157:
[1] FALSE   [1] FALSE
Zeile 138: Zeile 159:
== Read in the file ==
{{{#!highlight r
> file <- "../session1/session1data/pre001.txt"
> skip <- 3
> tmp <- read.table(file,skip = skip,sep = "\t",
+ header=T,na.strings = c(" +",""),
+ fill=T)
}}}
== Remove empty line ==
{{{#!highlight r
> tmp <- tmp[!is.na(tmp$Subject),]
}}}
== Remove spaces ==
Remove unnecessary spaces from character vectors/factors
{{{#!highlight r
> tmp <- lapply(tmp,function(x) {
+ if( class(x) %in% c("character","factor") ){
+ x <- factor(gsub(" ","",as.character(x)))
+ return(x)}else{ return(x) }})
> tmp <- as.data.frame(tmp)
}}}
== Find/Remove breaks ==
{{{#!highlight r
> if(length(pause)>0){
+ drei <- which(tmp$Code==3 & !is.na(tmp$Code))
+ drei <- drei[drei > pause][1:2]
+ if(pause + 1 < drei[1]){
+ tmp <- tmp[-(pause:drei[2]),]
+ }}
> tmp <- tmp[!(tmp$Event.Type %in% c("Pause","Resume")), ]
}}}
== Find/Remove first/last rows ==
{{{#!highlight r
> first.pic <- min(which(tmp$Event.Type=="Picture" &
+ !is.na(tmp$Event.Type) )) - 1
> tmp <- tmp[-(1:first.pic),]
> last.pic <- min(which(tmp$Event.Type=="Picture" &
+ !is.na(tmp$Event.Type) &
+ tmp$Code=="Fertig!" &
+ !is.na(tmp$Code)))
> tmp <- tmp[-(last.pic:nrow(tmp)),]
}}}
== Extract Responses ==
{{{#!highlight r
> zeilen <- which(tmp$Event.Type %in% c("Response"))
> zeilen <- sort(unique(c(zeilen,zeilen-1)))
> zeilen <- zeilen[zeilen>0]
> tmp <- tmp[zeilen,]
}}}
== Extract Responses ==
{{{#!highlight r
> responses <- which(tmp$Code %in% c(1,2))
> events <- responses-1
> tmp$Type <- NA
> tmp$Type[responses] <- as.character(tmp$Event.Type[events])
> head(tmp)
Subject Trial Event.Type Code Time TTime Uncertainty Duration
6 PRE001 7 Picture RO09.jpg 168954 0 1 10197
7 PRE001 7 Response 2 178963 10009 1 NA
11 PRE001 12 Picture RO20.jpg 230338 0 1 8398
12 PRE001 12 Response 1 238680 8342 1 NA
16 PRE001 17 Picture RS28.jpg 289723 0 1 8198
17 PRE001 17 Response 2 297789 8066 1 NA
6 2 0 next incorrect 7 <NA>
7 NA NA <NA> <NA> NA Picture
11 2 0 next incorrect 12 <NA>
12 NA NA <NA> <NA> NA Picture
16 2 0 next hit 17 <NA>
17 NA NA <NA> <NA> NA Picture
}}}
== Moving Information ==
Moving all (necessary) information to the response lines.
{{{#!highlight r
> tmp$Event.Code <- NA
> tmp$Event.Code[responses] <- as.character(tmp$Code[events])
> tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events])
> tmp$Duration[responses] <- as.character(tmp$Duration[events])
> tmp$Uncertainty.1[responses] <- as.character(tmp$Uncertainty.1[events])
> tmp$ReqTime[responses] <- as.character(tmp$ReqTime[events])
> tmp$ReqDur[responses] <- as.character(tmp$ReqDur[events])
> tmp$Pair.Index[responses] <- as.character(tmp$Pair.Index[events])
> tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events])
}}}
== Moving Information ==
{{{#!highlight r
> head(tmp)
Subject Trial Event.Type Code Time TTime Uncertainty Duration
6 PRE001 7 Picture RO09.jpg 168954 0 1 10197
7 PRE001 7 Response 2 178963 10009 1 10197
11 PRE001 12 Picture RO20.jpg 230338 0 1 8398
12 PRE001 12 Response 1 238680 8342 1 8398
16 PRE001 17 Picture RS28.jpg 289723 0 1 8198
17 PRE001 17 Response 2 297789 8066 1 8198
6 2 0 next incorrect 7 <NA> <NA>
7 2 0 next incorrect 7 Picture RO09.jpg
11 2 0 next incorrect 12 <NA> <NA>
12 2 0 next incorrect 12 Picture RO20.jpg
16 2 0 next hit 17 <NA> <NA>
17 2 0 next hit 17 Picture RS28.jpg
}}}
== Keep response lines ==
{{{#!highlight r
> tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),]
> tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),]
> head(tmp)
Subject Trial Event.Type Code Time TTime Uncertainty Duration
7 PRE001 7 Response 2 178963 10009 1 10197
12 PRE001 12 Response 1 238680 8342 1 8398
17 PRE001 17 Response 2 297789 8066 1 8198
22 PRE001 22 Response 1 351321 10811 1 10997
27 PRE001 27 Response 2 403607 713 1 800
32 PRE001 32 Response 1 467793 23709 1 23794
7 2 0 next incorrect 7 Picture RO09.jpg
12 2 0 next incorrect 12 Picture RO20.jpg
17 2 0 next hit 17 Picture RS28.jpg
22 2 0 next hit 22 Picture AT26.jpg
27 2 0 next hit 27 Picture RS23.jpg
32 2 0 next hit 32 Picture OF04.jpg
}}}
== The Function ==
   * it would be a tedious work to every step for all of the files
   * if we look through the steps the only important thing that we have to change is the file name
   * so we rather use a canned version of our procedure dependend the file name and the number of lines to skip - we create a function read.file(file):
{{{#!highlight r
tmp <- read.table(file,skip = skip,sep = "\t",
}}}
== The Function (continued) ==
{{{#!highlight r
tmp <- tmp[!is.na(tmp$Subject),]
tmp <- lapply(tmp,function(x) {
x <- factor(gsub(" ","",as.character(x)))
tmp <- as.data.frame(tmp)
}}}
== The Function (continued) ==
{{{#!highlight r
pause <- which(tmp$Event.Type=="Picture" & tmp$Code=="Pause")
drei <- which(tmp$Code==3 & !is.na(tmp$Code))
drei <- drei[drei > pause][1:2]
tmp <- tmp[-(pause:drei[2]),]
tmp <- tmp[!(tmp$Event.Type %in% c("Pause","Resume")), ]
}}}
== The Function (continued) ==
{{{#!highlight r
tmp <- tmp[-(1:first.pic),]
tmp <- tmp[-(last.pic:nrow(tmp)),]
}}}
== The Function (continued) ==
{{{#!highlight r
zeilen <- which(tmp$Event.Type %in% c("Response"))
zeilen <- sort(unique(c(zeilen,zeilen-1)))
zeilen <- zeilen[zeilen>0]
tmp <- tmp[zeilen,]
responses <- which(tmp$Code %in% c(1,2))
events <- responses-1
}}}
== The Function (continued) ==
{{{#!highlight r
tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),]
tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),]
}}}
== The Function (continued) ==
   * we can use this function now to read in the file
   * and get the processed data frame in one step
   * setting the parameter skip we can read both versions of the file (and should get the same result)
== The Function Exercise ==
* run the function using source()
* use the function to read in \texttt{../session1/session1data/pre001.txt} and \texttt{data/pretest/pre\_001.txt}
* use some summary functions like table() or summary to check if they contain the same information
We will learn about a function to compare data frames more exact soon.
== The Function (continued) ==
{{{#!highlight r
> file <- "../session1/session1data/pre001.txt"
> pre1 <- read.file(file,skip=3)
[1] "read ../session1/session1data/pre001.txt"
> file <- "data/pretest/pre_001.txt"
> pre1v2 <- read.file(file,skip=0)
[1] "read ../session2/data/pretest/pre_001.txt"
}}}
== rbind() ==
   * rbind() can be used to combine two dataframes (or matrices) in the sense of adding rows, the column names and types must be the same for the two objects
{{{#!highlight r
> x <- data.frame(id=1:3,score=rnorm(3))
> y <- data.frame(id=13:15,score=rnorm(3))
> rbind(x,y)
id score
1 1 0.71121163
2 2 -0.62973249
3 3 1.17737595
4 13 -0.45074940
5 14 -0.01044197
6 15 -1.05217176
}}}
== cbind() ==
   * cbind() can be used to combine two dataframes (or matrices) in the sense of adding columns, the number of rows must be the same for the two objects
{{{#!highlight r
> cbind(x,y)
id score1 score2 score3
1 1 0.11440705 0.14536778 -1.1773241
2 2 -1.62862651 0.02020604 0.5686415
3 3 0.05335811 0.25462270 0.8844987
4 4 -0.19931734 0.15625511 0.9287316
5 5 -1.15217836 -1.79804503 -0.7550234
}}}
   * it is not recommended to use cbind() to combining data frames
== merge() ==
   * merge() is the command of choice for merging or joining data frames
   * it is the equivalent of join in sql
   * there are four cases
      * inner join
      * left outer join
      * right outer join
      * full outer join
{{{#!highlight r
> (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3)))
id day1
1 A 3
2 B 4
3 C 5
> (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4)))
id day2
1 A 7
2 C 10
3 E 3
4 F 6
}}}
== inner join ==
   * inner join means: keep only the cases present in both of the data frames
{{{#!highlight r
> merge(d1,d2)
id day1 day2
1 A 3 7
2 C 5 10
}}}
== left outer join ==
   * left outer join means: keep all cases of the left data frame no matter if they are present in the right data frame (all.x=T)
{{{#!highlight r
> merge(d1,d2,all.x = T)
id day1 day2
1 A 3 7
2 B 4 NA
3 C 5 10
}}}
== right outer join ==
   * right outer join means: keep all cases of the right data frame no matter if they are present in the left data frame (all.y=T)
{{{#!highlight r
> merge(d1,d2,all.y = T)
id day1 day2
1 A 3 7
2 C 5 10
3 E NA 3
4 F NA 6
}}}
== full outer join ==
   * full outer join means: keep all cases of both data frames (all=T)
{{{#!highlight r
> merge(d1,d2,all = T)
id day1 day2
1 A 3 7
2 B 4 NA
3 C 5 10
4 E NA 3
5 F NA 6
}}}
== merge() ==
   * if not stated otherwise R uses the intersect of the names of both data frames, in our case only \textit{id}
   * you can specify these columns directly by \texttt{by=c("colname1","colname2")} if the columns are named identical or
   * using\\ \texttt{by.x=c("colname1.x","colname2.x"),
== merge() Exercise ==
   * now read in the file personendaten.txt using the appropriate command
   * join the demographics with our pre1 data frame (even though it does not make sense now)
== merge() Exercise ==
{{{#!highlight r
> persdat <- read.table("../session1/session1data/personendaten.txt",
+ sep="\t",
+ header=T)
> pre1 <- merge(persdat,pre1,all.y = T)
> head(pre1)
Subject Sex Age_PRETEST Trial Event.Type Code Time TTime Uncertainty
1 PRE001 f 3.11 7 Response 2 178963 10009 1
2 PRE001 f 3.11 12 Response 1 238680 8342 1
3 PRE001 f 3.11 17 Response 2 297789 8066 1
4 PRE001 f 3.11 22 Response 1 351321 10811 1
5 PRE001 f 3.11 27 Response 2 403607 713 1
6 PRE001 f 3.11 32 Response 1 467793 23709 1
Duration Uncertainty.1 ReqTime ReqDur Stim.Type Pair.Index Type Event.Code
1 10197 2 0 next incorrect 7 Picture RO09.jpg
2 8398 2 0 next incorrect 12 Picture RO20.jpg
3 8198 2 0 next hit 17 Picture RS28.jpg
4 10997 2 0 next hit 22 Picture AT26.jpg
5 800 2 0 next hit 27 Picture RS23.jpg
6 23794 2 0 next hit 32 Picture OF04.jpg
}}}
== Reduce() ==
   * is a higher order function (functional)
   * Reduce() uses a binary function (like rbind() or merge()) to combine successively the elements of a given list
   * it can be used if you have not only two but many data frames
== Reduce() ==
   * first we make up 4 artifical data frames
== Reduce() ==
{{{#!highlight r
> (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3)))
id day1
1 A 3
2 B 1
3 C 7
> (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4)))
id day2
1 A 8
2 C 2
3 E 5
4 F 3
> (d3 <- data.frame(id=LETTERS[c(2,4:6)],day3=sample(10,4)))
id day3
1 B 8
2 D 3
3 E 4
4 F 10
> (d4 <- data.frame(id=LETTERS[c(1:5)],day4=sample(10,5)))
id day4
1 A 2
2 B 7
3 C 8
4 D 9
5 E 1
}}}
== Reduce() ==
   * now we use Reduce() in combination with merge()
{{{#!highlight r
> Reduce(merge,list(d1,d2,d3,d4))
[1] id day1 day2 day3 day4
}}}
   * and what we get is an empty data frame
   * well this isn't exactly what we wanted, so why?
   * it is because the default behavior of merge() is set all=F, so we get only complete lines which is in this case - none
   * so we have to define a wrapper function which only change this argument to all=T
== Reduce() ==
   * now we use Reduce() in combination with merge()
{{{#!highlight r
> Reduce(function(x,y) { merge(x,y, all=T) },
+ list(d1,d2,d3,d4))
id day1 day2 day3 day4
1 A 3 8 NA 2
2 B 1 NA 8 7
3 C 7 2 NA 8
4 E NA 5 4 1
5 F NA 3 10 NA
6 D NA NA 3 9
}}}
   * which is exactly what we want
== Reduce() ==
   * a second example in combination with rbind()
{{{#!highlight r
> d4$day <- names(d4)[2]
> names(d4)[2] <- "score"
> Reduce(function(x,y) { y$day <- names(y)[2]
+ names(y)[2] <- "score"
+ rbind(x,y) } ,
+ list(d1,d2,d3), init = d4)
id score day
1 A 2 day4
2 B 7 day4
3 C 8 day4
4 D 9 day4
}}}
   * which is exactly what we want
== A second function ==
   * well that's better, but it is still boring to do this for every single file
   * so see what we have learned: the combination of lapply() and Reduce() can do the work
   * using dir{} we get all the files contained in a given directory
   * then we use lapply() together with our new function read.file()
== dir() ==
   * dir() without additional argument shows all files/directories in the working directory
{{{#!highlight r
> dir()
[1] "data" "function.r" "function.r~"
[4] "ggp1.pdf" "graphics.r" "linkimage.aux"
[7] "session2apply.aux" "session2apply.log" "session2apply.nav"
[10] "session2apply.out" "session2apply.pdf" "session2apply.snm"
[13] "session2apply.tex" "session2apply.tex~" "#session2apply.tex#"
[16] "session2apply.toc" "session2apply.vrb" "session2hadley.aux"
[19] "session2hadley.log" "session2hadley.nav" "session2hadley.out"
[22] "session2hadley.pdf" "session2hadley.snm" "session2hadley.tex"
[25] "session2hadley.tex~" "session2hadley.toc" "session2hadley.vrb"
[28] "solutionssession1.r" "solutionssession1.r~" "solutionssession2.r"
[31] "solutionssession2.r~" "#solutionssession2.r#"
}}}
== dir() ==
   * given a path dir() will show the content of resp folder
{{{#!highlight r
> dir("data")
[1] "posttest" "pretest" "training_1" "training_2" "training_3"
[6] "training_4" "training_5" "training_6" "training_7" "training_8"
}}}
== dir() ==
   * setting recursive to TRUE R will recurse into directories recursively through
{{{#!highlight r
> dir("data",recursive = T)
[1] "posttest/post_001.txt" "posttest/post_002.txt"
[3] "posttest/post_003.txt" "posttest/post_004.txt"
[5] "posttest/post_005.txt" "posttest/post_006.txt"
[7] "posttest/post_007.txt" "posttest/post_008.txt"
}}}
== dir() ==
   * setting full.names to TRUE R will give the full path
{{{#!highlight r
> dir("data",recursive = T, full.names = T)
[1] "data/posttest/post_001.txt" "data/posttest/post_002.txt"
[3] "data/posttest/post_003.txt" "data/posttest/post_004.txt"
[5] "data/posttest/post_005.txt" "data/posttest/post_006.txt"
}}}
== dir() ==
   * with pattern we can specify which files to show (regexpr), e.g. all r files
{{{#!highlight r
> dir(pattern = "\\.r$")
[1] "function.r" "graphics.r" "solutionssession1.r"
[4] "solutionssession2.r"
}}}
== dir() Exercise ==
   * create a variable files containing the names of all text files in the data directory, my editor creates temporary files beginning and ending by a hash key, make sure they are not contained in the list
== dir() Exercise ==
{{{#!highlight r
> dir("data",full.names = T, recursive = T,pattern = "txt$"
+ )
[1] "data/posttest/post_001.txt" "data/posttest/post_002.txt"
[3] "data/posttest/post_003.txt" "data/posttest/post_004.txt"
[5] "data/posttest/post_005.txt" "data/posttest/post_006.txt"
> files <- dir("data",full.names = T, recursive = T,pattern = "txt$")
}}}
== Read all files ==
Now we use lapply() and our function read.file() to read all files in files
{{{#!highlight r
> df.list <- lapply(files,read.file,skip=0)
[1] "read data/posttest/post_001.txt"
[1] "read data/posttest/post_002.txt"
[1] "read data/posttest/post_003.txt"
[1] "read data/posttest/post_004.txt"
[1] "read data/posttest/post_005.txt"
}}}
== Reading all files ==
   * the object df.list is a list containing 192 data frames
{{{#!highlight r
> sapply(df.list,class)
[1] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[6] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[11] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[16] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[21] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[26] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[31] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[36] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[41] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[46] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[51] "data.frame" "NULL" "data.frame" "data.frame" "data.frame"
[56] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[61] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[66] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[71] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
}}}
== The Function 2 ==
   * in a last step we use Reduce{} to combine these 192 data frames
{{{#!highlight r
> data <- Reduce(rbind,df.list)
> nrow(data)
[1] 12704
> table(data$Subject)
001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2
93 91 96 93 95 95 93 96
009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2
92 94 95 96 96 95 96 94
017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1
95 94 96 95 95 95 96 94
005_test1 006_test1 007_test1 008_test1 009_test1 010_test1 011_test1 012_test1
96 95 94 90 96 95 91 96
013_test1 014_test1 015_test1 016_test1 017_test1 018_test1 019_test1 020_test1
95 96 95 91 96 96 96 96
001_1 002_1 003_1 004_1 005_1 006_1 CHGU_1 008_1a
60 59 60 54 60 59 60 60
009_1 010_1 RMK_1 013_1 014_1 015_1 016_1 IJ2K_1
60 60 59 59 60 58 59 58
018_1 019_1 020_1 001_2 002_2 003_2 004_2 005_2
60 59 60 59 59 57 58 57
006_2 007_2 008_2 009_2 010_2 011_2 012_2 013_2
58 58 54 58 58 59 59 56
014_2 015_2 016_2 017_2 018_2 019_2 020_2 001_3
}}}
== The Function no 2 ==
   * so it is recommended to build again a function out of this
{{{#!highlight r
> read.files <- function(filesdir,skip=3,recursive=F,pattern="."){
+ files <- dir(filesdir,
+ full.names = T,
+ recursive = recursive,
+ pattern = pattern)
+ Reduce(rbind,lapply(files,read.file,skip=skip))}
> data <- read.files("data",recursive = T,skip=0,pattern = "\\.txt$")
[1] "read data/posttest/post_001.txt"
[1] "read data/posttest/post_002.txt"
[1] "read data/posttest/post_003.txt"
[1] "read data/posttest/post_004.txt"
[1] "read data/posttest/post_005.txt"
}}}
== The Subject column ==
   * table the Subject column again. What is the problem?
== The Subject column ==
{{{#!highlight r
> table(data$Subject)
001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2
93 91 96 93 95 95 93 96
009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2
92 94 95 96 96 95 96 94
017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1
95 94 96 95 95 95 96 94
}}}
   * subject and time coded in one variable
== The Subject column ==
   * we create two new variables using the str\_split() function (stringr package)
   * becaus str\_split() has a list containing a vector as result we have to use it in combination with sapply()
   * then correct some of the person ids
{{{#!highlight r
> data$persid <- sapply(data$Subject,function(x)
+ str_split(x,pattern = "_")[[1]][1])
> data$testid <- sapply(data$Subject,function(x)
+ str_split(x,pattern = "_")[[1]][2])
> data$persid[data$persid=="CHGU"] <- "007"
}}}
== The Subject column Exercises ==
   * there are some more wrong person ids: RMK - 011, IJ2K - 017, GA3K - 004, Kj6K - 006. Correct them!
== The Subject column Exercises ==
{{{#!highlight r
> data$persid[data$persid=="RMK"] <- "011"
> data$persid[data$persid=="IJ2K"] <- "017"
> data$persid[data$persid=="GA3K"] <- "004"
> data$persid[data$persid=="Kj6K"] <- "006"
}}}
== Merging ==
* now read in the file subjectsdemographics.txt using the appropriate command
* join the demographics with our data data frame (there is a little problem left - compare the persid and Subject columns)
== The Subject column Exercises ==
{{{#!highlight r
> persdat <- read.table("data/subjectdemographics.txt",
+ sep="\t",
+ header=T)
> persdat$Subject
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> unique(data$persid)
[1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011" "012"
[13] "013" "014" "015" "016" "017" "018" "019" "020"
> data$persid <- as.numeric(data$persid)
> unique(data$persid)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> data <- merge(persdat,data,by.x = "Subject",by.y = "persid",all=T)
In merge.data.frame(persdat, data, by.x = "Subject", by.y = "persid", :
column name ‘Subject’ is duplicated in the result
> head(data)
Subject Sex Age_PRETEST Subject Trial Event.Type Code Time TTime
1 1 f 3.11 001_test2 7 Response 2 103745 2575
2 1 f 3.11 001_test2 12 Response 2 156493 2737
3 1 f 3.11 001_test2 17 Response 2 214772 6630
4 1 f 3.11 001_test2 22 Response 1 262086 5957
5 1 f 3.11 001_test2 27 Response 2 302589 272
6 1 f 3.11 001_test2 32 Response 1 352703 7197
}}}
== Summary Graphics ==
Just run the code and try to understand it. We will cover the ggplot graphics in the next session.
{{{#!highlight r
> ggplot(data,aes(x=factor(Subject),fill=..count..)) +
+ geom_bar() +
+ facet_wrap(~testid)
}}}
<img alt='sesssion2/graph1.png' src='-1' />
== Summary Graphics ==
   * so there are problems in coding of the test id
   * we remove the letters at the end using str\_replace()
{{{#!highlight r
> data$testid <- str_replace(data$testid,"[a-z]$","")
> data$testid <- factor(data$testid,
+ levels=c("test1","1","2","3","4","5","6","7","8","test2"))
> table(data$Subject,data$testid)
test1 1 2 3 4 5 6 7 8 test2
1 95 60 59 60 59 59 60 60 60 93
2 95 59 59 58 60 60 60 60 60 91
3 96 60 57 60 60 60 60 59 58 96
4 94 54 58 60 60 55 53 60 58 93
5 96 60 57 60 60 60 60 60 0 95
6 95 59 58 59 58 59 55 54 55 95
7 94 60 58 60 58 59 60 59 59 93
8 90 60 54 55 60 60 60 59 60 96
9 96 60 58 59 57 0 0 58 56 92
10 95 60 58 58 60 60 58 0 0 94
11 91 59 59 60 60 57 58 60 60 95
}}}
== Summary Graphics ==
{{{#!highlight r
> ggplot(data,aes(x=factor(Subject),fill=..count..)) +
+ geom_bar() +
+ facet_wrap(~testid)
}}}
<img alt='sesssion2/graph2.png' src='-1' />
== Summary Graphics ==
And another one.
{{{#!highlight r
> ggplot(data,aes(x=testid,fill=Stim.Type)) +
+ geom_bar(position=position_fill()) +
+ facet_wrap(~Subject)
}}}
<img alt='sesssion2/graph3.png' src='-1' />

Introduction

Every function in R has three important characteristics:

  • a body (the code inside the function) - body()
  • arguments (the list of arguments which controls how you can call the function) - formals()
  • an environment (the “map” of the location of the function’s variables) - environment()

You can see all three parts if you type the name of the function without brackets. Exceptions are primitives. Primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.

Functions

   1 > chisq.test
   2 function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)),
   3 DNAME <- deparse(substitute(x))
   4 if (is.data.frame(x))
   5 expected = E, residuals = (x - E)/sqrt(E), stdres = (x - ...
   6 
   7 > sum
   8 function (..., na.rm = FALSE)  .Primitive("sum")

Function Arguments

Arguments are matched

  • first by exact name (perfect matching)
  • then by prefix matching
  • and finally by position.

By default, R function arguments are lazy, they are only evaluated if they are actually used:

   1 > f <- function(x) {
   2 f <- function(x) {
   3 +   10
   4 + }
   5 > f(stop("This is an error!"))
   6 [1] 10
   7 >

Implicit Loops

Introduction

A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure. In R this is mainly done by the higher order functions:

  • lapply()
  • sapply()
  • apply()
  • tapply()

lapply()

  • The functions lapply and sapply are similar, their first argument can be a list, data frame, matrix or vector, the second argument the function to "apply". The former return a list (hence "l") and the latter tries to simplify the results (hence the "s"). For example:

   1 > lapply(dat,mean)
   2 [1] 6753.636
   3 [1] 5433.182
   4 > sapply(dat,mean)

apply()

  • apply() this function can be applied to an array. Its argument is the array, the second the dimension/s where we want to apply a function and the third is the function. For example

   1 > x<-1:12
   2 > dim(x)<-c(2,2,3)
   3 > apply(x,3,quantile) ## calculate the quantiles

tapply()

  • The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.

For example in the quine data frame, we can summarize Days classify by Eth and Lrn as follows:

   1 > tapply(Days,list(Eth,Lrn),mean)
   2 AL       SL
   3 A 18.57500 24.89655
   4 N 13.25581 10.82353
  • the class() function shows the class of an object use it in combination with lapply() to get the classes of the columns of the quine data frame
  • do the same with sapply() what is the difference
  • try to combine this with what you learned about indexing and create a new data frame quine2 only containing the columns which are factors
  • calculate the row and column means of the below defined matrix m using the apply function PS: in real life application use the rowMeans() and colMeans() function

   1 m <- matrix(rnorm(100),nrow=10)
  • use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines) * sometimes the aggregate() function is more convenient; note the use of  #!latex $\sim$; it is read as 'is dependent on'and it is extensively used in modelling

   1 > aggregate(Days ~ Sex + Eth, data=quine,mean)
   2 Sex Eth     Days
   3 1   F   A 20.92105
   4 2   M   A 21.61290
   5 3   F   N 10.07143
   6 4   M   N 14.71429
   7 > aggregate(Days ~ Sex + Eth, data=quine,summary)
   8 Sex Eth Days.Min. Days.1st Qu. Days.Median Days.Mean Days.3rd Qu. Days.Max.
   9 1   F   A      0.00         5.25       13.50     20.92        30.25     81.00
  10 2   M   A      2.00         9.50       16.00     21.61        33.00     57.00
  11 3   F   N      0.00         5.00        7.00     10.07        14.00     37.00
  12 4   M   N      0.00         3.50        8.00     14.71        19.50     69.00

Function Exercises (Verzani)

  • Write a function to compute the average distance from the mean for some data vector.
  • Write a function f() which finds the average of the x values after squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing f(1:10)
  • An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x \%\% 2. Use this to write a function iseven(). How would you write isodd()?
  • Write a function isprime() that checks if a number x is prime by dividing x by all values from 2,...,x-1 then checking to see if there is a remainder of 0.

Function Exercises (Verzani) Solutions

  • Write a function to compute the average distance from the mean for some data vector.

   1 > avg.dist <- function(x){
   2 +     xbar <- mean(x)
   3 +     mean(abs(x-xbar))
   4 + }

Function Exercises (Verzani) Solutions

  • Write a function f() which finds the average of the x values aufter squaring and substracts the square of the average of the numbers. Verify this output will always be non-negative by computing \texttt{f(1:10)}

   1 > f <- function(x){
   2 +     mean(x**2) - mean(x)**2
   3 + }
   4 > f(1:10)
   5 [1] 8.25

Function Exercises (Verzani) Solutions

  • An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax \texttt{ x \%\% 2}. Use this to write a function iseven(). How would you write isodd()?

   1 > iseven <- function(x){
   2 +     x %% 2 == 0
   3 + }
   4 > iseven(1:10)
   5 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
   6 > isodd <- function(x){
   7 +     !iseven(x)
   8 + }
   9 > isodd(1:10)
  10 [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

Function Exercises (Verzani) Solutions

  • Write a function isprime() that checks if a number x is prime by dividing x by all values \texttt{$2,\ldots,x-1}}}} then checking to see if there is a remainder of 0.

   1 > isprime <- function(x){
   2 +     if(x == 2) return(TRUE)
   3 +     !(0 %in% (x %% (2:(x-1))))
   4 + }
   5 > isprime(2)
   6 [1] TRUE
   7 > isprime(5)
   8 [1] TRUE
   9 > isprime(15)
  10 [1] FALSE

RstatisTik/RstatisTikPortal/RcourSe/CourseOutline/FunctionsInR/ApplyR (zuletzt geändert am 2015-05-01 10:48:36 durch mandy.vogel@googlemail.com)