Revision 1 of 2015-05-01 08:23:34
Introduction
Implicit Loops
A common application of loops is to apply a function to each element of a set of values and collect the results in a single structure. In R this is done by the functions:
- lapply()
- sapply()
- apply()
- tapply()
lapply()
- The functions lapply() and sapply() are similar: their first argument can be a list, data frame, matrix or vector, and the second argument is the function to "apply". lapply() returns a list (hence the "l"), while sapply() tries to simplify the result (hence the "s"). For example:
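A minimal sketch of the difference, assuming a small toy list (the object names `values`, `res_list` and `res_vec` are made up for illustration and are not from the original slides):

```r
# Toy input (an assumption, not part of the course data)
values <- list(a = 1:5, b = c(2, 4, 6), c = 10)

# lapply() always returns a list with one element per input element
res_list <- lapply(values, mean)

# sapply() runs the same computation but simplifies the result,
# here to a named numeric vector
res_vec <- sapply(values, mean)

print(res_list)
print(res_vec)
```

The list keeps every result in its own element, while the simplified vector is usually easier to index or plot.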
apply()
- apply() can be applied to an array (including a matrix). Its first argument is the array, the second the dimension(s) over which we want to apply a function, and the third is the function. For example:
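As a small sketch, assuming a 2 x 3 toy matrix (`m_demo` is a made-up name), dimension 1 refers to rows and dimension 2 to columns:

```r
# Toy matrix (an assumption): rows are (1, 3, 5) and (2, 4, 6)
m_demo <- matrix(1:6, nrow = 2)

# Second argument 1: apply the function to every row
row_sums <- apply(m_demo, 1, sum)

# Second argument 2: apply the function to every column
col_max <- apply(m_demo, 2, max)

print(row_sums)
print(col_max)
```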
tapply()
- The function tapply() allows you to create tables (hence the "t") of the value of a function on subgroups defined by its second argument, which can be a factor or a list of factors.
For example, in the quine data frame we can summarize Days classified by Eth and Lrn as follows:
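One way to write such a call, assuming the quine data frame is the one shipped with the MASS package and that the mean is the summary of interest (the slides do not say which function was used):

```r
library(MASS)  # assumption: quine comes from the MASS package

# Mean number of days absent for every combination of Eth and Lrn;
# the list of factors defines the rows and columns of the resulting table
days_by_group <- tapply(quine$Days, list(quine$Eth, quine$Lrn), mean)

print(days_by_group)
```

Passing a single factor instead of a list would give a one-dimensional table.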
- the class() function shows the class of an object; use it in combination with lapply() to get the classes of the columns of the quine data frame
- do the same with sapply(); what is the difference?
- try to combine this with what you learned about indexing and create a new data frame quine2 containing only the columns which are factors
- calculate the row and column means of the matrix m defined below using the apply() function (PS: in real-life applications use the rowMeans() and colMeans() functions)
m <- matrix(rnorm(100),nrow=10)
- use tapply() to summarise the number of missing days at school per Ethnicity and/or per Sex (three lines)
- sometimes the aggregate() function is more convenient; note the use of ~, which is read as 'is dependent on' and is used extensively in modelling
> aggregate(Days ~ Sex + Eth, data=quine,mean)
  Sex Eth     Days
1   F   A 20.92105
2   M   A 21.61290
3   F   N 10.07143
4   M   N 14.71429
> aggregate(Days ~ Sex + Eth, data=quine,summary)
  Sex Eth Days.Min. Days.1st Qu. Days.Median Days.Mean Days.3rd Qu. Days.Max.
1   F   A      0.00         5.25       13.50     20.92        30.25     81.00
2   M   A      2.00         9.50       16.00     21.61        33.00     57.00
3   F   N      0.00         5.00        7.00     10.07        14.00     37.00
4   M   N      0.00         3.50        8.00     14.71        19.50     69.00
Functions
Every function in R has three important characteristics:
- a body (the code inside the function) - body()
- arguments (the list of arguments which controls how you can call the function) - formals()
- an environment (the “map” of the location of the function’s variables) - environment()
You can see all three parts if you type the name of the function without brackets. Primitives are the exception: primitive functions, like sum(), call C code directly with .Primitive() and contain no R code. Therefore their formals(), body(), and environment() are all NULL.
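A short sketch of how the three parts can be inspected, assuming a made-up function `add_one()` that is not part of the course material:

```r
# Hypothetical example function
add_one <- function(x, step = 1) {
  x + step
}

fn_body <- body(add_one)        # the code inside the function
fn_args <- formals(add_one)     # the argument list, including defaults
fn_env  <- environment(add_one) # where the function looks up its variables

print(fn_body)
print(names(fn_args))

# For a primitive such as sum() all three accessors return NULL
print(is.null(body(sum)))
print(is.null(formals(sum)))
print(is.null(environment(sum)))
```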
Function Arguments
Arguments are matched
- first by exact name (perfect matching)
- then by prefix matching
- and finally by position.
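The three matching rules can be tried out with a small sketch; `describe()` and its arguments are hypothetical names chosen for illustration:

```r
# Hypothetical function with three named arguments
describe <- function(value, verbose = FALSE, width = 10) {
  paste(value, verbose, width)
}

# Exact names: the order of the arguments does not matter
exact <- describe(width = 5, value = "a", verbose = TRUE)

# Prefix matching: 'verb' and 'wid' are unique prefixes of the argument names
prefix <- describe("a", verb = TRUE, wid = 5)

# Positional matching: arguments are assigned in the order of definition
positional <- describe("a", TRUE, 5)

print(exact)
print(prefix)
print(positional)
```

All three calls pass the same values; in scripts, full argument names are generally easier to read than prefixes.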
By default, R function arguments are lazy: they are only evaluated if they are actually used:
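A minimal sketch of lazy evaluation, assuming two made-up functions `lazy_demo()` and `forced_demo()`:

```r
# y is never used inside the function, so it is never evaluated
lazy_demo <- function(x, y) {
  x * 2
}

# The stop() call passed as y is not triggered
result <- lazy_demo(21, stop("y was evaluated"))
print(result)

# force() evaluates an argument explicitly, so here the error does occur
forced_demo <- function(x, y) {
  force(y)
  x * 2
}

err <- tryCatch(
  forced_demo(21, stop("y was evaluated")),
  error = function(e) conditionMessage(e)
)
print(err)
```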
Function Exercises (Verzani)
- Write a function to compute the average distance from the mean for some data vector.
- Write a function f() which finds the average of the x values after squaring and subtracts the square of the average of the numbers. Verify that this output will always be non-negative by computing f(1:10).
- An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x %% 2. Use this to write a function iseven(). How would you write isodd()?
- Write a function isprime() that checks if a number x is prime by dividing x by all values 2, ..., x-1 and then checking to see if there is a remainder of 0.
Read in the file
Remove empty line
> tmp <- tmp[!is.na(tmp$Subject),]
Remove spaces
Remove unnecessary spaces from character vectors/factors
Find/Remove breaks
Find/Remove first/last rows
Extract Responses
> responses <- which(tmp$Code %in% c(1,2))
> events <- responses-1
> tmp$Type <- NA
> tmp$Type[responses] <- as.character(tmp$Event.Type[events])
> head(tmp)
   Subject Trial Event.Type     Code   Time TTime Uncertainty Duration
6   PRE001     7    Picture RO09.jpg 168954     0           1    10197
7   PRE001     7   Response        2 178963 10009           1       NA
11  PRE001    12    Picture RO20.jpg 230338     0           1     8398
12  PRE001    12   Response        1 238680  8342           1       NA
16  PRE001    17    Picture RS28.jpg 289723     0           1     8198
17  PRE001    17   Response        2 297789  8066           1       NA
6    2  0 next incorrect  7    <NA>
7   NA NA <NA>      <NA> NA Picture
11   2  0 next incorrect 12    <NA>
12  NA NA <NA>      <NA> NA Picture
16   2  0 next       hit 17    <NA>
17  NA NA <NA>      <NA> NA Picture
Moving Information
Moving all (necessary) information to the response lines.
> tmp$Event.Code <- NA
> tmp$Event.Code[responses] <- as.character(tmp$Code[events])
> tmp$Stim.Type[responses] <- as.character(tmp$Stim.Type[events])
> tmp$Duration[responses] <- as.character(tmp$Duration[events])
> tmp$Uncertainty.1[responses] <- as.character(tmp$Uncertainty.1[events])
> tmp$ReqTime[responses] <- as.character(tmp$ReqTime[events])
> tmp$ReqDur[responses] <- as.character(tmp$ReqDur[events])
> tmp$Pair.Index[responses] <- as.character(tmp$Pair.Index[events])
> head(tmp)
   Subject Trial Event.Type     Code   Time TTime Uncertainty Duration
6   PRE001     7    Picture RO09.jpg 168954     0           1    10197
7   PRE001     7   Response        2 178963 10009           1    10197
11  PRE001    12    Picture RO20.jpg 230338     0           1     8398
12  PRE001    12   Response        1 238680  8342           1     8398
16  PRE001    17    Picture RS28.jpg 289723     0           1     8198
17  PRE001    17   Response        2 297789  8066           1     8198
6   2 0 next incorrect  7    <NA>     <NA>
7   2 0 next incorrect  7 Picture RO09.jpg
11  2 0 next incorrect 12    <NA>     <NA>
12  2 0 next incorrect 12 Picture RO20.jpg
16  2 0 next       hit 17    <NA>     <NA>
17  2 0 next       hit 17 Picture RS28.jpg
Keep response lines
> tmp <- tmp[tmp$Event.Type=="Response" & !is.na(tmp$Type),]
> tmp <- tmp[tmp$Type=="Picture" & !is.na(tmp$Type),]
> head(tmp)
   Subject Trial Event.Type Code   Time TTime Uncertainty Duration
7   PRE001     7   Response    2 178963 10009           1    10197
12  PRE001    12   Response    1 238680  8342           1     8398
17  PRE001    17   Response    2 297789  8066           1     8198
22  PRE001    22   Response    1 351321 10811           1    10997
27  PRE001    27   Response    2 403607   713           1      800
32  PRE001    32   Response    1 467793 23709           1    23794
7   2 0 next incorrect  7 Picture RO09.jpg
12  2 0 next incorrect 12 Picture RO20.jpg
17  2 0 next       hit 17 Picture RS28.jpg
22  2 0 next       hit 22 Picture AT26.jpg
27  2 0 next       hit 27 Picture RS23.jpg
32  2 0 next       hit 32 Picture OF04.jpg
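The steps extract, move and keep can be replayed on a tiny made-up log. This is only a sketch: `log_demo` and its four rows are assumptions, and the real file has many more columns.

```r
# Toy presentation log (an assumption): every Picture line is
# followed by its Response line
log_demo <- data.frame(
  Event.Type = c("Picture", "Response", "Picture", "Response"),
  Code       = c("RO09.jpg", "2", "RO20.jpg", "1"),
  stringsAsFactors = FALSE
)

# 1. Extract: response lines carry the codes 1 or 2,
#    the corresponding event is the line directly before
responses <- which(log_demo$Code %in% c(1, 2))
events <- responses - 1

# 2. Move: copy the event information down to the response lines
log_demo$Type <- NA
log_demo$Type[responses] <- log_demo$Event.Type[events]
log_demo$Event.Code <- NA
log_demo$Event.Code[responses] <- log_demo$Code[events]

# 3. Keep: only response lines that answer a picture
kept <- log_demo[log_demo$Event.Type == "Response" & !is.na(log_demo$Type), ]
kept <- kept[kept$Type == "Picture", ]

print(kept)
```

Indexing with `responses - 1` relies on the response always following its event directly, which is why empty lines and breaks are removed first.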
The Function
- it would be tedious to repeat every step for all of the files
- if we look through the steps, the only important thing that we have to change is the file name
- so we rather use a canned version of our procedure that depends on the file name and the number of lines to skip: we create a function read.file(file):
tmp <- read.table(file,skip = skip,sep = "\t",
The Function (continued)
- we can use this function now to read in the file
- and get the processed data frame in one step
- by setting the parameter skip we can read both versions of the file (and should get the same result)
The Function Exercise
- run the function using source()
- use the function to read in ../session1/session1data/pre001.txt and data/pretest/pre_001.txt
- use some summary functions like table() or summary() to check if they contain the same information
We will soon learn about a function to compare data frames more exactly.
rbind()
- rbind() can be used to combine two data frames (or matrices) by adding rows; the column names and types must be the same for the two objects
cbind()
- cbind() can be used to combine two data frames (or matrices) by adding columns; the number of rows must be the same for the two objects
- it is not recommended to use cbind() for combining data frames
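A small sketch of both functions, assuming two made-up data frames `first` and `second` with identical columns:

```r
# Toy data frames (assumptions) with the same column names and types
first  <- data.frame(id = 1:2, score = c(10, 20))
second <- data.frame(id = 3:4, score = c(30, 40))

# rbind() stacks the rows: 4 rows, 2 columns
stacked <- rbind(first, second)

# cbind() adds columns: 2 rows, 3 columns
# note that rows are paired purely by position, not by any key
side_by_side <- cbind(first, group = c("a", "b"))

print(stacked)
print(side_by_side)
```

The purely positional pairing in cbind() is the reason it is risky for data frames whose rows may be ordered differently.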
merge()
- merge() is the command of choice for merging or joining data frames
- it is the equivalent of join in sql
- there are four cases
- inner join
- left outer join
- right outer join
- full outer join
inner join
- inner join means: keep only the cases present in both of the data frames
left outer join
- left outer join means: keep all cases of the left data frame no matter if they are present in the right data frame (all.x=T)
right outer join
- right outer join means: keep all cases of the right data frame no matter if they are present in the left data frame (all.y=T)
full outer join
- full outer join means: keep all cases of both data frames (all=T)
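The four cases can be compared on two toy data frames; `left` and `right` are made-up names and share only the column id:

```r
# Toy data frames (assumptions): ids B and C occur in both
left  <- data.frame(id = c("A", "B", "C"), x = 1:3)
right <- data.frame(id = c("B", "C", "D"), y = 4:6)

inner       <- merge(left, right)                # ids B, C
left_outer  <- merge(left, right, all.x = TRUE)  # ids A, B, C
right_outer <- merge(left, right, all.y = TRUE)  # ids B, C, D
full_outer  <- merge(left, right, all = TRUE)    # ids A, B, C, D

print(inner)
print(left_outer)
print(right_outer)
print(full_outer)
```

Cases without a partner in the other data frame get NA in the columns that come from that data frame.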
merge()
- if not stated otherwise, R uses the intersection of the column names of both data frames, in our case only id
- you can specify these columns directly with by=c("colname1","colname2") if the columns have identical names, or
- with by.x=c("colname1.x","colname2.x") together with the matching by.y argument if the names differ
merge() Exercise
- now read in the file personendaten.txt using the appropriate command
- join the demographics with our pre1 data frame (even though it does not make sense now)
> persdat <- read.table("../session1/session1data/personendaten.txt",
+                       sep="\t",
+                       header=T)
> pre1 <- merge(persdat,pre1,all.y = T)
> head(pre1)
  Subject Sex Age_PRETEST Trial Event.Type Code   Time TTime Uncertainty
1  PRE001   f        3.11     7   Response    2 178963 10009           1
2  PRE001   f        3.11    12   Response    1 238680  8342           1
3  PRE001   f        3.11    17   Response    2 297789  8066           1
4  PRE001   f        3.11    22   Response    1 351321 10811           1
5  PRE001   f        3.11    27   Response    2 403607   713           1
6  PRE001   f        3.11    32   Response    1 467793 23709           1
  Duration Uncertainty.1 ReqTime ReqDur Stim.Type Pair.Index    Type Event.Code
1    10197             2       0   next incorrect          7 Picture   RO09.jpg
2     8398             2       0   next incorrect         12 Picture   RO20.jpg
3     8198             2       0   next       hit         17 Picture   RS28.jpg
4    10997             2       0   next       hit         22 Picture   AT26.jpg
5      800             2       0   next       hit         27 Picture   RS23.jpg
6    23794             2       0   next       hit         32 Picture   OF04.jpg
Reduce()
- Reduce() is a higher-order function (a functional)
- Reduce() uses a binary function (like rbind() or merge()) to successively combine the elements of a given list
- it can be used if you have not only two but many data frames
- first we make up 4 artificial data frames
> (d1 <- data.frame(id=LETTERS[c(1,2,3)],day1=sample(10,3)))
  id day1
1  A    3
2  B    1
3  C    7
> (d2 <- data.frame(id=LETTERS[c(1,3,5,6)],day2=sample(10,4)))
  id day2
1  A    8
2  C    2
3  E    5
4  F    3
> (d3 <- data.frame(id=LETTERS[c(2,4:6)],day3=sample(10,4)))
  id day3
1  B    8
2  D    3
3  E    4
4  F   10
> (d4 <- data.frame(id=LETTERS[c(1:5)],day4=sample(10,5)))
  id day4
1  A    2
2  B    7
3  C    8
4  D    9
5  E    1
- now we use Reduce() in combination with merge()
- and what we get is an empty data frame
- well, this isn't exactly what we wanted, so why?
- it is because merge() defaults to all=F, so we only get the ids that are present in every data frame, which in this case is none
- so we have to define a wrapper function which only changes this argument to all=T
- now we use Reduce() in combination with this wrapper
- which is exactly what we want
- a second example in combination with rbind()
- which is exactly what we want
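The Reduce() steps can be sketched as follows. The four data frames reuse the ids and values printed above with fixed numbers instead of sample(), and the wrapper name `merge_all()` is an assumption, not necessarily the name used in the course.

```r
# The four artificial data frames, with fixed values instead of sample()
d1 <- data.frame(id = c("A", "B", "C"), day1 = c(3, 1, 7))
d2 <- data.frame(id = c("A", "C", "E", "F"), day2 = c(8, 2, 5, 3))
d3 <- data.frame(id = c("B", "D", "E", "F"), day3 = c(8, 3, 4, 10))
d4 <- data.frame(id = c("A", "B", "C", "D", "E"), day4 = c(2, 7, 8, 9, 1))
frames <- list(d1, d2, d3, d4)

# Plain merge() does inner joins: no id occurs in all four data frames,
# so the result has no rows
inner_result <- Reduce(merge, frames)

# Hypothetical wrapper that only switches merge() to a full outer join
merge_all <- function(x, y) merge(x, y, all = TRUE)
full_result <- Reduce(merge_all, frames)

# Second example: rbind() stacks data frames that share the same columns
stacked <- Reduce(rbind, list(d1, d1, d1))

print(nrow(inner_result))
print(full_result)
print(nrow(stacked))
```

Reduce() calls the binary function on the first two elements, then on that result and the third element, and so on, so the wrapper only ever sees two data frames at a time.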
A second function
- well, that's better, but it is still boring to do this for every single file
- so let's use what we have learned: the combination of lapply() and Reduce() can do the work
- using dir() we get all the files contained in a given directory
- then we use lapply() together with our new function read.file()
dir()
- dir() without additional argument shows all files/directories in the working directory
> dir()
 [1] "data"                  "function.r"            "function.r~"
 [4] "ggp1.pdf"              "graphics.r"            "linkimage.aux"
 [7] "session2apply.aux"     "session2apply.log"     "session2apply.nav"
[10] "session2apply.out"     "session2apply.pdf"     "session2apply.snm"
[13] "session2apply.tex"     "session2apply.tex~"    "#session2apply.tex#"
[16] "session2apply.toc"     "session2apply.vrb"     "session2hadley.aux"
[19] "session2hadley.log"    "session2hadley.nav"    "session2hadley.out"
[22] "session2hadley.pdf"    "session2hadley.snm"    "session2hadley.tex"
[25] "session2hadley.tex~"   "session2hadley.toc"    "session2hadley.vrb"
[28] "solutionssession1.r"   "solutionssession1.r~"  "solutionssession2.r"
[31] "solutionssession2.r~"  "#solutionssession2.r#"
- given a path, dir() will show the content of the respective folder
- setting recursive to TRUE, R will recurse into the subdirectories
- setting full.names to TRUE, R will return the full path
- with pattern we can specify which files to show (using a regular expression), e.g. all R files
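These arguments can be tried on a scratch folder. The folder `dir_demo` inside tempdir() and the three empty files are assumptions made for this sketch only.

```r
# Build a small scratch folder (hypothetical names) with one subfolder
demo_dir <- file.path(tempdir(), "dir_demo")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.r", "b.txt", "sub/c.r")))

# Content of the given folder only
top_level <- dir(demo_dir)

# Also descend into subfolders; directories themselves are not listed
all_files <- dir(demo_dir, recursive = TRUE)

# Return paths that include the folder name
full_paths <- dir(demo_dir, full.names = TRUE)

# Only files whose name ends in ".r" (the pattern is a regular expression)
r_files <- dir(demo_dir, recursive = TRUE, pattern = "\\.r$")

print(top_level)
print(all_files)
print(full_paths)
print(r_files)
```

Escaping the dot and anchoring with `$` keeps editor backups such as "a.r~" out of the result.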
dir() Exercise
- create a variable files containing the names of all text files in the data directory; my editor creates temporary files beginning and ending with a hash sign, so make sure they are not contained in the list
> dir("data",full.names = T, recursive = T,pattern = "txt$"
+ )
[1] "data/posttest/post_001.txt" "data/posttest/post_002.txt"
[3] "data/posttest/post_003.txt" "data/posttest/post_004.txt"
[5] "data/posttest/post_005.txt" "data/posttest/post_006.txt"
> files <- dir("data",full.names = T, recursive = T,pattern = "txt$")
Read all files
Now we use lapply() and our function read.file() to read all files in files
Reading all files
- the object df.list is a list containing 192 data frames
> sapply(df.list,class)
 [1] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
 [6] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[11] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[16] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[21] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[26] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[31] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[36] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[41] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[46] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[51] "data.frame" "NULL"       "data.frame" "data.frame" "data.frame"
[56] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[61] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[66] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
[71] "data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
The Function 2
- in a last step we use Reduce() to combine these 192 data frames
> data <- Reduce(rbind,df.list)
> nrow(data)
[1] 12704
> table(data$Subject)
001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2
       93        91        96        93        95        95        93        96
009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2
       92        94        95        96        96        95        96        94
017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1
       95        94        96        95        95        95        96        94
005_test1 006_test1 007_test1 008_test1 009_test1 010_test1 011_test1 012_test1
       96        95        94        90        96        95        91        96
013_test1 014_test1 015_test1 016_test1 017_test1 018_test1 019_test1 020_test1
       95        96        95        91        96        96        96        96
    001_1     002_1     003_1     004_1     005_1     006_1    CHGU_1    008_1a
       60        59        60        54        60        59        60        60
    009_1     010_1     RMK_1     013_1     014_1     015_1     016_1    IJ2K_1
       60        60        59        59        60        58        59        58
    018_1     019_1     020_1     001_2     002_2     003_2     004_2     005_2
       60        59        60        59        59        57        58        57
    006_2     007_2     008_2     009_2     010_2     011_2     012_2     013_2
       58        58        54        58        58        59        59        56
    014_2     015_2     016_2     017_2     018_2     019_2     020_2     001_3
The Function no 2
- so it is advisable to wrap these steps in a function again
> read.files <- function(filesdir,skip=3,recursive=F,pattern="."){
+     files <- dir(filesdir,
+                  full.names = T,
+                  recursive = recursive,
+                  pattern = pattern)
+     Reduce(rbind,lapply(files,read.file,skip=skip))}
> data <- read.files("data",recursive = T,skip=0,pattern = "\\.txt$")
[1] "read data/posttest/post_001.txt"
[1] "read data/posttest/post_002.txt"
[1] "read data/posttest/post_003.txt"
[1] "read data/posttest/post_004.txt"
[1] "read data/posttest/post_005.txt"
The Subject column
- table the Subject column again. What is the problem?
> table(data$Subject)
001_test2 002_test2 003_test2 004_test2 005_test2 006_test2 007_test2 008_test2
       93        91        96        93        95        95        93        96
009_test2 010_test2 011_test2 012_test2 013_test2 014_test2 015_test2 016_test2
       92        94        95        96        96        95        96        94
017_test2 018_test2 019_test2 020_test2 001_test1 002_test1 003_test1 004_test1
       95        94        96        95        95        95        96        94
- subject and test time are coded in one variable
- we create two new variables using the str_split() function (stringr package)
- because str_split() returns a list containing one vector per element, we have to use it in combination with sapply()
- then we correct some of the person ids
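A minimal sketch of the splitting step. To stay free of package dependencies it uses base R strsplit() as a stand-in for stringr's str_split(); the vector `subject` holds three values taken from the table above, and the correction RMK to 011 is the one named in the exercise list.

```r
# Three Subject values as they appear in the combined data
subject <- c("001_test2", "002_1", "RMK_1")

# strsplit() returns a list with one character vector per element
# (base R stand-in for stringr::str_split())
parts <- strsplit(subject, "_", fixed = TRUE)

# Pick the first and second piece of every element
persid <- sapply(parts, function(p) p[1])
testid <- sapply(parts, function(p) p[2])

# Correct a wrongly coded person id
persid[persid == "RMK"] <- "011"

print(persid)
print(testid)
```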
The Subject column Exercises
- there are some more wrong person ids: RMK - 011, IJ2K - 017, GA3K - 004, Kj6K - 006. Correct them!
Merging
- now read in the file subjectsdemographics.txt using the appropriate command
- join the demographics with our data data frame (there is a little problem left - compare the persid and Subject columns)
> persdat <- read.table("data/subjectdemographics.txt",
+                       sep="\t",
+                       header=T)
> persdat$Subject
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> unique(data$persid)
 [1] "001" "002" "003" "004" "005" "006" "007" "008" "009" "010" "011" "012"
[13] "013" "014" "015" "016" "017" "018" "019" "020"
> data$persid <- as.numeric(data$persid)
> unique(data$persid)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> data <- merge(persdat,data,by.x = "Subject",by.y = "persid",all=T)
In merge.data.frame(persdat, data, by.x = "Subject", by.y = "persid", :
  column name ‘Subject’ is duplicated in the result
> head(data)
  Subject Sex Age_PRETEST   Subject Trial Event.Type Code   Time TTime
1       1   f        3.11 001_test2     7   Response    2 103745  2575
2       1   f        3.11 001_test2    12   Response    2 156493  2737
3       1   f        3.11 001_test2    17   Response    2 214772  6630
4       1   f        3.11 001_test2    22   Response    1 262086  5957
5       1   f        3.11 001_test2    27   Response    2 302589   272
6       1   f        3.11 001_test2    32   Response    1 352703  7197
Summary Graphics
Just run the code and try to understand it. We will cover the ggplot graphics in the next session.
[Figure: sesssion2/graph1.png]
- so there are problems in the coding of the test id
- we remove the letters at the end using str_replace()
> data$testid <- str_replace(data$testid,"[a-z]$","")
> data$testid <- factor(data$testid,
+                       levels=c("test1","1","2","3","4","5","6","7","8","test2"))
> table(data$Subject,data$testid)
     test1  1  2  3  4  5  6  7  8 test2
  1     95 60 59 60 59 59 60 60 60    93
  2     95 59 59 58 60 60 60 60 60    91
  3     96 60 57 60 60 60 60 59 58    96
  4     94 54 58 60 60 55 53 60 58    93
  5     96 60 57 60 60 60 60 60  0    95
  6     95 59 58 59 58 59 55 54 55    95
  7     94 60 58 60 58 59 60 59 59    93
  8     90 60 54 55 60 60 60 59 60    96
  9     96 60 58 59 57  0  0 58 56    92
  10    95 60 58 58 60 60 58  0  0    94
  11    91 59 59 60 60 57 58 60 60    95
[Figure: sesssion2/graph2.png]
And another one.
[Figure: sesssion2/graph3.png]