Dynamic subsets of data frames in R
I thought it would be cool to be able to define some subsets of a data frame without having to specify the query each time, but still have them update dynamically every time the data frame changes. Like views in a database.
Here’s the solution I’m using. The example application is parsing an Apache server log file.
require(plyr)
log <- read.table(file='httpd.combine.20120509')
# in the file I used, there was a space between the time and the time zone, creating two fields.
names(log) <- c('host', 'identity', 'user', 'time', 'V5', 'request', 'status', 'bytes', 'referer', 'agent')
# Paste the two fields together
log$time <- paste(log$time, log$V5, sep=' ')
# remove the extra field
log$V5 <- NULL
# extract the URIs from the request field by stripping the leading HTTP method and the trailing protocol
log$uri <- gsub('^[A-Z]+ | HTTP/.*$', '', log$request)
# convert the timestamp to something R can work with (POSIXct is safer than POSIXlt as a data frame column)
log$rtime <- as.POSIXct(strptime(log$time, '[%d/%b/%Y:%H:%M:%S'))
# Identify the obvious bots from the agent field
log$isbot <- grepl('bot', log$agent)
views <- function(x, subset.code) {
  switch(subset.code,
    # keep only requests whose agent was not flagged as a bot
    'nobots' = subset(x, !isbot),
    # keep only requests for URIs ending in .pdf
    'pdfs' = subset(x, grepl('\\.pdf$', uri))
  )
}
views(log, 'pdfs')
That views function is where the action is. Each subset is specified and assigned a nickname. For example, nobots shows only records that don’t contain the string “bot” in the user-agent, and pdfs shows all the pdfs that were requested.
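To see the view-like behaviour on something smaller than a real log, here's a minimal sketch with a made-up data frame. The `toy` data, its values, and the `toy_views` name are invented for illustration and are not part of the log above. The function is restated so the block runs on its own, with `nobots` written as `!isbot` to match the description. Because the subset is re-evaluated on every call, new rows show up without redefining anything.

```r
# Toy data standing in for the parsed log (values are made up)
toy <- data.frame(
  uri   = c('/index.html', '/paper.pdf', '/robots.txt'),
  agent = c('Mozilla/5.0', 'Mozilla/5.0', 'Googlebot/2.1'),
  stringsAsFactors = FALSE
)
toy$isbot <- grepl('bot', toy$agent)

# Same idea as views(), restated so this block is self-contained
toy_views <- function(x, subset.code) {
  switch(subset.code,
    'nobots' = subset(x, !isbot),
    'pdfs'   = subset(x, grepl('\\.pdf$', uri))
  )
}

# One PDF request so far
nrow(toy_views(toy, 'pdfs'))

# Append a row; the next call picks it up automatically
toy <- rbind(toy, data.frame(uri = '/slides.pdf', agent = 'curl/7.21',
                             isbot = FALSE, stringsAsFactors = FALSE))
nrow(toy_views(toy, 'pdfs'))
```

The second call counts the new row because the function takes the data frame as an argument and filters it fresh each time, which is what makes it behave like a database view and not like a saved copy.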
Rather than copy/pasting subset(log, grepl('\\.pdf$', log$uri)) whenever you want to work with the pdfs, you just write views(log, 'pdfs'). Not a huge difference, but for more complex subset queries it comes in handy.
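For those more complex queries, one possible variation (not the approach used above, just a sketch) is to keep each view as a small predicate function in a named list. The names `view_defs`, `views2`, `human_pdfs`, and the `demo` data are all hypothetical. Each predicate returns a logical mask over the rows, so compound conditions are easy to build.

```r
# Hypothetical registry: each entry returns a logical row mask for data frame x
view_defs <- list(
  nobots     = function(x) !x$isbot,
  pdfs       = function(x) grepl('\\.pdf$', x$uri),
  # a compound query: PDF requests made by non-bots
  human_pdfs = function(x) !x$isbot & grepl('\\.pdf$', x$uri)
)

views2 <- function(x, name, defs = view_defs) {
  # fail loudly on an unknown nickname
  if (!name %in% names(defs)) stop('unknown view: ', name)
  x[defs[[name]](x), , drop = FALSE]
}

# Made-up rows for illustration only
demo <- data.frame(
  uri   = c('/a.pdf', '/b.pdf', '/index.html'),
  isbot = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
views2(demo, 'human_pdfs')$uri
```

One reason to consider this: switch() quietly returns NULL when the nickname doesn't match any case, while the version here stops with an error. Whether that trade is worth the extra lines depends on how many views you end up with.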
UPDATE: Yes, I know I’ve included the require(plyr) line even though the demonstration code doesn’t use it. You’re free to run R without plyr loaded. You’re also free to not put on pants.







