Sunday, October 23, 2011

"Webscraping" with R

Here are four cool exercises for learning how to lift data from websites with R.

FIRST, from theBioBucket: "A Little Webscraping Exercise" that involves extracting blog addresses from r-bloggers.com. The code first identifies all "ul" HTML tags, which indicate unordered lists. This list is then screened for the tags that bracket the words "Contributing Blogs." The blog addresses are then extracted and cleaned up using grep(), strsplit(), and unlist().
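To get a feel for the screening step without hitting the live site, here is a minimal sketch in base R. It is not the code from theBioBucket post: the `page` vector is a made-up stand-in for the downloaded HTML lines, and the names `start`, `end`, `block`, and the example.org addresses are all assumptions for illustration.

```r
# Hypothetical stand-in for the lines of a downloaded page
# (the real exercise reads them from r-bloggers.com).
page <- c(
  "<h3>Contributing Blogs</h3>",
  "<ul>",
  "<li><a href=\"http://example.org/blog-a\">Blog A</a></li>",
  "<li><a href=\"http://example.org/blog-b\">Blog B</a></li>",
  "</ul>",
  "<p>Footer text</p>"
)

# Find the line holding the heading we care about.
start <- grep("Contributing Blogs", page, fixed = TRUE)

# Find the first closing </ul> tag that comes after that heading.
ul_close <- grep("</ul>", page, fixed = TRUE)
end <- ul_close[ul_close > start][1]

# Keep only the lines between the heading and the end of its list,
# then keep only the lines that actually contain a link.
block <- page[start:end]
blog_list_1 <- grep("href=", block, value = TRUE, fixed = TRUE)

print(blog_list_1)
```

Using `fixed = TRUE` tells grep() to treat the pattern as plain text rather than a regular expression, which is the safer choice when the pattern is a chunk of HTML.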

The code on the blog is not transparent to a newbie, but you can break the commands into little bits and figure out what's going on. For example, a key line of code is:

blog_list_2 <- unlist(lapply(strsplit(blog_list_1, "\""), "[[", 2))

To figure out what the heck this is doing, just break it up into its nested components.

#The strsplit() command
strsplit(blog_list_1, "\"")

This splits the HTML code for each blog at every double quote, giving a list with one set of pieces per blog.

#the lapply() command
lapply(strsplit(blog_list_1, "\""), "[[", 2)

#example output, in the form of a list
[[274]]
[1] "http://rappster.wordpress.com"

#The unlist() command. (This command literally "un-lists" a list into another form, here a vector.)
unlist(lapply(strsplit(blog_list_1, "\""), "[[", 2))

#example output
[271] "http://yusung.blogspot.com/search/label/R"
[272] "http://www.drewconway.com/zia"
[273] "http://rtricks.wordpress.com"
[274] "http://rappster.wordpress.com"
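If you don't have blog_list_1 on hand, the same three steps can be tried on a toy input. The sketch below assumes each element of blog_list_1 is an HTML list item with the address inside double quotes; the two strings are invented for illustration (only the addresses are borrowed from the output above), and the names `pieces` and `urls_list` are mine, not the original post's.

```r
# Hypothetical stand-in for blog_list_1: one raw HTML list item per blog.
blog_list_1 <- c(
  "<li><a href=\"http://rtricks.wordpress.com\">R Tricks</a></li>",
  "<li><a href=\"http://rappster.wordpress.com\">Rappster</a></li>"
)

# Step 1: split each string at every double quote.
# Each list element now has three pieces: the text before the address,
# the address itself, and the text after it.
pieces <- strsplit(blog_list_1, "\"")

# Step 2: "[[" is the extraction operator used as a function, so this
# pulls the 2nd piece (the address) out of every list element.
urls_list <- lapply(pieces, "[[", 2)

# Step 3: flatten the list of single addresses into a character vector.
blog_list_2 <- unlist(urls_list)

print(blog_list_2)
```

The index 2 only works because the address is the first quoted thing in each string; if a list item had another quoted attribute before href, a different index (or a different approach) would be needed.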


SECOND, from another r-bloggers.com blogger: "How to buy a used car with R", which begins with scraping data from the Kelley Blue Book webpage.

THIRD, this guy's code looks like it extracts every single word from a blog (in this case, his own rather small one) and creates a "word cloud" from it. Also spotted on R-bloggers.com - groovy.

FOURTH (not free, but only $5) is the short e-book "Data mashups in R", which involves scraping web data, converting addresses to GPS coordinates with a Yahoo! map function, working with XML files, and making maps. Unfortunately, Yahoo! has changed some of its tools, and it's not clear (at least to a novice) how to get the authorization to query the Yahoo! map and address database.
