Monday, October 10, 2011

The grep() command for wild card searches

A cool way to search through your data is using the grep() command. grep is a pattern-matching tool in R that lets you do wildcard searches for particular sequences of characters or words (the patterns are written as "regular expressions"). I don't know the lingo or how to use it very well, but it's not too hard to get some usefulness out of this command.

My understanding is that text-processing tools like this are a big part of intelligent computer programming. For data analysis, you could use grep to pick out typos, ID #s that fit a particular pattern, multi-character factor variables, and probably many other creative applications. If you do mark-recapture analysis in program MARK, this could be a useful way to identify capture histories that fit a pattern of interest.
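
As a rough illustration of the ID idea, here is a small sketch. The plant_ids vector and the ID format (two capital letters followed by three digits) are made up for this example and are not from my data.

```r
# Hypothetical ID codes; the assumed format is two capital letters + three digits
plant_ids <- c("AB001", "AB002", "A8003", "CD104", "cd105")

# Regular expression for a well-formed ID:
# ^ = start of string, [[:upper:]]{2} = two capital letters,
# [[:digit:]]{3} = three digits, $ = end of string
id_pattern <- "^[[:upper:]]{2}[[:digit:]]{3}$"

# Positions of the IDs that fit the pattern
good_rows <- grep(id_pattern, plant_ids)

# invert = TRUE returns the positions that do NOT fit, i.e. possible typos
bad_rows <- grep(id_pattern, plant_ids, invert = TRUE)

# Look at the suspicious IDs themselves
plant_ids[bad_rows]
```

Flagged IDs still need to be checked by eye, since the pattern only tests the format and not whether the ID is correct.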

Here's what I've used grep for. In my data set I have plants that were observed each year from 2003 through 2007. I've made a column of data that compiles the "capture history" of each plant: 0 means the plant wasn't there that year and 1 means it was. These plants often spend a year underground in some kind of dormant state, and we're trying to understand the biology of this phenomenon.
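
One way such a column could be built is sketched below. The data frame plants, its column names, and the values in it are all hypothetical; the only thing assumed from my real data is that the history is stored as text with single spaces between the yearly values.

```r
# Hypothetical presence/absence data: one row per plant, one column per year
plants <- data.frame(
  plant_id = c("p1", "p2", "p3"),
  y2003 = c(1, 0, 1),
  y2004 = c(0, 1, 1),
  y2005 = c(1, 1, 0),
  y2006 = c(1, 0, 1),
  y2007 = c(1, 0, 1)
)

# paste() joins the yearly 0/1 values into one string per plant,
# separated by single spaces (the default separator)
plants$capture_history <- paste(plants$y2003, plants$y2004, plants$y2005,
                                plants$y2006, plants$y2007)

plants$capture_history
```

Keeping the separator consistent matters, because a search pattern with single spaces will not match histories written with no spaces or with double spaces.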

I want to search my "capture history" column to identify all the plants that were dormant for one year and then grew at least two years in a row. This is so I can determine how quickly a plant grows in the first few years after it's been dormant. There are many possible combinations that fit my criteria, since the capture histories have 5 entries:
0 1 1 1 1
0 1 1 1 0
0 1 1 0 0
0 1 0 1 1
0 1 1 0 1
and so on.

To find all the plants that meet my criteria, I used the grep() command like this:

pattern_011 <- grep("0 1 1", capture_history)

grep finds every value that contains the pattern "0 1 1", regardless of what comes before the 0 or after the last 1. By default the result is a vector of row numbers (positions in capture_history) for the plants that fit the pattern, not the values themselves.
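
Here is a small sketch of what comes back from the search. The capture_history values below are invented for illustration, and the value = TRUE and grepl() variations are extras that I'm showing for comparison, not something my original analysis used.

```r
# Hypothetical capture histories, stored as space-separated text
capture_history <- c("0 1 1 1 1", "1 1 1 1 1", "1 0 1 1 0",
                     "0 1 0 1 0", "1 1 0 1 1")

# Default: grep() returns the positions (row numbers) of the matches
pattern_011 <- grep("0 1 1", capture_history)
pattern_011

# value = TRUE returns the matching strings themselves
matched_histories <- grep("0 1 1", capture_history, value = TRUE)
matched_histories

# grepl() returns TRUE/FALSE for every element, handy for a flag column
has_011 <- grepl("0 1 1", capture_history)
has_011
```

Note that this pattern also matches a history such as "0 0 1 1 1", where the plant was absent for two years before growing, so it is worth checking whether that counts as one year of dormancy for your question.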
