Tuesday, October 18, 2011

NAs: little black holes in a dataframe


I keep running into problems with converting NAs and working with them in various situations. I'm trying to understand how R treats them so that I can convert them to factors or numbers more reliably.

I often want to think of NAs as literally an "N" and an "A" in a dataframe. They act, however, like some kind of non-object. On page 149 of Software for Data Analysis (2008) Chambers says that NAs are "interpreted as an undefined value suitable for the type of the vector" they reside in. So, NAs have no value, but they do have a "class."
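
To see the "no value, but a class" idea at the console, here is a minimal sketch. The vector x is made up for illustration; NA_integer_, NA_real_ and NA_character_ are the typed NA constants in base R.

```r
# NA is logical by default, but each atomic vector type has its own NA
class(NA)              # "logical"
class(NA_integer_)     # "integer"
class(NA_real_)        # "numeric"
class(NA_character_)   # "character"

# Inside a vector, NA takes on the type of the vector it resides in
x <- c(1.5, NA, 3)     # hypothetical numeric vector
class(x[2])            # "numeric"
is.na(x[2])            # TRUE
```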

The idea that they have no value, however, is significant, since the characters "N" and "A" do have a value and can be identified by equalities (such as "==") and regular expressions (e.g. "grep"). An NA has no value, and so a statement like "ifelse(column.X == NA, ...)", which I am always trying to implement to clear out NAs from a dataframe, will never work, no matter how hard you try: any comparison with NA returns NA rather than TRUE or FALSE. Neither will "ifelse(column.X == "NA", ...)", which only matches the literal string "NA". The test for missing values is "is.na( )".
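
A small sketch of why the comparison fails and what seems to work instead. The vector column.X and the "missing" placeholder are made up for illustration; is.na( ) is the base R test for NA.

```r
# Hypothetical column: one real NA and one literal "NA" string
column.X <- c("a", NA, "NA", "b")

column.X == NA      # NA NA NA NA: comparing with NA always yields NA
column.X == "NA"    # FALSE NA TRUE FALSE: matches only the literal string
is.na(column.X)     # FALSE TRUE FALSE FALSE: finds the real NA

# ifelse() works once the test is is.na() rather than ==
cleaned <- ifelse(is.na(column.X), "missing", column.X)
cleaned             # "a" "missing" "NA" "b"
```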

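I had thought that "as.raw" converts NAs to a literal "N" plus "A", but as.raw(NA) actually returns 00 with a warning; it is "paste( )" that turns an NA into the string "NA". Either way, this is a rather inelegant way to do things.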

The correct way to convert NAs is to use "exclude = NULL" in a call to "factor( )". This will create a factor level called "<NA>". As stated in the R help, "For a numeric x, set exclude=NULL to make NA an extra level (prints as <NA>); by default, this is the last level" (italics mine).
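
A minimal sketch of the difference "exclude = NULL" makes, assuming a made-up numeric vector x:

```r
x <- c(1, 2, NA, 2)                  # hypothetical numeric data

f.default <- factor(x)               # by default NA is not a level
levels(f.default)                    # "1" "2"

f.na <- factor(x, exclude = NULL)    # NA becomes an extra, last level
levels(f.na)                         # "1" "2" NA
nlevels(f.na)                        # 3
table(f.na)                          # the counts now include the NA level
```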

NAs can also be removed from an existing dataframe by creating an index of the NA values and then replacing those values.

For example, you can use which( ) to find the positions of NA values. If "out" is a vector, the indices of NA values are identified via:

idx <- which(is.na(out))

The NA values can be replaced by another value by

out[idx] <- (replacement values).

A complete example would be:

Let's work with a vector called "out" that contains data and NAs.

out <- c(NA, 1.2, 11.3, 0.01, NA, NA, 1.2, NA)

We'll say we want to replace the NAs with the mean of the vector. The mean of the vector is calculated as

out.mean <- mean(out, na.rm = TRUE)

Then, the index values for the NAs are found, as noted above, with

idx <- which(is.na(out))

The object "idx" contains the positions of the NAs

> idx
[1] 1 5 6 8

The NAs can be replaced with the mean with

out[idx] <- out.mean

The object "out" now has no NAs

> out
[1] 3.4275 1.2000 11.3000 0.0100 3.4275 3.4275 1.2000 3.4275

You can also replace the NAs with the replace( ) command:
out <- replace(out, idx, out.mean)
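
The same index-and-replace pattern carries over to a column of a dataframe. A minimal sketch, assuming a made-up dataframe df with columns id and score:

```r
# Hypothetical dataframe; the column names are made up for illustration
df <- data.frame(id = 1:5, score = c(2, NA, 4, NA, 6))

idx <- which(is.na(df$score))                    # positions 2 and 4
df$score[idx] <- mean(df$score, na.rm = TRUE)    # mean of 2, 4, 6 is 4

df$score                                         # 2 4 4 4 6
any(is.na(df))                                   # FALSE
```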



NA vs NaN
On pages 194 ff., Chambers points out the difference between NA and NaN: NaN ("not a number") arises during numerical calculations whose result is undefined, such as 0/0.
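
A short sketch of how the two behave at the console. The vector v is made up for illustration; is.nan( ) and is.na( ) are base R functions.

```r
0 / 0                  # NaN
Inf - Inf              # NaN

v <- c(1, NA, 0 / 0)   # hypothetical vector holding both an NA and a NaN
is.na(v)               # FALSE TRUE TRUE: is.na() treats NaN as missing too
is.nan(v)              # FALSE FALSE TRUE: is.nan() picks out only the NaN
```

So is.na( ) catches both kinds of value, while is.nan( ) can be used to tell them apart.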

Additional information on NA:
factor( ) on CRAN
