r - Fastest way to map a new data frame column based on two other columns -

- April 15, 2012

i have data frame looks this:

id|value 01| 100 01| 101 01| 300 #edited case left out 02| 300 03| 100 03| 101 04| 100

and add new column looks @ both id , values assigned each id.

for example: if id has both value 100 , 101 add category a. if id has value of 300 add category b. if id has 1 value (either 100 or 101, not both) assign category c.

result:

id|value|category 01| 100 |  01| 101 |  01| 300 |  b #edited case left out 02| 300 |  b 03| 100 |  03| 101 |  04| 100 |  c

i understand can loop through , assign category, question whether there faster vectorized way?

a couple of options data.table

we number of elements per 'id' '100', '101' , add them together. output 0, 1, or 2 corresponding none, single element, or both present. can converted factor , change labels 'a' '2', 'b' '0' , 'c' '1'.

library(data.table) setdt(df2)[, indx:=sum(unique(value)==100)+sum(unique(value)==101),    id][, category:=factor(indx, levels=c(2,0,1), labels=letters[1:3]) ][,    indx:=null][] #    id value category #1:  1   100        #2:  1   101        #3:  2   300        b #4:  3   100        #5:  3   101        #6:  4   100        c

or create named vector ('v1') , use index map character elements (tostring(...)) grouped 'id'.

v1 <- c('100, 101' = 'a', '300'='b', '100'= 'c', '101'='c') setdt(df2)[, category := v1[tostring(sort(unique(value)))], by=id][] #    id value category #1:  1   100        #2:  1   101        #3:  2   300        b #4:  3   100        #5:  3   101        #6:  4   100        c

update

based on new dataset , new condition, can modify first solution

 setdt(df3)[, indx:= sum(unique(value)==100) + sum(unique(value)==101), id][,   category:= factor(indx, levels=c(2,0,1), labels=letters[1:3])][  value==300, category:='b'][, indx:=null][]  #    id value category  #1:  1   100         #2:  1   101         #3:  1   300        b  #4:  2   300        b  #5:  3   100         #6:  3   101         #7:  4   100        c

or using second option

  v1 <- c('100, 101' = 'a', '100, 101, 300' = 'a', '300'='b',             '100'= 'c', '101'='c')   setdt(df3)[, category := v1[tostring(sort(unique(value)))],                  by=id][value==300, category := 'b'][]   #   id value category   #1:  1   100          #2:  1   101          #3:  1   300        b   #4:  2   300        b   #5:  3   100          #6:  3   101          #7:  4   100        c

data

df2 <- structure(list(id = c(1l, 1l, 2l, 3l, 3l, 4l), value = c(100l,  101l, 300l, 100l, 101l, 100l)), .names = c("id", "value"),  row.names = c(na, -6l), class = "data.frame")  df3 <- structure(list(id = c(1l, 1l, 1l, 2l, 3l, 3l, 4l),  value = c(100l, 101l, 300l, 300l, 100l, 101l, 100l)), .names = c("id", "value"), class = "data.frame",  row.names = c(na, -7l))

Search This Blog

Migrate