r - Fastest way to map a new data frame column based on two other columns -
i have data frame looks this:
id|value 01| 100 01| 101 01| 300 #edited case left out 02| 300 03| 100 03| 101 04| 100   and add new column looks @ both id , values assigned each id.
for example: if id has both value 100 , 101 add category a. if id has value of 300 add category b. if id has 1 value (either 100 or 101, not both) assign category c.
result:
id|value|category 01| 100 |  01| 101 |  01| 300 |  b #edited case left out 02| 300 |  b 03| 100 |  03| 101 |  04| 100 |  c   i understand can loop through , assign category, question whether there faster vectorized way?
a couple of options data.table
we number of elements per 'id' '100', '101' , add them together.  output 0, 1, or 2 corresponding none, single element, or both present.  can converted factor , change labels 'a' '2', 'b' '0' , 'c' '1'.
library(data.table) setdt(df2)[, indx:=sum(unique(value)==100)+sum(unique(value)==101),    id][, category:=factor(indx, levels=c(2,0,1), labels=letters[1:3]) ][,    indx:=null][] #    id value category #1:  1   100        #2:  1   101        #3:  2   300        b #4:  3   100        #5:  3   101        #6:  4   100        c   or create named vector ('v1') , use index map character elements (tostring(...)) grouped 'id'.
v1 <- c('100, 101' = 'a', '300'='b', '100'= 'c', '101'='c') setdt(df2)[, category := v1[tostring(sort(unique(value)))], by=id][] #    id value category #1:  1   100        #2:  1   101        #3:  2   300        b #4:  3   100        #5:  3   101        #6:  4   100        c   update
based on new dataset , new condition, can modify first solution
 setdt(df3)[, indx:= sum(unique(value)==100) + sum(unique(value)==101), id][,   category:= factor(indx, levels=c(2,0,1), labels=letters[1:3])][  value==300, category:='b'][, indx:=null][]  #    id value category  #1:  1   100         #2:  1   101         #3:  1   300        b  #4:  2   300        b  #5:  3   100         #6:  3   101         #7:  4   100        c   or using second option
  v1 <- c('100, 101' = 'a', '100, 101, 300' = 'a', '300'='b',             '100'= 'c', '101'='c')   setdt(df3)[, category := v1[tostring(sort(unique(value)))],                  by=id][value==300, category := 'b'][]   #   id value category   #1:  1   100          #2:  1   101          #3:  1   300        b   #4:  2   300        b   #5:  3   100          #6:  3   101          #7:  4   100        c   data
df2 <- structure(list(id = c(1l, 1l, 2l, 3l, 3l, 4l), value = c(100l,  101l, 300l, 100l, 101l, 100l)), .names = c("id", "value"),  row.names = c(na, -6l), class = "data.frame")  df3 <- structure(list(id = c(1l, 1l, 1l, 2l, 3l, 3l, 4l),  value = c(100l, 101l, 300l, 300l, 100l, 101l, 100l)), .names = c("id", "value"), class = "data.frame",  row.names = c(na, -7l))      
Comments
Post a Comment