r - Generate bins from a data frame -

March 15, 2013

Using Python, I have created the following data frame, which contains similarity values:

      cosinfcolor cosinedge cosintexture histofcolor histoedge histotexture    jaccard
    1       0.770     0.489        0.388  0.57500000 0.5845137    0.3920000 0.00000000
    2       0.067     0.496        0.912  0.13865546 0.6147309    0.6984127 0.00000000
    3       0.514     0.426        0.692  0.36440678 0.4787535    0.5198413 0.05882353
    4       0.102     0.430        0.739  0.11297071 0.5288008    0.5436508 0.00000000
    5       0.560     0.735        0.554  0.48148148 0.8168083    0.4603175 0.00000000
    6       0.029     0.302        0.558  0.08547009 0.3928234    0.4603175 0.00000000

I am trying to write an R script that generates another data frame reflecting the bins, where binning only applies if the value is above 0.5, such that:

Pseudocode:

    if      (cosinfcolor > 0.5 & cosinfcolor <= 0.6)    bin = 1
    else if (cosinfcolor > 0.6 & cosinfcolor <= 0.7)    bin = 2
    else if (cosinfcolor > 0.7 & cosinfcolor <= 0.8)    bin = 3
    else if (cosinfcolor > 0.8 & cosinfcolor <= 0.9)    bin = 4
    else if (cosinfcolor > 0.9 & cosinfcolor <= 1.0)    bin = 5
    else                                                bin = 0

Based on the above logic, I want to build a data frame like this:

      cosinfcolor cosinedge cosintexture histofcolor histoedge histotexture jaccard
    1           3         0            0           1         1            0       0

How can I start this script, or should I do it in Python? I am trying to get familiar with R after finding out how powerful it is and the number of machine learning packages it has. My goal is to build a classifier, but first I need to get familiar with R :)

Another cut answer, one that takes the extrema into account:

    dat <- read.table("clipboard", header=TRUE)

    # Bins: (-Inf,0.5] -> 0, (0.5,0.6] -> 1, ..., (0.9,1] -> 5, (1,Inf] -> 6
    cuts <- apply(dat, 2, cut, c(-Inf, seq(0.5, 1, 0.1), Inf), labels=0:6)
    cuts[cuts=="6"] <- "0"   # anything above 1 goes back into bin 0
    cuts <- as.data.frame(cuts)

      cosinfcolor cosinedge cosintexture histofcolor histoedge histotexture jaccard
    1           3         0            0           1         1            0       0
    2           0         0            5           0         2            2       0
    3           1         0            2           0         0            1       0
    4           0         0            3           0         1            1       0
    5           1         3            1           0         4            0       0
    6           0         0            1           0         0            0       0
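Since reading from the clipboard is hard to reproduce, here is a self-contained sketch of the same binning with the question's six rows typed in by hand. It is a variation, not the code above: it swaps apply for lapply so the bins come back as integers in a data frame rather than as strings, and the names breaks, bin_column and bins are made up for this illustration.

```r
# The six rows from the question, entered inline instead of read from the clipboard.
dat <- data.frame(
  cosinfcolor  = c(0.770, 0.067, 0.514, 0.102, 0.560, 0.029),
  cosinedge    = c(0.489, 0.496, 0.426, 0.430, 0.735, 0.302),
  cosintexture = c(0.388, 0.912, 0.692, 0.739, 0.554, 0.558),
  histofcolor  = c(0.57500000, 0.13865546, 0.36440678,
                   0.11297071, 0.48148148, 0.08547009),
  histoedge    = c(0.5845137, 0.6147309, 0.4787535,
                   0.5288008, 0.8168083, 0.3928234),
  histotexture = c(0.3920000, 0.6984127, 0.5198413,
                   0.5436508, 0.4603175, 0.4603175),
  jaccard      = c(0, 0, 0.05882353, 0, 0, 0)
)

# Same break points as the answer: -Inf, 0.5, 0.6, ..., 1.0, Inf
breaks <- c(-Inf, seq(0.5, 1, 0.1), Inf)

# Hypothetical helper: bin one numeric column and return integer bin numbers.
bin_column <- function(x) {
  b <- as.integer(as.character(cut(x, breaks, labels = 0:6)))
  b[b == 6] <- 0L   # values above 1 fall back to bin 0
  b
}

# Assigning into bins[] keeps the data frame shape and column names.
bins <- dat
bins[] <- lapply(dat, bin_column)
print(bins)
```

The first row should match the target in the question (3 0 0 1 1 0 0). Whether you prefer strings, factors or integers for the bins depends on what the classifier you build later expects.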

explanation

The cut function splits a vector into bins depending on the cuts you specify. Let's take 1:10 and split it at 3, 5 and 7.

    cut(1:10, c(3, 5, 7))

     [1] <NA>  <NA>  <NA>  (3,5] (3,5] (5,7] (5,7] <NA>  <NA>  <NA>
    Levels: (3,5] (5,7]

You can see how it has made factor levels in between the breaks. Notice that it doesn't include 3 (there's an include.lowest argument that will include it). But these are terrible names for groups, so let's call them group 1 and 2.
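A quick sketch of what include.lowest changes, using the same breaks (the variable names default_bins and closed_bins are just for illustration):

```r
# Default: every interval is open on the left, so 3 lands in no bin.
default_bins <- cut(1:10, c(3, 5, 7))

# include.lowest = TRUE closes the first interval on the left, so 3 is kept.
closed_bins <- cut(1:10, c(3, 5, 7), include.lowest = TRUE)

print(default_bins[3])      # NA: 3 is outside (3,5]
print(closed_bins[3])       # 3 now falls in the first bin
print(levels(closed_bins))  # the first level is written with a square bracket
```

Only the lowest break is affected; 1 and 2 are still below every interval and stay NA.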

    cut(1:10, c(3, 5, 7), labels=1:2)

     [1] <NA> <NA> <NA> 1    1    2    2    <NA> <NA> <NA>
    Levels: 1 2

Better, but what's with the NAs? They are outside our boundaries and so are not counted. To count them, in my solution, I added -Infinity and Infinity, so that all points would be included. Notice that as we have more breaks, we'll need more labels:

    x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
    x

     [1] 1 1 1 2 2 3 3 4 4 4
    Levels: 1 2 3 4

OK, but we didn't want 4 (as per your problem). We wanted all the 4s to be in group 1. So let's get rid of the entries labelled '4'.

    x[x=="4"] <- "1"
    x

     [1] 1 1 1 2 2 3 3 1 1 1
    Levels: 1 2 3 4

This is slightly different to what I did before: notice that earlier I took away all the last labels at the end, after binning everything. I've done it this way here so you can better see how cut works.
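Note that the level "4" is still listed even though nothing uses it any more. If that bothers you, here is a sketch of two ways to tidy it up in base R: droplevels removes unused levels, and assigning duplicate names to levels() merges levels in one step. The second variable, y, exists only for this comparison.

```r
# Route 1: move the 4s into group 1, then drop the now-empty level.
x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels = 1:4)
x[x == "4"] <- "1"    # "1" is an existing level, so this assignment is valid
x <- droplevels(x)    # removes the unused level "4"

# Route 2: merge level 4 into level 1 by giving both the same name.
y <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels = 1:4)
levels(y) <- c("1", "2", "3", "1")

print(levels(x))
print(levels(y))
```

Both routes end with the same values and the three levels 1, 2 and 3.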

OK, now the apply function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what the second argument of apply controls: 1 applies the function to all rows, 2 applies it to all columns. So we apply the cut function to each column of your data frame. Everything after cut in the apply call is just the arguments to cut, which we discussed above.
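To make the row/column distinction concrete, here is a small sketch on a toy matrix. The matrix m and the labels "low", "mid" and "high" are invented for this illustration and have nothing to do with your data.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns: columns are (1,2), (3,4), (5,6)

row_sums <- apply(m, 1, sum)   # margin 1: one result per row
col_sums <- apply(m, 2, sum)   # margin 2: one result per column

# Anything after the function name is passed on to that function, here to cut().
# apply() coerces the factor results, so binned is a character matrix.
binned <- apply(m, 2, cut, c(0, 2, 4, 6), labels = c("low", "mid", "high"))

print(row_sums)
print(col_sums)
print(binned)
```

The coercion to a character matrix is why the solution compares against the string "6" rather than the number 6.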

Hope that helps.
