R - Generate bins from a data frame
Using Python, I have created the following data frame, which contains similarity values:

      cosinfcolor cosinedge cosintexture histofcolor histoedge histotexture    jaccard
    1       0.770     0.489        0.388  0.57500000 0.5845137    0.3920000 0.00000000
    2       0.067     0.496        0.912  0.13865546 0.6147309    0.6984127 0.00000000
    3       0.514     0.426        0.692  0.36440678 0.4787535    0.5198413 0.05882353
    4       0.102     0.430        0.739  0.11297071 0.5288008    0.5436508 0.00000000
    5       0.560     0.735        0.554  0.48148148 0.8168083    0.4603175 0.00000000
    6       0.029     0.302        0.558  0.08547009 0.3928234    0.4603175 0.00000000

I am trying to write an R script that generates another data frame reflecting the bins. Binning applies only if the value is above 0.5, as in this pseudocode:
    if      (cosinfcolor > 0.5 & cosinfcolor <= 0.6) bin = 1
    else if (cosinfcolor > 0.6 & cosinfcolor <= 0.7) bin = 2
    else if (cosinfcolor > 0.7 & cosinfcolor <= 0.8) bin = 3
    else if (cosinfcolor > 0.8 & cosinfcolor <= 0.9) bin = 4
    else if (cosinfcolor > 0.9 & cosinfcolor <= 1.0) bin = 5
    else bin = 0

Based on the above logic, I want to build a data frame like this:

      cosinfcolor cosinedge cosintexture histofcolor histoedge histotexture jaccard
    1           3         0            0           1         1            0       0

How can I start this script, or should I do this in Python? I am trying to get familiar with R after finding out how powerful it is and how many machine learning packages it has. My goal is to build a classifier, but first I need to get familiar with R :)
Another cut answer that takes the extrema into account:
    dat <- read.table("clipboard", header=TRUE)
    cuts <- apply(dat, 2, cut, c(-Inf, seq(0.5, 1, 0.1), Inf), labels=0:6)
    cuts[cuts=="6"] <- "0"
    cuts <- as.data.frame(cuts)

      cosinfcolor cosinedge cosintexture histofcolor histoedge histotexture jaccard
    1           3         0            0           1         1            0       0
    2           0         0            5           0         2            2       0
    3           1         0            2           0         0            1       0
    4           0         0            3           0         1            1       0
    5           1         3            1           0         4            0       0
    6           0         0            1           0         0            0       0

Explanation
The cut function splits a vector into bins depending on the cut points you specify. Let's take 1:10 and split it at 3, 5 and 7.
    cut(1:10, c(3, 5, 7))
     [1] <NA>  <NA>  <NA>  (3,5] (3,5] (5,7] (5,7] <NA>  <NA>  <NA>
    Levels: (3,5] (5,7]

You can see how it has made factor levels in between the breaks. Notice that it doesn't include 3 (there's an include.lowest argument that will include it). But these are terrible names for groups, so let's call them group 1 and 2.
    cut(1:10, c(3, 5, 7), labels=1:2)
     [1] <NA> <NA> <NA> 1    1    2    2    <NA> <NA> <NA>
    Levels: 1 2

Better, but what's with the NAs? They are outside our boundaries and so are not counted. To count them, in my solution I added -Inf and Inf, so that all points would be included. Notice that as we have more breaks, we'll need more labels:
    x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
    x
     [1] 1 1 1 2 2 3 3 4 4 4
    Levels: 1 2 3 4

OK, but we didn't want 4 (as per your problem). We wanted the 4s to be in group 1. So let's get rid of the entries labelled '4'.
    x[x=="4"] <- "1"
    x
     [1] 1 1 1 2 2 3 3 1 1 1
    Levels: 1 2 3 4

This is slightly different from what I did before: there I relabelled the last group at the very end, after binning every column, but I've done it this way here so you can better see how cut works.
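Two small follow-ups to the walkthrough above, offered only as a sketch: the include.lowest argument mentioned earlier closes the first interval on the left so that 3 gets binned, and droplevels() removes the level "4" that is still listed even though no entry uses it any more. The variable name y is just an illustrative choice.

```r
# include.lowest = TRUE makes the first interval [3,5] instead of (3,5],
# so the value 3 is no longer NA.
y <- cut(1:10, c(3, 5, 7), include.lowest = TRUE)
levels(y)

# Recode group 4 into group 1, then drop the now-empty level "4".
x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels = 1:4)
x[x == "4"] <- "1"
x <- droplevels(x)
levels(x)
```

Dropping the unused level is optional; it mainly matters if the factor is later tabulated or passed to a model, where an empty level would otherwise still show up.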
OK, the apply function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what the second argument of apply does. 1 applies the function to all rows, 2 applies it to all columns. So we apply the cut function to each column of your data frame. Everything after cut in the apply call is just an argument to cut, as discussed above.
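One thing to be aware of is that apply coerces factor results to a character array, so the bins in the solution above end up as text rather than numbers. If integer columns are wanted, a possible variation is to use lapply over the columns instead. The sketch below assumes that goal; bin_column is a hypothetical helper name, and the small data frame only repeats two columns from the question so the example runs without the clipboard.

```r
# Two columns copied from the question's data, so the example is self-contained.
dat <- data.frame(
  cosinfcolor  = c(0.770, 0.067, 0.514, 0.102, 0.560, 0.029),
  cosintexture = c(0.388, 0.912, 0.692, 0.739, 0.554, 0.558)
)

# Hypothetical helper: bin one numeric vector with the same breaks as above.
bin_column <- function(v) {
  f <- cut(v, c(-Inf, seq(0.5, 1, 0.1), Inf), labels = 0:6)
  b <- as.integer(as.character(f))  # factor labels -> integers
  b[b == 6L] <- 0L                  # values above 1 fall back to bin 0
  b
}

# lapply works column by column and keeps the column names.
bins <- as.data.frame(lapply(dat, bin_column))
bins
```

Converting with as.integer(as.character(f)) rather than as.integer(f) matters here: the latter would return the internal level codes 1 to 7 instead of the labels 0 to 6.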
Hope that helps.