Spark MLlib: Including categorical features

March 15, 2010

What is the correct or best method for including categorical variables (both string and int) as features in an MLlib algorithm?

Is it correct to use OneHotEncoders on the categorical variables and then include the output columns alongside the other columns in a VectorAssembler, as in the code below?

The reason I ask is that I end up with data frame rows that look like this, where feature3 and feature4 seem to be combined on the same 'level' of importance as each of the two categorical features individually:

+------------------+----------+----------------------------------------+
|prediction        |actualval |features                                |
+------------------+----------+----------------------------------------+
|355416.44924898935|990000.0  |(17,[0,1,2,3,4,5,10,15],[1.0,206.0])    |
|358917.32988024893|210000.0  |(17,[0,1,2,3,4,5,10,15,16],[1.0,172.0]) |
|291313.84175674635|4600000.0 |(17,[0,1,2,3,4,5,12,15,16],[1.0,239.0]) |

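Here is the code:

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.types.DoubleType

// Map the string-valued categorical column to numeric category indices.
val indexer = new StringIndexer()
  .setInputCol("stringfeaturecode")
  .setOutputCol("stringfeaturecodeindex")
  .fit(data)
val indexed = indexer.transform(data)

// One-hot encode the indexed string feature.
// Note: this uses OneHotEncoder as a plain Transformer (older Spark API).
// In Spark 3.x it is an Estimator, so call .fit(df).transform(df) instead.
val encoder = new OneHotEncoder()
  .setInputCol("stringfeaturecodeindex")
  .setOutputCol("stringfeaturecodevec")

var encoded = encoder.transform(indexed)

// Cast the integer categorical column to double, keeping its original name.
encoded = encoded.withColumn("intfeaturecodetmp", encoded.col("intfeaturecode")
  .cast(DoubleType))
  .drop("intfeaturecode")
  .withColumnRenamed("intfeaturecodetmp", "intfeaturecode")

// One-hot encode the integer categorical feature.
val intFeatureCodeEncoder = new OneHotEncoder()
  .setInputCol("intfeaturecode")
  .setOutputCol("intfeaturecodevec")

encoded = intFeatureCodeEncoder.transform(encoded)

// Concatenate both one-hot vectors and the two numeric columns into one feature vector.
val assemblerDeparture =
  new VectorAssembler()
    .setInputCols(
      Array("stringfeaturecodevec", "intfeaturecodevec", "feature3", "feature4"))
    .setOutputCol("features")
val data2 = assemblerDeparture.transform(encoded)

val Array(trainingData, testData) = data2.randomSplit(Array(0.7, 0.3))

val rf = new RandomForestRegressor()
  .setLabelCol("actualval")
  .setFeaturesCol("features")
  .setNumTrees(100)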

In general, this is the recommended method.
When working with tree models, however, one-hot encoding is unnecessary and should be avoided. You can use StringIndexer only.
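
As a rough illustration of the StringIndexer-only approach for tree models, here is a minimal sketch. It assumes a local SparkSession and a small made-up data set; the toy rows, the object name TreeCategoricalSketch, the "intfeaturecodeindex" output column and the maxBins value are illustrative assumptions, not part of the original question. It relies on two documented behaviours: StringIndexer casts numeric input columns to strings before indexing, and the indexed columns carry categorical metadata that VectorAssembler passes through to the tree learner.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.SparkSession

object TreeCategoricalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tree-categorical-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy rows; only the column names follow the question.
    val data = Seq(
      ("a", 1, 10.0, 206.0, 990000.0),
      ("b", 2, 12.0, 172.0, 210000.0),
      ("a", 3, 11.0, 239.0, 4600000.0),
      ("c", 1, 9.0, 150.0, 350000.0)
    ).toDF("stringfeaturecode", "intfeaturecode", "feature3", "feature4", "actualval")

    // Index the string category; no one-hot encoding afterwards.
    val stringIndexer = new StringIndexer()
      .setInputCol("stringfeaturecode")
      .setOutputCol("stringfeaturecodeindex")

    // StringIndexer also accepts a numeric column: values are cast to strings, then indexed.
    val intIndexer = new StringIndexer()
      .setInputCol("intfeaturecode")
      .setOutputCol("intfeaturecodeindex")

    // Assemble the two index columns with the numeric features.
    // The nominal metadata attached by StringIndexer is kept in the assembled vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("stringfeaturecodeindex", "intfeaturecodeindex", "feature3", "feature4"))
      .setOutputCol("features")

    // maxBins must be at least the number of categories of any categorical feature.
    val rf = new RandomForestRegressor()
      .setLabelCol("actualval")
      .setFeaturesCol("features")
      .setNumTrees(100)
      .setMaxBins(32)

    val pipeline = new Pipeline()
      .setStages(Array(stringIndexer, intIndexer, assembler, rf))

    val model = pipeline.fit(data)
    model.transform(data).select("prediction", "actualval", "features").show(false)

    spark.stop()
  }
}
```

With this layout each categorical variable occupies a single slot in the feature vector, so the forest can split on it as one feature rather than on many separate 0/1 columns. If a categorical column has more than 32 distinct values, maxBins would need to be raised accordingly.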
