Spark MLlib: Including categorical features
What is the correct or best method for including categorical variables (both string and int) as features in an MLlib algorithm?
Is it correct to use OneHotEncoders on the categorical variables and then include the output columns with the other columns in a VectorAssembler, as in the code below?
The reason I ask is that I end up with a data frame whose rows look like the ones below, where feature3 and feature4 appear to be combined on the same 'level' of importance as the 2 categorical features individually.
+------------------+---------+---------------------------------------+
|prediction        |actualval|features                               |
+------------------+---------+---------------------------------------+
|355416.44924898935|990000.0 |(17,[0,1,2,3,4,5,10,15],[1.0,206.0])   |
|358917.32988024893|210000.0 |(17,[0,1,2,3,4,5,10,15,16],[1.0,172.0])|
|291313.84175674635|4600000.0|(17,[0,1,2,3,4,5,12,15,16],[1.0,239.0])|

Here is the code:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.types.DoubleType

// Map the string category to a numeric index.
val indexer = new StringIndexer()
  .setInputCol("stringfeaturecode")
  .setOutputCol("stringfeaturecodeindex")
  .fit(data)
val indexed = indexer.transform(data)

// One-hot encode the indexed string category.
// Note: in Spark 3.x OneHotEncoder is an estimator, so call .fit(...) before .transform(...).
val encoder = new OneHotEncoder()
  .setInputCol("stringfeaturecodeindex")
  .setOutputCol("stringfeaturecodevec")
var encoded = encoder.transform(indexed)

// Cast the int category to double so it can be one-hot encoded.
encoded = encoded
  .withColumn("intfeaturecodetmp", encoded.col("intfeaturecode").cast(DoubleType))
  .drop("intfeaturecode")
  .withColumnRenamed("intfeaturecodetmp", "intfeaturecode")

val intFeatureCodeEncoder = new OneHotEncoder()
  .setInputCol("intfeaturecode")
  .setOutputCol("intfeaturecodevec")
encoded = intFeatureCodeEncoder.transform(encoded)

// Combine both one-hot vectors and the numeric columns into one feature vector.
val assemblerDeparture = new VectorAssembler()
  .setInputCols(Array("stringfeaturecodevec", "intfeaturecodevec", "feature3", "feature4"))
  .setOutputCol("features")
val data2 = assemblerDeparture.transform(encoded)

val Array(trainingData, testData) = data2.randomSplit(Array(0.7, 0.3))

val rf = new RandomForestRegressor()
  .setLabelCol("actualval")
  .setFeaturesCol("features")
  .setNumTrees(100)
- In general, this is a recommended method.
- When working with tree models it is unnecessary and should be avoided. You can use StringIndexer only.
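
The StringIndexer-only approach for tree models could look roughly like the sketch below. It is a minimal illustration, not the asker's actual pipeline: the toy rows, the `tree-categorical-sketch` app name, the `*index` output column names, and the reduced tree count are assumptions made for the example. It relies on StringIndexer marking its output column as nominal in the column metadata, which VectorAssembler carries into the feature vector so that the tree learner can treat those slots as categorical rather than continuous.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell this returns the existing session.
val spark = SparkSession.builder()
  .appName("tree-categorical-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy rows, not the data from the question.
val data = Seq(
  ("a", 1, 10.0, 0.5, 100.0),
  ("b", 2, 20.0, 1.5, 200.0),
  ("a", 3, 30.0, 2.5, 300.0),
  ("c", 1, 40.0, 3.5, 400.0),
  ("b", 2, 50.0, 4.5, 500.0),
  ("c", 3, 60.0, 5.5, 600.0)
).toDF("stringfeaturecode", "intfeaturecode", "feature3", "feature4", "actualval")

// StringIndexer maps each category to a double index and tags the output
// column as nominal. Numeric input columns are cast to string before indexing.
val stringIndexer = new StringIndexer()
  .setInputCol("stringfeaturecode")
  .setOutputCol("stringfeaturecodeindex")
val intIndexer = new StringIndexer()
  .setInputCol("intfeaturecode")
  .setOutputCol("intfeaturecodeindex")

// No OneHotEncoder stage: the indexed columns go straight into the feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("stringfeaturecodeindex", "intfeaturecodeindex", "feature3", "feature4"))
  .setOutputCol("features")

val rf = new RandomForestRegressor()
  .setLabelCol("actualval")
  .setFeaturesCol("features")
  .setNumTrees(10) // kept small for the toy data

val pipeline = new Pipeline().setStages(Array(stringIndexer, intIndexer, assembler, rf))
val model = pipeline.fit(data)
val predictions = model.transform(data)
predictions.select("prediction", "actualval", "features").show(false)
```

One thing to watch with this approach: the tree learners require `maxBins` (default 32) to be at least as large as the number of distinct values in any categorical feature, so a high-cardinality column may need `setMaxBins` raised accordingly.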