Spark Pipeline VectorAssembler: drop other columns?


A Spark VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) produces the following output:

id | hour | mobile | userFeatures     | clicked | features
----|------|--------|------------------|---------|-----------------------------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]

As you can see, the last column contains all the previous features. Is it better / more performant if the other columns are removed, e.g. only the label/id and features are retained? Or is that unnecessary overhead, and is just feeding the label/id and features into the estimator enough?

What happens when the VectorAssembler is used in a pipeline? Will only the last features column be used, or will it introduce collinearity (duplicate columns) if the original columns are not removed manually?

Please read the documentation. Every classifier is parametrized with a features column (featuresCol). It doesn't consider any other column or the order of the columns.

