Spark pipeline: VectorAssembler and dropping other columns
A Spark VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) produces the following output:

 id | hour | mobile | userFeatures     | clicked | features
----|------|--------|------------------|---------|-----------------------------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]

As you can see, the last column contains all of the previous feature columns. Is it better or more performant to remove the other columns, so that only the label/id and features are retained? Or is that unnecessary overhead, and is it enough to feed only the label/id and features to the estimator?
What happens when a VectorAssembler is used in a pipeline? Is only the last features column used, or does it introduce collinearity (duplicate columns) if the original columns are not removed manually?
Please read the documentation. Every classifier is parametrized by a features column (featuresCol). It does not consider any other column, or the order of the columns.