scala - xgboost4j - Spark evaluate requires RDD[(Double, Double)]


I am trying to use xgboost4j with Spark 2.0.1 and the Dataset API. So far I have obtained predictions in the following format by using model.transform(testdata):

predictions.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- probabilities: vector (nullable = true)
 |-- prediction: double (nullable = true)

+-----+--------------------+--------------------+----------+
|label|            features|       probabilities|prediction|
+-----+--------------------+--------------------+----------+
|  0.0|[0.0,1.0,0.0,476....|[0.96766251325607...|       0.0|
|  0.0|[0.0,1.0,0.0,642....|[0.99599152803421...|       0.0|

But now I want to generate evaluation metrics. How can I map the predictions to the right format? The question "XGBoost-4j by DMLC on Spark-1.6.1" proposes a solution to a similar problem, but it does not work for me.

val metrics = new BinaryClassificationMetrics(predictions.select("prediction", "label").rdd)

requires an RDD[(Double, Double)] instead of the result of predictions.select("prediction", "label"), which looks like:

root
 |-- label: double (nullable = true)
 |-- prediction: double (nullable = true)

Trying to map it to the required tuple like this:

predictions.select("prediction", "label").map{case Row(_) => (_,_)}

fails to work as well.
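One way the mapping could be written is sketched below. It assumes that both selected columns are plain doubles, as in the schema above, and reads them from each Row by position. The object name ScoreAndLabelsSketch, the helper toScoreAndLabels and the toy DataFrame are hypothetical stand-ins for the real output of model.transform(testdata), not part of xgboost4j or Spark.

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object ScoreAndLabelsSketch {

  // Hypothetical helper: converts the two Double columns into the
  // RDD[(Double, Double)] of (score, label) pairs that
  // BinaryClassificationMetrics expects.
  def toScoreAndLabels(predictions: DataFrame): RDD[(Double, Double)] =
    predictions
      .select("prediction", "label")
      .rdd // RDD[Row]: the rows are still untyped at this point
      .map(row => (row.getDouble(0), row.getDouble(1)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("score-and-labels-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data (assumption) standing in for model.transform(testdata).
    val predictions = Seq((0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0))
      .toDF("label", "prediction")

    val metrics = new BinaryClassificationMetrics(toScoreAndLabels(predictions))
    println(s"Area under ROC: ${metrics.areaUnderROC()}")

    spark.stop()
  }
}
```

With import spark.implicits._ in scope, predictions.select("prediction", "label").as[(Double, Double)].rdd should give the same tuples through the Dataset API instead of positional Row access.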

Edit

Reading a bit more in Spark's documentation, I found that http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.binaryclassificationevaluator (BinaryClassificationEvaluator) supports ml instead of MLlib, i.e. Datasets. So far I could not integrate xgboost4j into a pipeline.
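A minimal sketch of the evaluator route follows. It feeds the Double prediction column in as the raw prediction, on the assumption that the Spark version in use accepts a Double column there (a vector column is the other accepted type). The object name EvaluatorSketch, the helper areaUnderRoc and the toy data are hypothetical.

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.{DataFrame, SparkSession}

object EvaluatorSketch {

  // Hypothetical helper: computes the area under the ROC curve directly
  // on the DataFrame, without converting it to an RDD of tuples first.
  def areaUnderRoc(predictions: DataFrame): Double =
    new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("prediction") // assumption: a Double column is accepted
      .setMetricName("areaUnderROC")
      .evaluate(predictions)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("evaluator-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data (assumption) standing in for model.transform(testdata).
    val predictions = Seq((0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0))
      .toDF("label", "prediction")

    println(s"Area under ROC: ${areaUnderRoc(predictions)}")

    spark.stop()
  }
}
```

Scoring with hard 0/1 predictions yields a very coarse ROC curve. If the probabilities column turns out to be a compatible vector type, using it as the raw prediction column may give a more informative metric, but that has not been checked here.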

Here is an example, https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/sparkmodeltuningtool.scala, of how to use xgboost4j in a Spark pipeline. In fact, it has an XGBoostEstimator that plays in a pipeline.

