Apache Spark - Reducing a DataFrame to the most frequent combinations of two columns

May 15, 2015

I have a JSON file that I import using the following code:

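from pyspark import SparkConf
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session and load the JSON file
spark = SparkSession.builder.master("local").appName('gps').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("sensordata.json")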

The resulting DataFrame looks similar to this:

+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  1|
|  2|  3|
|  1|  2|
|  3|  1|
|  1|  2|
|  2|  1|
|  1|  3|
|  1|  2|
+---+---+

My task is to use PySpark to reduce the data to the most frequent combinations of the two columns (a, b).

So the desired output is this:

+---+---+-----+
|  a|  b|count|
+---+---+-----+
|  1|  2|    3|
|  2|  1|    2|
+---+---+-----+

You can use a combination of groupBy, count, sort, and limit:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName('gps').config(conf=SparkConf()).getOrCreate()
df = spark.read.json("sensordata.json")

# Count each (a, b) combination, sort by count descending, keep the top 2
(df.groupBy("a", "b")
   .count()
   .sort("count", ascending=False)
   .limit(2)
   .show())

+---+---+-----+
|  a|  b|count|
+---+---+-----+
|  1|  2|    3|
|  2|  1|    2|
+---+---+-----+
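One thing worth checking in the sample data: the combinations (2, 1) and (1, 3) both occur twice, so which of them lands in the second row after sorting by count alone is not guaranteed. The following is a minimal plain-Python sketch of the same group, count, sort, and limit logic, useful for sanity-checking the result without a Spark session. The function name top_combinations and the tie-breaking rule (ascending a, then b) are assumptions made for illustration, not part of the original answer.

```python
from collections import Counter

# Sample rows (a, b) copied from the DataFrame shown in the question.
rows = [(1, 3), (2, 1), (2, 3), (1, 2), (3, 1), (1, 2), (2, 1), (1, 3), (1, 2)]


def top_combinations(pairs, n):
    """Return the n most frequent (a, b) pairs as (a, b, count) tuples.

    Hypothetical helper: ties on count are broken by ascending a, then b,
    so the result is deterministic.
    """
    counts = Counter(pairs)  # plays the role of groupBy("a", "b").count()
    # Sort by count descending, then by the (a, b) pair ascending.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(a, b, c) for (a, b), c in ordered[:n]]  # plays the role of limit(n)


top2 = top_combinations(rows, 2)
print(top2)
```

With this particular tie-breaking rule the second row is (1, 3) rather than (2, 1). In PySpark, a comparable effect should be achievable by passing a and b as additional sort columns after the descending count, so that the output no longer depends on how tied rows happen to be ordered.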
