Apache Spark - Reducing a DataFrame to the most frequent combinations of two columns
I have a JSON file that I import using the following code:

    spark = SparkSession.builder.master("local").appName('gps').config(conf=SparkConf()).getOrCreate()
    df = spark.read.json("sensordata.json")

The resulting DataFrame looks similar to this:

    +---+---+
    |  a|  b|
    +---+---+
    |  1|  3|
    |  2|  1|
    |  2|  3|
    |  1|  2|
    |  3|  1|
    |  1|  2|
    |  2|  1|
    |  1|  3|
    |  1|  2|
    +---+---+

My task is to use PySpark to reduce the data to the most frequent combinations of the two columns (a, b).
So the wanted output is this:

    +---+---+-----+
    |  a|  b|count|
    +---+---+-----+
    |  1|  2|    3|
    |  2|  1|    2|
    +---+---+-----+

You can use a combination of groupBy, sort, and limit:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local").appName('gps').config(conf=SparkConf()).getOrCreate()
    df = spark.read.json("sensordata.json")

    # Count each (a, b) pair, sort by count descending, keep the top two
    (df.groupBy("a", "b")
       .count()
       .sort("count", ascending=False)
       .limit(2)
       .show())

    +---+---+-----+
    |  a|  b|count|
    +---+---+-----+
    |  1|  2|    3|
    |  2|  1|    2|
    +---+---+-----+
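
One thing worth checking in the sample data: the pairs (2, 1) and (1, 3) both occur twice, so which of them lands in second place after limit(2) may depend on row order. The following is a minimal plain-Python sketch (not Spark code) that reproduces the count, sort, and limit steps on the rows from the question. The tie-breaker on (a, b) is an assumption added here for illustration and is not part of the original answer.

```python
from collections import Counter

# Sample rows copied from the question's table, as (a, b) pairs.
rows = [(1, 3), (2, 1), (2, 3), (1, 2), (3, 1), (1, 2), (2, 1), (1, 3), (1, 2)]

# Counterpart of groupBy("a", "b").count(): occurrences per pair.
counts = Counter(rows)

# Counterpart of sort + limit(2). The secondary key (a, b) is an assumed
# tie-breaker so the result does not depend on the order of the rows.
top_two = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]

for (a, b), n in top_two:
    print(a, b, n)
```

With this particular tie-breaker the second row is (1, 3) rather than (2, 1). In PySpark, a similar effect could probably be had by adding the key columns to the ordering, for example orderBy(F.desc("count"), F.asc("a"), F.asc("b")) with pyspark.sql.functions imported as F. Which pair should win a tie is a choice the question leaves open.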