pyspark - Using Spark Pivot for get_dummies Substitute
The following code puts the DataFrame in exactly the format I need, except that the columns are not named "correctly."
df = spark.createDataFrame([
    (0, "x", "a"),
    (1, "z", "b"),
    (2, "x", "b"),
    (3, "x", "c"),
    (4, "y", "c"),
    (5, "y", "a")
], ["id", "category", "other_thing"])

pivotdf = df.groupby("id").pivot("category").count()
pivotdf.show()

+---+----+----+----+
| id|   x|   y|   z|
+---+----+----+----+
|  0|   1|null|null|
|  5|null|   1|null|
|  1|null|null|   1|
|  3|   1|null|null|
|  2|   1|null|null|
|  4|null|   1|null|
+---+----+----+----+

I need this output:
+---+-------------+-------------+-------------+
| id|   category_x|   category_y|   category_z|
+---+-------------+-------------+-------------+
|  0|            1|         null|         null|
|  5|         null|            1|         null|
|  1|         null|         null|            1|
|  3|            1|         null|         null|
|  2|            1|         null|         null|
|  4|         null|            1|         null|
+---+-------------+-------------+-------------+

How can I add the column names programmatically (i.e., so that I don't have to manually type in "category" in this case)?
You can rename the columns after pivoting:
>>> pivot_col = "category"
>>> pivotdf = df.groupby("id").pivot(pivot_col).count()
>>> new_names = pivotdf.columns[:1] + \
...     ["{0}_{1}".format(pivot_col, c) for c in pivotdf.columns[1:]]
>>> pivotdf.toDF(*new_names)
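
If you pivot on more than one column, or group by more than one key, it may be handy to wrap the renaming step in a small function. The sketch below is only an illustration: `prefix_pivot_columns` and its `n_group_cols` parameter are hypothetical names, not part of PySpark, and the function works on a plain list of column names so it can be tried without a Spark session.

```python
def prefix_pivot_columns(columns, pivot_col, n_group_cols=1):
    """Return new column names with pivot_col prefixed onto the pivoted columns.

    The first n_group_cols names (assumed to be the groupBy keys) are
    left unchanged; every remaining name becomes "<pivot_col>_<name>".
    """
    columns = list(columns)
    keys = columns[:n_group_cols]
    pivoted = columns[n_group_cols:]
    return keys + ["{0}_{1}".format(pivot_col, c) for c in pivoted]


# Column names as produced by the pivot in the question.
names = prefix_pivot_columns(["id", "x", "y", "z"], "category")
print(names)  # ['id', 'category_x', 'category_y', 'category_z']
```

With a real DataFrame you would pass the result to `toDF`, for example `pivotdf.toDF(*prefix_pivot_columns(pivotdf.columns, pivot_col))`. This assumes the grouping keys come first in `pivotdf.columns`, as they do in the output shown in the question.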