pyspark - Using Spark Pivot for get_dummies Substitute


The following code puts my DataFrame in the perfect format, but I need to have the columns named "correctly."

df = spark.createDataFrame([
    (0, "x", "a"),
    (1, "z", "b"),
    (2, "x", "b"),
    (3, "x", "c"),
    (4, "y", "c"),
    (5, "y", "a")
], ["id", "category", "other_thing"])

pivotdf = df.groupBy("id").pivot("category").count()

pivotdf.show()

+---+----+----+----+
| id|   x|   y|   z|
+---+----+----+----+
|  0|   1|null|null|
|  5|null|   1|null|
|  1|null|null|   1|
|  3|   1|null|null|
|  2|   1|null|null|
|  4|null|   1|null|
+---+----+----+----+

I need this output:

+---+----------+----------+----------+
| id|category_x|category_y|category_z|
+---+----------+----------+----------+
|  0|         1|      null|      null|
|  5|      null|         1|      null|
|  1|      null|      null|         1|
|  3|         1|      null|      null|
|  2|         1|      null|      null|
|  4|      null|         1|      null|
+---+----------+----------+----------+

How can I add the prefix to the column names programmatically (i.e., without having to manually type in "category" in this case)?

You can rename the columns after pivoting:

>>> pivot_col = "category"
>>> pivotdf = df.groupBy("id").pivot(pivot_col).count()
>>> new_names = pivotdf.columns[:1] + \
...     ["{0}_{1}".format(pivot_col, c) for c in pivotdf.columns[1:]]
>>> pivotdf.toDF(*new_names)
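If you need this for more than one pivot, the renaming step can be wrapped in a small helper. This is only a sketch: `prefixed_names` and its `n_keys` parameter are names made up for illustration (they are not part of PySpark), and it assumes the grouping columns come first in `pivotdf.columns`, as they do in the output above. The name-building part is plain Python, so it can be checked without a Spark session.

```python
def prefixed_names(columns, pivot_col, n_keys=1):
    """Prefix every non-key column name with `pivot_col`.

    `n_keys` (hypothetical parameter) is the number of leading grouping
    columns, such as "id", that should keep their original names.
    """
    keys = list(columns[:n_keys])
    pivoted = ["{0}_{1}".format(pivot_col, c) for c in columns[n_keys:]]
    return keys + pivoted


# Column names as they appear in the pivoted DataFrame from the question.
cols = ["id", "x", "y", "z"]
new_names = prefixed_names(cols, "category")
print(new_names)  # ['id', 'category_x', 'category_y', 'category_z']

# With a live DataFrame (assumes `pivotdf` exists as in the snippet above):
# renamed = pivotdf.toDF(*prefixed_names(pivotdf.columns, "category"))
```

If you group by several columns before pivoting, pass the number of grouping columns as `n_keys` so that none of them picks up the prefix.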
