python - PySpark: iterate inside small groups in DataFrame
I am trying to understand how I can do operations inside small groups in a PySpark DataFrame. Suppose I have a DataFrame with the following schema:
root
 |-- first_id: string (nullable = true)
 |-- second_id_struct: struct (nullable = true)
 |    |-- s_id: string (nullable = true)
 |    |-- s_id_2: int (nullable = true)
 |-- depth_from: float (nullable = true)
 |-- depth_to: float (nullable = true)
 |-- total_depth: float (nullable = true)
So the data might look something like the example further below. I would like to:

- group the data by first_id
- inside each group, order the rows by s_id_2 in ascending order
- append a column layer, either to the struct or to the root DataFrame, that indicates the order of s_id_2 within the group.
For example:
first_id | second_id | second_id_order
---------|-----------|----------------
a1       | [b, 10]   | 1
a1       | [b, 14]   | 2
a1       | [b, 22]   | 3
a5       | [a, 1]    | 1
a5       | [a, 7]    | 2
a7       | null      | 1
Once grouped, each first_id will have at most 4 second_id_struct values. How do I approach this kind of problem?
I am particularly interested in how to make iterative operations inside small groups (1-40 rows) of DataFrames in general, where the order of columns inside a group matters.

Thanks!
Create the DataFrame first:
d = [{'first_id': 'a1', 'second_id': ['b', 10]},
     {'first_id': 'a1', 'second_id': ['b', 14]},
     {'first_id': 'a1', 'second_id': ['b', 22]},
     {'first_id': 'a5', 'second_id': ['a', 1]},
     {'first_id': 'a5', 'second_id': ['a', 7]}]

df = sqlContext.createDataFrame(d)
And we can see its structure:
df.printSchema()

root
 |-- first_id: string (nullable = true)
 |-- second_id: array (nullable = true)
 |    |-- element: string (containsNull = true)

df.show()

+--------+---------+
|first_id|second_id|
+--------+---------+
|      a1|  [b, 10]|
|      a1|  [b, 14]|
|      a1|  [b, 22]|
|      a5|   [a, 1]|
|      a5|   [a, 7]|
+--------+---------+
Then we can use dense_rank(), a window function, to show the order within each subgroup. It is the same as OVER (PARTITION BY ...) in SQL.
An introduction to window functions: Introducing Window Functions in Spark SQL
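For reference, a rough sketch of the same thing in raw SQL (this assumes the DataFrame is registered as a temporary table named t, and a Spark version whose SQL dialect supports window functions):

df.registerTempTable('t')

sqlContext.sql("""
    SELECT first_id,
           second_id,
           DENSE_RANK() OVER (PARTITION BY first_id ORDER BY second_id[1]) AS second_id_order
    FROM t
""").show()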
The DataFrame API code is here:
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# set up the window spec: partition by first_id, order by the second element of second_id
windowSpec = Window.partitionBy('first_id').orderBy(df.second_id[1])

# apply dense_rank over the window spec
df.select(df.first_id,
          df.second_id,
          dense_rank().over(windowSpec).alias("second_id_order")).show()
The result:
+--------+---------+---------------+
|first_id|second_id|second_id_order|
+--------+---------+---------------+
|      a1|  [b, 10]|              1|
|      a1|  [b, 14]|              2|
|      a1|  [b, 22]|              3|
|      a5|   [a, 1]|              1|
|      a5|   [a, 7]|              2|
+--------+---------+---------------+
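To apply the same idea to the original schema from the question (a second_id_struct with an s_id_2 field), the window can order by the nested field, and the rank can either stay as a top-level layer column or be folded back into the struct. A rough sketch, assuming the column names from the question's schema:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('first_id').orderBy(F.col('second_id_struct.s_id_2'))

# layer as a top-level column
df_with_layer = df.withColumn('layer', F.dense_rank().over(w))

# or rebuild the struct so that layer lives inside it
df_nested = df_with_layer.withColumn(
    'second_id_struct',
    F.struct(
        F.col('second_id_struct.s_id').alias('s_id'),
        F.col('second_id_struct.s_id_2').alias('s_id_2'),
        F.col('layer')
    )
)

Note that dense_rank() gives tied s_id_2 values the same number; use row_number() instead if every row should get its own position.

For the more general case of arbitrary iterative logic inside small groups (the 1-40 row groups mentioned in the question), newer Spark versions (3.0+) also offer groupBy(...).applyInPandas, which hands each group to a Python function as a pandas DataFrame. A sketch against the small df built above; the helper name add_order and the returned schema are only illustrative:

import pandas as pd

def add_order(pdf):
    # pdf contains every row of one first_id group as a pandas DataFrame
    pdf = pdf.copy()
    pdf['_key'] = pdf['second_id'].map(lambda x: x[1])  # second element of the array
    pdf = pdf.sort_values('_key').drop(columns='_key')
    pdf['second_id_order'] = range(1, len(pdf) + 1)
    return pdf

df.groupBy('first_id').applyInPandas(
    add_order,
    schema='first_id string, second_id array<string>, second_id_order int'
).show()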