PySpark: iterate inside small groups in a DataFrame


I am trying to understand how to do operations inside small groups in a PySpark DataFrame. Suppose I have a DataFrame with the following schema:

root
 |-- first_id: string (nullable = true)
 |-- second_id_struct: struct (nullable = true)
 |    |-- s_id: string (nullable = true)
 |    |-- s_id_2: int (nullable = true)
 |-- depth_from: float (nullable = true)
 |-- depth_to: float (nullable = true)
 |-- total_depth: float (nullable = true)

Given data with that schema, I want to:

  1. group the data by first_id,
  2. order the rows inside each group by s_id_2 in ascending order, and
  3. append a column layer (either inside the struct or at the root of the DataFrame) that indicates the order of s_id_2 within the group.

For example:

first_id | second_id | second_id_order
---------|-----------|----------------
      a1 |   [b, 10] |               1
      a1 |   [b, 14] |               2
      a1 |   [b, 22] |               3
      a5 |    [a, 1] |               1
      a5 |    [a, 7] |               2
      a7 |      null |               1

Once grouped, each first_id will have at most 4 second_id_structs. How should I approach this kind of problem?

I am particularly interested in how to do iterative operations inside small groups (1-40 rows) of DataFrames in general, where the order of columns inside a group matters.
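For the general case (arbitrary per-group logic rather than just ranking), one common pattern is groupBy(...).applyInPandas(...), which hands every group to a plain pandas function. A minimal sketch, assuming Spark 3.0+, pandas, and a flattened DataFrame df with plain columns first_id, s_id and s_id_2 (the function name and output schema here are made up for the example):

import pandas as pd

def add_order(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives all rows of one first_id group as a pandas DataFrame,
    # so ordinary ordered / row-by-row logic can run inside the group.
    pdf = pdf.sort_values('s_id_2')
    pdf['second_id_order'] = list(range(1, len(pdf) + 1))
    return pdf

ordered = df.groupBy('first_id').applyInPandas(
    add_order,
    schema='first_id string, s_id string, s_id_2 int, second_id_order int')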

Thanks!

First, create the DataFrame:

d = [{'first_id': 'a1', 'second_id': ['b', 10]},
     {'first_id': 'a1', 'second_id': ['b', 14]},
     {'first_id': 'a1', 'second_id': ['b', 22]},
     {'first_id': 'a5', 'second_id': ['a', 1]},
     {'first_id': 'a5', 'second_id': ['a', 7]}]

df = sqlContext.createDataFrame(d)

And we can see the structure:

df.printSchema()

root
 |-- first_id: string (nullable = true)
 |-- second_id: array (nullable = true)
 |    |-- element: string (containsNull = true)

df.show()

+--------+----------+
|first_id|second_id |
+--------+----------+
|      a1|   [b, 10]|
|      a1|   [b, 14]|
|      a1|   [b, 22]|
|      a5|    [a, 1]|
|      a5|    [a, 7]|
+--------+----------+

Then we can use dense_rank(), a window function, to show the order within each subgroup. It is the same as ranking over a partition in SQL.

The introduction to window functions: Introducing Window Functions in Spark SQL.
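Since this is the same operation as ranking over a partition in SQL, here is roughly the equivalent Spark SQL, as a sketch (the temporary table name t is arbitrary, and on very old Spark versions window functions required a HiveContext):

# Register the DataFrame so it can be queried with SQL; "t" is an arbitrary name.
df.registerTempTable('t')

sqlContext.sql("""
    SELECT first_id,
           second_id,
           DENSE_RANK() OVER (PARTITION BY first_id ORDER BY second_id[1]) AS second_id_order
    FROM t
""").show()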

And the same thing with the DataFrame API:

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# set up the window spec: partition by first_id, order by the second array element
windowSpec = Window.partitionBy('first_id').orderBy(df.second_id[1])

# apply dense_rank over the window spec
df.select(df.first_id,
          df.second_id,
          dense_rank().over(windowSpec).alias("second_id_order")).show()

Result:

+--------+---------+---------------+
|first_id|second_id|second_id_order|
+--------+---------+---------------+
|      a1|  [b, 10]|              1|
|      a1|  [b, 14]|              2|
|      a1|  [b, 22]|              3|
|      a5|   [a, 1]|              1|
|      a5|   [a, 7]|              2|
+--------+---------+---------------+
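The question also asked for the option of writing the order into the struct itself rather than as a top-level column. A minimal sketch of one way to do that, assuming a DataFrame named orig with the original schema (a struct column second_id_struct with fields s_id and s_id_2); the name orig and the field name layer follow the question, everything else is illustrative:

from pyspark.sql import Window
from pyspark.sql import functions as F

# "orig" stands for a DataFrame with the question's original schema,
# i.e. a struct column second_id_struct with fields s_id and s_id_2.
w = Window.partitionBy('first_id').orderBy('second_id_struct.s_id_2')

result = (orig
          .withColumn('layer', F.dense_rank().over(w))   # rank within each first_id group
          .withColumn('second_id_struct',                 # rebuild the struct with the new field
                      F.struct(F.col('second_id_struct.s_id').alias('s_id'),
                               F.col('second_id_struct.s_id_2').alias('s_id_2'),
                               F.col('layer')))
          .drop('layer'))                                 # keep the order only inside the struct

On Spark 3.1+ the same thing can be written more compactly with Column.withField('layer', ...).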
