dataset - Cartesian product of only one column in PySpark?
As the title says, I want the Cartesian product of only one column of an RDD. Example:
rdd1:
id1 a
id2 b
id3 c
My output should be:
id1 a a
id1 a b
id1 a c
id2 b a
id2 b b
id2 b c
id3 c a
id3 c b
id3 c c
You can do this by creating a new RDD that holds only the second column:
rdd2 = rdd.map(lambda l: l[1])
Then take the Cartesian product of the two RDDs:
rdd.cartesian(rdd2).map(lambda v: (v[0][0], v[0][1], v[1]))
The map is there because cartesian returns rows of the form ((id1, a), a), and the map flattens each of them into (id1, a, a).
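To see what each step produces without starting a Spark session, the same logic can be mimicked in plain Python. This is only a rough sketch: the `rows` list is hypothetical sample data modeled on the question, and `itertools.product` stands in for `cartesian`; it is not the Spark API itself.

```python
from itertools import product

# Hypothetical sample data modeled on the question: (id, value) pairs.
rows = [("id1", "a"), ("id2", "b"), ("id3", "c")]

# Stand-in for rdd.map(lambda l: l[1]): keep only the second column.
values = [row[1] for row in rows]

# Stand-in for rdd.cartesian(rdd2): every (row, value) pair,
# e.g. (("id1", "a"), "a").
pairs = list(product(rows, values))

# Stand-in for the final map: flatten ((id, v1), v2) into (id, v1, v2).
result = [(p[0][0], p[0][1], p[1]) for p in pairs]

for r in result:
    print(*r)
```

With three input rows this gives nine flattened triples. In Spark itself the order of the rows coming out of cartesian is not something to rely on, so sort the result if the order matters.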