dataset - Cartesian product of only one column in PySpark?


As the title says, I want the Cartesian product of just one column of an RDD. Example:

rdd1:
id1 a
id2 b
id3 c

My output should be:

id1 a a
id1 a b
id1 a c
id2 b a
id2 b b
id2 b c
id3 c a
id3 c b
id3 c c

You can create a new RDD that holds only the second column, rdd2 = rdd.map(lambda l: l[1]), and then take the Cartesian product of the two RDDs:

rdd.cartesian(rdd2).map(lambda v: (v[0][0],v[0][1],v[1]))

The map is there because cartesian returns nested rows such as ((id1, a), a), and the map flattens each one into (id1, a, a).
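To check the shape of the result without a Spark session, the same two steps can be mimicked on plain Python lists. This is only a local sketch: itertools.product stands in for rdd.cartesian, and the names rows, second_col, pairs and result are illustrative, not part of the original answer. Spark itself may return the rows in a different order.

```python
from itertools import product

# Sample rows from the question: (id, value) pairs.
rows = [("id1", "a"), ("id2", "b"), ("id3", "c")]

# Stand-in for rdd.map(lambda l: l[1]): keep only the second column.
second_col = [r[1] for r in rows]

# product(rows, second_col) yields ((id, value), other_value), the same
# nested shape that rdd.cartesian(rdd2) produces.
pairs = list(product(rows, second_col))

# Stand-in for the final map: flatten each nested pair into a 3-tuple.
result = [(v[0][0], v[0][1], v[1]) for v in pairs]

for row in result:
    print(*row)
```

Each of the three input rows is paired with each of the three second-column values, so the result has nine rows.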

