Spark: query DataFrame vs join


Spark 1.5. There is a static dataset that may range from hundreds of MB to some GB (so I discard the option of broadcasting the dataset, given the memory needed). I have a Spark Streaming input and want to enrich its data with the static dataset, given a common key (I understand this can be done using transform on the DStream to apply RDD/PairRDD logic). The key cardinality is high, on the order of thousands.

Here are the options I can see:

I can make a full join. I guess this would scale well in terms of memory, but it could pose problems if data has to flow between nodes. I understand it may pay off to partition both the static and the input RDDs by the same key.
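A minimal sketch of the join option's semantics, using plain Scala collections so it runs without a cluster (the names `staticData`, `batch`, and `enrich` are placeholders, not from the question). In Spark the equivalent would be pre-partitioning and caching the static RDD once, then joining each micro-batch against it inside `transform`:

```scala
// Spark equivalent (sketch, not executed here):
//   val staticRdd = raw.partitionBy(new HashPartitioner(n)).cache()
//   inputStream.transform(batch => batch.join(staticRdd))
// Pre-partitioning means only the (small) batch side is shuffled per interval.

// Static side: key -> reference record (hypothetical values).
val staticData: Map[String, String] =
  Map("k1" -> "ref1", "k2" -> "ref2")

// One streaming micro-batch: (key, payload) pairs.
val batch: Seq[(String, String)] =
  Seq(("k1", "a"), ("k2", "b"), ("k3", "c"))

// Inner-join semantics, as RDD.join would produce:
// keys present on both sides yield (key, (payload, reference)).
def enrich(batch: Seq[(String, String)],
           static: Map[String, String]): Seq[(String, (String, String))] =
  batch.flatMap { case (k, v) => static.get(k).map(r => (k, (v, r))) }

val enriched = enrich(batch, staticData)
// "k3" is dropped because it has no match on the static side.
```

Note the inner-join semantics: unmatched stream keys disappear, so a `leftOuterJoin` would be the Spark call to use if every input record must survive enrichment.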

I am considering, though, having the static data loaded into a DataFrame and querying it every time for each input. What would the performance penalty be? I think this is not the proper way to use it unless the stream has low cardinality, right?
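A rough sketch of why per-input querying scales worse than a single join per batch, again with plain-Scala stand-ins (all names hypothetical): each filter-style query scans the static side, so the per-batch cost is roughly O(batch × static), while a join builds the keyed side once and probes it, roughly O(batch + static). In Spark the gap is larger still, since each DataFrame query launches its own job.

```scala
// Static side: 1000 (key, reference) rows.
val static: Seq[(String, String)] = (1 to 1000).map(i => (s"k$i", s"ref$i"))
val batchKeys: Seq[String] = Seq("k1", "k500", "k999")

// Option B: one query per input element -- each lookup scans the whole
// static dataset (the analogue of df.filter($"key" === k) per record).
val perQuery: Seq[(String, String)] =
  batchKeys.flatMap(k => static.filter(_._1 == k))

// Option A: one join per batch -- build the keyed side once, probe in O(1).
val staticMap = static.toMap
val joined: Seq[(String, String)] =
  batchKeys.flatMap(k => staticMap.get(k).map(v => (k, v)))
// Same result, but one pass over the static data instead of one per key.
```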

Are my assumptions correct? If so, would the full join with partitioning be the preferred option?

