Spark: querying a DataFrame vs. doing a join
I'm on Spark 1.5. There is a static dataset that may range from hundreds of MB to some GB (here I discard the option of broadcasting the dataset: too much memory is needed). I have a Spark Streaming input and I want to enrich its data with the static dataset, given a common key (I understand this can be done using transform on the DStream to apply RDD/PairRDD logic). The key cardinality is high, in the thousands.
Here are the options I can see:
I can make a full join. I guess this would scale well in terms of memory, but it could pose problems if the data has to flow between nodes. I understand it may pay off to partition both the static and the input RDDs by the same key.
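The co-partitioned join described above could be sketched roughly as follows. This is a minimal sketch, not tested against a cluster; `staticRdd` (an `RDD[(String, StaticValue)]`), `stream` (a `DStream[(String, Event)]`), and the `StaticValue`/`Event` types and partition count are all assumed placeholders, not names from the question:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

val numPartitions = 64  // illustrative; tune to cluster size
val partitioner = new HashPartitioner(numPartitions)

// Partition the static side once and cache it, so every micro-batch
// join reuses the same layout and only the streaming side is shuffled.
val staticPartitioned: RDD[(String, StaticValue)] =
  staticRdd.partitionBy(partitioner).cache()

val enriched: DStream[(String, (Event, StaticValue))] =
  stream.transform { batchRdd =>
    // Applying the same partitioner to the batch lets the join run
    // without re-shuffling the (much larger) static RDD.
    batchRdd.partitionBy(partitioner).join(staticPartitioned)
  }
```

The key point of the sketch is that `partitionBy` with an identical partitioner on both sides turns the per-batch join into a narrow dependency on the static side, so only the small streaming batch moves between nodes.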
I was considering, though, having the data loaded into a DataFrame and querying it every time from the input. What is the performance penalty of that? I think this is not the proper way to use it unless the stream has low cardinality, right?
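The DataFrame variant could look like the sketch below (Spark 1.5 API). Again a rough, untested sketch: `sc`, `staticDf` (a cached DataFrame with a "key" column), and the stream's element type are assumed placeholders:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Cache the static DataFrame so it is not re-read on every batch.
staticDf.cache()

val enriched = stream.transform { batchRdd =>
  // Requires the batch RDD to contain case-class rows (or an explicit
  // schema) so Spark can infer the DataFrame structure.
  val batchDf = sqlContext.createDataFrame(batchRdd)
  // Each micro-batch issues a fresh join query against the cached
  // DataFrame; the optimizer re-plans it every interval, which adds
  // per-batch overhead on top of the shuffle itself.
  batchDf.join(staticDf, "key").rdd
}
```

Note that under the hood this still executes as a join per micro-batch, so it does not avoid the shuffle; the per-query planning cost is the extra penalty the question is asking about.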
Are my assumptions correct? If so, is the full join with partitioning the preferred option?