Spark: querying a DataFrame vs. doing a join
I'm on Spark 1.5. There is a static dataset that may range from hundreds of MB to some GB (here I discard the option of broadcasting the dataset: too much memory is needed). I have a Spark Streaming input and I want to enrich its data with the static dataset, given a common key (I understand this can be done using transform on the DStream to apply RDD/PairRDD logic). The key cardinality is high, in the thousands.
Here are the options I can see:
I can make a full join. I guess this would scale well in terms of memory, but it could pose problems if the data has to flow between nodes. I understand it may pay off to partition both the static and the input RDDs by the same key.
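The co-partitioned join described above could be sketched roughly as follows. This is a minimal sketch, not tested against a cluster; `staticRdd` (an `RDD[(String, StaticValue)]`), `stream` (a `DStream[(String, Event)]`), and the `StaticValue`/`Event` types and partition count are all assumed placeholders, not names from the question:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

val numPartitions = 64  // illustrative; tune to cluster size
val partitioner = new HashPartitioner(numPartitions)

// Partition the static side once and cache it, so every micro-batch
// join reuses the same layout and only the streaming side is shuffled.
val staticPartitioned: RDD[(String, StaticValue)] =
  staticRdd.partitionBy(partitioner).cache()

val enriched: DStream[(String, (Event, StaticValue))] =
  stream.transform { batchRdd =>
    // Applying the same partitioner to the batch lets the join run
    // without re-shuffling the (much larger) static RDD.
    batchRdd.partitionBy(partitioner).join(staticPartitioned)
  }
```

The key point of the sketch is that `partitionBy` with an identical partitioner on both sides turns the per-batch join into a narrow dependency on the static side, so only the small streaming batch moves between nodes.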
I was considering, though, having the data loaded into a DataFrame and querying it every time from the input. What is the performance penalty of that? I think this is not the proper way to use it unless the stream has low cardinality, right?
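The DataFrame variant could look like the sketch below (Spark 1.5 API). Again a rough, untested sketch: `sc`, `staticDf` (a cached DataFrame with a "key" column), and the stream's element type are assumed placeholders:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Cache the static DataFrame so it is not re-read on every batch.
staticDf.cache()

val enriched = stream.transform { batchRdd =>
  // Requires the batch RDD to contain case-class rows (or an explicit
  // schema) so Spark can infer the DataFrame structure.
  val batchDf = sqlContext.createDataFrame(batchRdd)
  // Each micro-batch issues a fresh join query against the cached
  // DataFrame; the optimizer re-plans it every interval, which adds
  // per-batch overhead on top of the shuffle itself.
  batchDf.join(staticDf, "key").rdd
}
```

Note that under the hood this still executes as a join per micro-batch, so it does not avoid the shuffle; the per-query planning cost is the extra penalty the question is asking about.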
Are my assumptions correct? If so, is the full join with partitioning the preferred option?