apache spark - When to call `.value` for broadcasts in PySpark?


Is one of the following versions sub-optimal?

    ## first version ##
    def myfunc(val, listparam):
        return val in listparam.value  # .value inside the function

    mylist_bc = sc.broadcast(mylist)
    rdd.map(lambda val: myfunc(val, mylist_bc))

    ## second version ##
    def myfunc(val, listparam):
        return val in listparam

    mylist_bc = sc.broadcast(mylist)
    rdd.map(lambda val: myfunc(val, mylist_bc.value))  # .value outside the function

Is it OK to use the second version, where the function is unaware that I'm passing it a broadcast value? I thought that might interfere with the broadcasting. From the Spark programming guide:

http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

"After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once."

I'd use option #1, so you know the executor is actually using the broadcast variable.

Option #2 might be problematic if the value of the broadcast variable ends up being evaluated on the driver and then sent to the executors as a regular variable.

