apache spark - When to call `.value` for broadcasts in PySpark?

Is one of the following versions sub-optimal?
```python
## first version ##
def myfunc(val, listparam):
    return val in listparam.value  # .value called inside the function

mylist_bc = sc.broadcast(mylist)
rdd.map(lambda val: myfunc(val, mylist_bc))

## second version ##
def myfunc(val, listparam):
    return val in listparam

mylist_bc = sc.broadcast(mylist)
rdd.map(lambda val: myfunc(val, mylist_bc.value))  # .value called outside the function
```
Is it OK to use the second version, where the function is unaware that I'm passing it a broadcasted value? I thought it might interfere with the broadcasting.
The docs (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables) say: after the broadcast variable is created, it should be used instead of the value `v` in any functions run on the cluster, so that `v` is not shipped to the nodes more than once.
I'd use option #1 - that way you know for sure that the executor is using the broadcast variable.

Option #2 might be problematic: `.value` is evaluated on the driver, so the full value of the broadcast variable is dereferenced there and then shipped to the executors as a regular variable inside the serialized task closure, which defeats the purpose of broadcasting.
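A plain-Python sketch can illustrate the difference without a Spark cluster. Spark ships task closures to executors via pickle, and a real `pyspark.Broadcast` handle serializes as a small reference rather than the data it wraps. `FakeBroadcast` below is a hypothetical stand-in that mimics that behaviour, so we can compare the pickled size of a task that captures the handle (option #1) against one that captures the dereferenced value (option #2):

```python
import pickle
from functools import partial

class FakeBroadcast:
    """Hypothetical stand-in for pyspark.Broadcast: like the real handle,
    it pickles down to a tiny reference instead of the wrapped data."""
    def __init__(self, value):
        self.value = value

    def __reduce__(self):
        # Mimic Spark: serializing the handle does NOT serialize the data.
        return (FakeBroadcast, (None,))

def myfunc_bc(val, listparam):
    return val in listparam.value  # option #1: dereference on the "executor"

def myfunc_plain(val, listparam):
    return val in listparam        # option #2: value already dereferenced

mylist = list(range(100_000))
mylist_bc = FakeBroadcast(mylist)

# Option #1: the task captures the broadcast handle -> tiny pickled closure.
task_1 = partial(myfunc_bc, listparam=mylist_bc)
# Option #2: .value is called on the "driver", so the task captures the
# whole list -> the full data is shipped with every task.
task_2 = partial(myfunc_plain, listparam=mylist_bc.value)

print(len(pickle.dumps(task_1)))  # small: just a function ref + handle
print(len(pickle.dumps(task_2)))  # large: the entire 100k-element list
```

The pickled size gap is exactly why option #1 is preferable: with the handle in the closure, each executor fetches the broadcast data once, instead of receiving a copy with every task.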