apache spark - How to open a file which is stored in HDFS in pySpark using with open -


How do I open a file that is stored in HDFS? Here the input file is in HDFS. If I give the file as below, it won't open and shows "file not found".

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

def getMovieName():
    movieNames = {}
    with open("/user/sachinkerala6174/indata/moviestat") as f:
        for line in f:
            fields = line.split("|")
            movieNames[int(fields[0])] = fields[1]
    return movieNames

nameDict = sc.broadcast(getMovieName())

My assumption was to use

with open(sc.textFile("/user/sachinkerala6174/indata/moviestat")) as f:

but it didn't work.

To read the text file into an RDD:

rdd_name = sc.textFile("/user/sachinkerala6174/indata/moviestat")

You can use "collect()" in order to work on it in pure Python (not recommended - use only on small data), or use Spark RDD methods in order to manipulate it with pyspark methods (the recommended way).
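Applied to the question's code, a minimal sketch might look like the following. It assumes the movie file is small enough to collect() onto the driver; `parse_movie_line` is a hypothetical helper name, and the path is the one from the question.

```python
def parse_movie_line(line):
    # A line looks like "123|Movie Title|...": return (id, name).
    fields = line.split("|")
    return int(fields[0]), fields[1]

if __name__ == "__main__":
    from pyspark import SparkConf, SparkContext  # requires a Spark installation

    sc = SparkContext(conf=SparkConf())
    # sc.textFile understands HDFS paths, unlike the built-in open().
    lines = sc.textFile("/user/sachinkerala6174/indata/moviestat")
    # Assumed small data: collect the parsed pairs and broadcast the dict.
    movieNames = dict(lines.map(parse_movie_line).collect())
    nameDict = sc.broadcast(movieNames)
```

The key point is that the built-in open() only sees the local file system, while sc.textFile resolves HDFS (and other Hadoop-supported) URIs.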

for more info: http://spark.apache.org/docs/2.0.1/api/python/pyspark.html

textFile(name, minPartitions=None, use_unicode=True)

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)

>>> path = os.path.join(tempdir, "sample-text.txt")
>>> with open(path, "w") as testFile:
...    _ = testFile.write("Hello world!")
>>> textFile = sc.textFile(path)
>>> textFile.collect()
[u'Hello world!']
