apache spark - How to open a file which is stored in HDFS in pySpark using with open


How do I open a file that is stored in HDFS? Here the input file is in HDFS. If I open the file as below, it won't open and shows a "file not found" error.

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

def getMovieName():
    movieNames = {}
    with open("/user/sachinkerala6174/indata/moviestat") as f:
        for line in f:
            fields = line.split("|")
            mid = fields[0]
            mname = fields[1]
            movieNames[int(fields[0])] = fields[1]
    return movieNames

nameDict = sc.broadcast(getMovieName())

My assumption was to use

with open(sc.textFile("/user/sachinkerala6174/indata/moviestat")) as f:

but it didn't work.

To read the text file into an RDD:

rdd_name = sc.textFile("/user/sachinkerala6174/indata/moviestat")

You can use collect() to work with the data in pure Python (not recommended; use it only on small data), or use Spark RDD methods to manipulate it with PySpark methods (the recommended way).
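For example, here is a minimal sketch of the RDD route applied to the question's file, assuming the same pipe-delimited "id|name" layout and that the lookup table is small enough to collect to the driver:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

# Read the HDFS file as an RDD of lines, split on "|" and keep (movieID, movieName) pairs.
# The pipe-delimited layout is taken from the question's code; adjust if the file differs.
movies = sc.textFile("/user/sachinkerala6174/indata/moviestat") \
           .map(lambda line: line.split("|")) \
           .map(lambda fields: (int(fields[0]), fields[1]))

# collectAsMap() pulls the pairs back to the driver as a plain dict (fine for a small
# lookup file), which can then be broadcast to the executors as in the original code.
nameDict = sc.broadcast(movies.collectAsMap())

Inside a later transformation you could then look names up with nameDict.value.get(movie_id).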

For more info: http://spark.apache.org/docs/2.0.1/api/python/pyspark.html

textFile(name, minPartitions=None, use_unicode=True)

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)

>>> path = os.path.join(tempdir, "sample-text.txt")
>>> with open(path, "w") as testFile:
...    _ = testFile.write("Hello world!")
>>> textFile = sc.textFile(path)
>>> textFile.collect()
[u'Hello world!']
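As a side note on the use_unicode flag described above, a minimal sketch (assuming sc is an existing SparkContext and the same HDFS path as in the question):

# With use_unicode=False each line is kept as a utf-8 encoded str instead of unicode,
# which can be faster and use less memory for large, mostly-ASCII files.
raw = sc.textFile("/user/sachinkerala6174/indata/moviestat", use_unicode=False)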
