python - Python 2.7 - CSV DictReader


I have a question about how to read a CSV file so that I can apply the techniques that exist in NLTK. The goal is to process the CSV file line by line, not as a single line.

My first attempt was with file = open("data/myfile.csv"). The .csv file has 40k+ rows. I realized this approach did not fit my purpose, so I changed it to:

    import csv
    import preprocessing
    from preprocessing import preprocessing

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

    def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
        # csv.py doesn't do Unicode; encode temporarily as UTF-8:
        csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
        for row in csv_reader:
            # decode UTF-8 back to Unicode, cell by cell:
            yield [unicode(cell, 'utf-8') for cell in row]

    with open("data/myfile.csv", 'rb') as csvfile:
        # I had to remove the Sniffer because, without the delimiters being
        # indicated, it gave an error saying it did not find the delimiters.
        # dialect = csv.Sniffer().sniff(csvfile.read(1024))
        # csvfile.seek(0)
        lower_stream = (line.lower() for line in csvfile)  # normalizing: putting the text in lowercase
        # reading the file
        corpus = csv.DictReader(unicode_csv_reader(lower_stream),
                                fieldnames='status_message', dialect='excel')

    def status_processing(corpus):
        mycorpus = preprocessing.preprocessing()
        mycorpus.text = corpus
        mycorpus.initial_processing()

fieldnames='status_message' is the field I want to read. status_message is the header used to identify the texts contained in the CSV.
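As a side note on the fieldnames argument: csv.DictReader expects a sequence of names there, so a bare string is iterated character by character. The sketch below uses made-up sample rows (the "likes" column and the row contents are assumptions, not data from the real file) to compare letting the header row supply the names with passing a bare string.

```python
import csv

# Hypothetical sample data standing in for data/myfile.csv.
lines = [
    "status_message,likes",
    "hello <b>world</b>,3",
    "second post,5",
]

# With no fieldnames argument, the first row is consumed as the header.
rows = list(csv.DictReader(lines))
messages = [row["status_message"] for row in rows]

# With a bare string, each character is treated as a separate field name,
# and the header row is returned as an ordinary data row.
bad_rows = list(csv.DictReader(lines, fieldnames="status_message"))
```

If the header row should be skipped while still naming the columns explicitly, a list such as fieldnames=['status_message', 'likes'] would be the usual form, followed by discarding the first row.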

After that, I start applying techniques that make NLTK easier to use on the text; one of them is BeautifulSoup.

The way I do this is shown in def status_processing(corpus).

The invoked method, which lives in the other script, is constructed like this:

    tokens = None

    def initial_processing(self):
        soup = BeautifulSoup(self.text, "html.parser")
        self.text = soup.get_text()
        # TODO: if you want to save the links, change this here
        self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", self.text)
        self.tokens = self.tokenizing(1, self.text)
        pass

When I run the script this way, the following error message is displayed:

      line 39, in initial_processing
        soup = BeautifulSoup(self.text, "html.parser")
      File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 176, in __init__
        elif len(markup) <= 256:
    AttributeError: DictReader instance has no attribute '__len__'

How can I read the CSV file line by line without this error?
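For context, the traceback points at len(markup) being called on the DictReader itself, which is an iterator over rows rather than a string. One possible direction, sketched here with only the standard library, is to loop over the reader and hand each row's text to the processing step. The sample rows and the strip_links helper are hypothetical stand-ins; strip_links reuses only the URL regex from initial_processing and leaves out the BeautifulSoup and tokenizing steps.

```python
import csv
import re

# Hypothetical sample data; the real file has 40k+ rows.
lines = [
    "status_message",
    "check http://example.com now",
    "plain text",
]

reader = csv.DictReader(lines)

# The reader yields rows one at a time and has no length of its own.
try:
    len(reader)
    has_len = True
except (TypeError, AttributeError):  # TypeError on Python 3, AttributeError on 2.7
    has_len = False

def strip_links(text):
    # Stand-in for initial_processing(): same URL regex as in the post.
    return re.sub(r'(http://|https://|www.)[^"\' ]+', " ", text)

# Process one row at a time, passing a string rather than the reader.
cleaned = [strip_links(row["status_message"]) for row in reader]
```

In the original script this would roughly correspond to assigning row['status_message'] to mycorpus.text inside a loop over corpus, while the file is still open, instead of assigning the whole corpus.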

