python - Python2.7 - CSV DictReader -
I have a question about how to read a CSV file so that I can apply the techniques that exist in NLTK. The goal is to process the CSV file line by line, not as a single line.
My first attempt was with: file = open("data/myfile.csv"). The .csv file has 40k+ rows. I realized that this approach did not fit my purpose, so I changed it to:
    import csv
    import preprocessing
    from preprocessing import preprocessing

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

    def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
        # csv.py doesn't do Unicode; encode temporarily as UTF-8:
        csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
        for row in csv_reader:
            # decode UTF-8 back to Unicode, cell by cell:
            yield [unicode(cell, 'utf-8') for cell in row]

    with open("data/myfile.csv", 'rb') as csvfile:
        # I had to remove the sniffer, because without the delimiters being
        # indicated it raised an error saying it did not find delimiters.
        # dialect = csv.Sniffer().sniff(csvfile.read(1024))
        # csvfile.seek(0)
        lower_stream = (line.lower() for line in csvfile)  # normalizing: put the text in lowercase
        # reading the file
        corpus = csv.DictReader(unicode_csv_reader(lower_stream),
                                fieldnames='status_message', dialect='excel')

    def status_processing(corpus):
        mycorpus = preprocessing.preprocessing()
        mycorpus.text = corpus
        mycorpus.initial_processing()
fieldnames='status_message' is the field I want to read. status_message is the header used to identify the texts contained in the CSV.
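One detail that may be worth checking here: `csv.DictReader` documents `fieldnames` as a sequence of names, so a bare string is read as a sequence of single characters. The sketch below shows the difference on a made-up two-row sample that stands in for data/myfile.csv. It is written against Python 3, where `csv` handles text directly; on Python 2.7 the encoding wrapper from the question would still apply.

```python
import csv
import io

# Hypothetical sample standing in for data/myfile.csv (no header row here).
sample = io.StringIO("first status\nsecond status\n")

# A bare string is treated as a sequence of characters: 's', 't', 'a', ...
as_string = csv.DictReader(sample, fieldnames='status_message')
first_row = next(as_string)
keys_from_string = list(first_row.keys())

sample.seek(0)
# A one-element list gives a single column named 'status_message'.
as_list = csv.DictReader(sample, fieldnames=['status_message'])
rows = [row['status_message'] for row in as_list]
```

With the list form, each row is a dict whose `'status_message'` value is the text of that line.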
After that, I start applying the techniques that make NLTK easier to use on the text, one of them being BeautifulSoup.
The way this is done is shown in def status_processing(corpus).
The invoked method, which lives in the other script, is constructed like this:
    tokens = None

    def initial_processing(self):
        soup = BeautifulSoup(self.text, "html.parser")
        self.text = soup.get_text()
        # TODO: change here if you want to keep the links
        self.text = re.sub(r'(http://|https://|www.)[^"\' ]+', " ", self.text)
        self.tokens = self.tokenizing(1, self.text)
        pass
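To illustrate what the `re.sub` call in this method does in isolation, here is a small sketch on a made-up string. The helper name `strip_links` and the sample text are hypothetical and not part of the original class.

```python
import re

# Same pattern as in initial_processing. Note that the '.' in 'www.' is
# unescaped, so it matches any character, not only a literal dot.
LINK_PATTERN = r'(http://|https://|www.)[^"\' ]+'

def strip_links(text):
    """Replace each link-like run with a single space (hypothetical helper)."""
    return re.sub(LINK_PATTERN, " ", text)

cleaned = strip_links("see http://example.com/page now")
```

The match stops at the first space or quote character, so the words around the link are left in place.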
This way, when I run the script, the following error message is displayed:
    line 39, in initial_processing
        soup = BeautifulSoup(self.text,"html.parser")
      File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 176, in __init__
        elif len(markup) <= 256:
    AttributeError: DictReader instance has no attribute '__len__'
How can I read the CSV file line by line without this error occurring?
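One possible direction, offered as a sketch rather than a definitive fix: the traceback shows `BeautifulSoup` calling `len()` on its input, and a `DictReader` is an iterator over rows rather than a string, so the text of each row could be passed instead of the reader itself. The example below assumes Python 3, uses an in-memory sample in place of the real file, and uses a hypothetical `process_status` function as a stand-in for `initial_processing`.

```python
import csv
import io

def process_status(text):
    """Hypothetical stand-in for the real preprocessing; expects one string."""
    return text.lower().split()

# Made-up sample with a header row, standing in for data/myfile.csv.
sample = io.StringIO("status_message\nHello World\nSecond Post\n")

reader = csv.DictReader(sample)  # the header row supplies the field names

# The reader itself has no length, which is what the traceback reports.
# Python 3 raises TypeError here; Python 2.7 raises AttributeError.
try:
    len(reader)
    reader_has_len = True
except (TypeError, AttributeError):
    reader_has_len = False

# Each row is a dict; its 'status_message' value is a plain string.
tokens_per_row = [process_status(row['status_message']) for row in reader]
```

Processing one row at a time also keeps memory use flat for a file with 40k+ rows, since the reader yields rows lazily.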