regex - How to strip SGML tags from a text file using Python?

April 15, 2013

I came across Standard Generalized Markup Language (SGML) lately. I have acquired a corpus in SGML format, the EMILLE/CIIL corpus. The documentation for the corpus:

EMILLE corpus documentation

I want to extract the text present in the file. The encoding and markup information from the corpus documentation is:

"The text is encoded as two-byte Unicode text. Texts are marked up in SGML using level 1 CES-compliant markup. Each file includes a full header, which specifies the provenance of the text."

I am having a hard time stripping these tags. I tried regular expressions and Beautiful Soup, but neither is working. Here is a sample text file. The language I want to preserve is Punjabi.

Try the following:

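    import requests
    from bs4 import BeautifulSoup

    # Assuming the file is available at a URL
    html = requests.get('http://www.lancaster.ac.uk/fass/projects/corpus/emille/manual.htm').content

    # Name the parser explicitly so bs4 does not have to guess one
    bsobj = BeautifulSoup(html, 'html.parser')

    # Collect every <p> element and print its text with the tags removed
    textdata = bsobj.find_all('p')
    for item in textdata:
        print(item.get_text())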
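
If the corpus file is already on disk, the "two-byte Unicode" note in the documentation suggests it has to be decoded as UTF-16 before any tag stripping can work. Here is a minimal sketch using only the standard library. Treat the details as assumptions: I am guessing the encoding is UTF-16, that the CES header sits in a cesHeader element, and the names strip_sgml and read_corpus_file are made up for this example.

```python
import re

# Assumption: the CES header is wrapped in <cesHeader>...</cesHeader>.
HEADER_RE = re.compile(r"<cesHeader\b.*?</cesHeader>", re.DOTALL | re.IGNORECASE)
# Matches any remaining tag such as <p>, </p> or <text id="x">.
TAG_RE = re.compile(r"<[^>]+>")


def strip_sgml(text):
    """Drop the header block, remove all other tags, and discard blank lines."""
    text = HEADER_RE.sub("", text)
    text = TAG_RE.sub("", text)
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def read_corpus_file(path, encoding="utf-16"):
    """Decode the file (assumed UTF-16) and return its text without markup."""
    with open(path, encoding=encoding) as handle:
        return strip_sgml(handle.read())
```

Calling read_corpus_file on the path of your sample file should return the Punjabi text with the markup gone. A plain regular expression will stumble on a ">" inside an attribute value, so check the output against a file or two before running it over the whole corpus.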

I hope this is what you are looking for. If it helps, please vote and accept.
