regex - How to strip SGML tags from a text file using Python?
I came across Standard Generalized Markup Language (SGML) lately. I have acquired a corpus in SGML format, the EMILLE/CIIL corpus, which comes with its own documentation.
I want to extract only the text present in the file. According to the corpus documentation, the encoding and markup information is:
"The text is encoded as two-byte Unicode text. The texts are marked up in SGML using level 1 CES-compliant markup. Each file also includes a full header, which specifies the provenance of the text."
I am having a hard time stripping these tags. I tried regular expressions and Beautiful Soup, but neither is working. I have a sample text file from the corpus. The language I want to preserve is Punjabi.
Try the following:

    from bs4 import BeautifulSoup
    import requests

    # Assuming the file is available at a URL
    html = requests.get('http://www.lancaster.ac.uk/fass/projects/corpus/emille/manual.htm').content
    # Name the parser explicitly so the result does not depend on what is installed
    bsobj = BeautifulSoup(html, 'html.parser')
    textdata = bsobj.find_all('p')
    for item in textdata:
        print(item.get_text())
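If the corpus file is on disk rather than behind a URL, here is a minimal sketch using only the standard library. It rests on a few assumptions that I have not checked against the actual files: that "two-byte Unicode" means Python's `utf-16` codec can decode them, that the header is wrapped in a `<cesHeader>` element (the usual CES convention), and that a simple regular expression is enough for level 1 markup. The function names `strip_sgml` and `extract_text` are my own, not part of any library.

```python
import re

# Assumption: the provenance header is wrapped in <cesHeader>...</cesHeader>.
HEADER_RE = re.compile(r"<cesHeader\b.*?</cesHeader>", re.DOTALL | re.IGNORECASE)
# Matches any remaining tag such as <p>, </body> or <text lang="pan">.
TAG_RE = re.compile(r"<[^>]+>")


def strip_sgml(raw: str) -> str:
    """Drop the header block and all tags, keeping only non-empty text lines."""
    without_header = HEADER_RE.sub("", raw)
    text = TAG_RE.sub("", without_header)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def extract_text(path: str, encoding: str = "utf-16") -> str:
    """Read a corpus file (assumed UTF-16) and return its plain text."""
    with open(path, encoding=encoding) as f:
        return strip_sgml(f.read())
```

Decoding the file with the right encoding before stripping is what keeps the Punjabi (Gurmukhi) characters intact. If you write the result out again, pass an explicit encoding, for example `open("out.txt", "w", encoding="utf-8")`. If `utf-16` raises a decode error, the file may lack a byte order mark, in which case `utf-16-le` or `utf-16-be` is worth trying.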
I hope this is what you are looking for. If it helps, please upvote and accept.