regex - How to strip SGML tags from a text file using Python?


I came across Standard Generalized Markup Language (SGML) lately. I have acquired a corpus in SGML format, the EMILLE/CIIL corpus. The documentation of the corpus:

EMILLE corpus documentation

I want to extract only the text present in the file. The encoding and markup information from the corpus documentation is:

The text is encoded as two-byte Unicode text (the documentation links to more information on Unicode). The texts are marked up in SGML using level 1 CES-compliant markup. Each file includes a full header, which specifies the provenance of the text.

I am having a hard time stripping these tags. I tried regular expressions and Beautiful Soup, but neither is working. Here is a sample text file. The language I want to preserve is Punjabi.

Sample text file

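Try the following:

    from bs4 import BeautifulSoup
    import requests

    # Assuming this URL points to the file
    html = requests.get('http://www.lancaster.ac.uk/fass/projects/corpus/emille/manual.htm').content

    # Name the parser explicitly so bs4 does not have to guess one
    bs_obj = BeautifulSoup(html, 'html.parser')

    # Collect every <p> element and print its text
    text_data = bs_obj.find_all('p')

    for item in text_data:
        print(item.get_text())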

I hope this is what you are looking for. If it helps, please vote and accept.
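If the corpus files are stored locally rather than behind a URL, a standard-library approach may be enough. The following is only a minimal sketch under several assumptions: that "two-byte Unicode" means Python's `utf-16` codec can decode the file, that the running text sits inside a `<body>` element as is usual for CES markup, and that a regular expression is acceptable for level 1 tags. The names `read_corpus_file` and `strip_sgml` are hypothetical helpers, not part of any library.

```python
import html
import re

# Matches SGML comments, plus anything else between '<' and '>'
# (start tags, end tags, and simple declarations).
TAG_RE = re.compile(r"<!--.*?-->|<[^>]+>", re.DOTALL)


def read_corpus_file(path, encoding="utf-16"):
    """Read a corpus file as text. The utf-16 default is an assumption."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


def strip_sgml(text, body_tag="body"):
    """Return the plain text of an SGML document, one line per source line."""
    # Assumption: the header should be dropped, so keep only the body
    # element when one is present; otherwise process the whole document.
    pattern = r"<{0}\b[^>]*>(.*?)</{0}>".format(body_tag)
    match = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
    if match:
        text = match.group(1)

    # Replace every tag with a space so adjacent words do not fuse.
    text = TAG_RE.sub(" ", text)

    # Decode character references such as &amp;. SGML-specific entities
    # that Python's html module does not know are left unchanged.
    text = html.unescape(text)

    # Collapse runs of whitespace and drop lines that are now empty.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

For a file saved as, say, `sample.txt` (a placeholder name), `print(strip_sgml(read_corpus_file("sample.txt")))` would print the text without markup. Punjabi characters pass through untouched because the regular expression only removes what lies between angle brackets. If decoding fails, the file may use a different byte order or encoding, and the `encoding` argument would need adjusting.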

