Python: how to count the number of words in HTML line by line
I want to perform simple tokenization to count the number of words in HTML line by line, excluding the words between <a> tags. The words between <a> tags should be counted separately.
Can NLTK do this? Or is there another library that can?
For example, given this HTML code:
<div class="side-article txt-article">
<p><strong>batam.tribunnews.com, bintan</strong> - tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="polres">polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="bintan">bintan</a>, senin (3/10/2016).</p>
<p>empat perwira baru senin itu diminta cepat bekerja. tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>para pejabat tersebut yakni akp adi kuasa tarigan, kasat reskrim baru yang menggantikan akp arya tesa brahmana. arya pindah sebagai kabag ops di <a href="http://batam.tribunnews.com/tag/polres/" title="polres">polres</a> tanjungpinang.</p>
I want the output to be:
wordscount : 0 linkwordscount : 0
wordscount : 21 linkwordscount : 2
wordscount : 19 linkwordscount : 0
wordscount : 25 linkwordscount : 2
wordscount is the number of words in each line, excluding the text between <a> tags. If a word appears twice, it counts as two. linkwordscount is the number of words between <a> tags.
So how can I count the words line by line, excluding <a> tags, and count the words between <a> tags separately?
Thank you.
Iterate over each line of the raw HTML and search for links in each line.
In the example below, I'm using a naive way of getting the word count: splitting the line on whitespace (this way, "-" is counted as a word, and "batam.tribunnews.com" counts as a single word).
from bs4 import BeautifulSoup

html = """
<div class="side-article txt-article">
<p><strong>batam.tribunnews.com, bintan</strong> - tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="polres">polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="bintan">bintan</a>, senin (3/10/2016).</p>
<p>empat perwira baru senin itu diminta cepat bekerja. tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>para pejabat tersebut yakni akp adi kuasa tarigan, kasat reskrim baru yang menggantikan akp arya tesa brahmana. arya pindah sebagai kabag ops di <a href="http://batam.tribunnews.com/tag/polres/" title="polres">polres</a> tanjungpinang.</p>
"""

# Parse each line of the raw HTML on its own
for line in html.strip().split('\n'):
    link_words = 0
    line_soup = BeautifulSoup(line.strip(), 'html.parser')

    # Count the words inside every <a> tag on this line
    for link in line_soup.find_all('a'):
        link_words += len(link.text.split())

    # Naive way to get the word count: split all text on whitespace
    words_count = len(line_soup.text.split())
    print('wordscount : {0} linkwordscount : {1}'.format(words_count, link_words))
Output:
wordscount : 0 linkwordscount : 0
wordscount : 16 linkwordscount : 2
wordscount : 17 linkwordscount : 0
wordscount : 25 linkwordscount : 1
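Note that the wordscount values in this output still include the words inside <a> tags, while the question asks for them to be excluded. As a rough sketch of one way to keep the two counts separate, the variant below swaps BeautifulSoup for the standard library's html.parser and tracks whether the parser is currently inside an <a> element. The names LineWordCounter, count_words_per_line, WORD_RE, and the sample string are hypothetical, introduced only for illustration. It also assumes a slightly different definition of a word than the answer uses: a whitespace-delimited token that contains at least one letter, digit, or underscore, so a bare "-" or "," is not counted.

```python
import re
from html.parser import HTMLParser

# Assumption: a "word" is a whitespace-delimited token containing at least
# one letter, digit, or underscore, so a bare "-" or "," is not counted.
WORD_RE = re.compile(r"\S*\w\S*")


class LineWordCounter(HTMLParser):
    """Counts words inside and outside <a> elements for one HTML fragment."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # > 0 while the parser is inside an <a> element
        self.words = 0         # words outside <a> elements
        self.link_words = 0    # words inside <a> elements

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        count = len(WORD_RE.findall(data))
        if self.anchor_depth > 0:
            self.link_words += count
        else:
            self.words += count


def count_words_per_line(html):
    """Return a list of (words, link_words) tuples, one per line of html."""
    results = []
    for line in html.strip().split("\n"):
        parser = LineWordCounter()  # fresh state for every line
        parser.feed(line)
        parser.close()  # flush any text still buffered by the parser
        results.append((parser.words, parser.link_words))
    return results


# Hypothetical sample input, not the article from the question
sample = (
    '<div class="box">\n'
    '<p>one - two <a href="#">three four</a>, five</p>\n'
    '<p>six seven.</p>'
)
for words, link_words in count_words_per_line(sample):
    print('wordscount : {0} linkwordscount : {1}'.format(words, link_words))
```

Because each line gets a fresh parser, an <a> element that opens on one line and closes on another is not tracked across the line break. Text is also counted chunk by chunk, so a word that is split by a tag boundary (for example, half inside <strong> and half outside) is counted as two words.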
Edit:
If you want to read the HTML from a file, use this:
with open(path_to_html_file, 'r') as f:
    html = f.read()