Python: how to count the number of words in HTML line by line


I want to perform simple tokenization to count the number of words in HTML line by line, except for words inside <a> tags; words inside <a> tags should be counted separately.

Can NLTK do this? Or is there another library that can do this?

For example, given this HTML code:

    <div class="side-article txt-article">
    <p><strong>batam.tribunnews.com, bintan</strong> - tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="polres">polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="bintan">bintan</a>, senin (3/10/2016).</p>
    <p>empat perwira baru senin itu diminta cepat bekerja. tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
    <p>para pejabat tersebut yakni akp adi kuasa tarigan, kasat reskrim baru yang menggantikan akp arya tesa brahmana. arya pindah sebagai kabag ops di <a href="http://batam.tribunnews.com/tag/polres/" title="polres">polres</a> tanjungpinang.</p>

I want the output to be:

    wordscount : 0 linkwordscount : 0
    wordscount : 21 linkwordscount : 2
    wordscount : 19 linkwordscount : 0
    wordscount : 25 linkwordscount : 2

wordscount is the number of words in each line, excluding the text inside <a> tags. If a word appears twice, it counts as two. linkwordscount is the number of words inside <a> tags.

So how can I count line by line, excluding <a> tags, with the words inside <a> tags counted separately?

Thank you.

Iterate over each line of the raw HTML and search for links in each line.

In the example below, I'm using a naive way of getting the word count: splitting the line on whitespace (this way, "-" is counted as a word and batam.tribunnews.com counts as a single word).

    from bs4 import BeautifulSoup

    html = """
    <div class="side-article txt-article">
    <p><strong>batam.tribunnews.com, bintan</strong> - tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="polres">polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="bintan">bintan</a>, senin (3/10/2016).</p>
    <p>empat perwira baru senin itu diminta cepat bekerja. tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
    <p>para pejabat tersebut yakni akp adi kuasa tarigan, kasat reskrim baru yang menggantikan akp arya tesa brahmana. arya pindah sebagai kabag ops di <a href="http://batam.tribunnews.com/tag/polres/" title="polres">polres</a> tanjungpinang.</p>
    """

    # Parse each line of the raw HTML on its own.
    for line in html.strip().split('\n'):
        link_words = 0

        line_soup = BeautifulSoup(line.strip(), 'html.parser')
        # Count the words inside every <a> tag on this line.
        for link in line_soup.find_all('a'):
            link_words += len(link.text.split())

        # Naive word count: split the line's full text on whitespace.
        words_count = len(line_soup.text.split())
        print('wordscount : {0} linkwordscount : {1}'
              .format(words_count, link_words))

Output:

    wordscount : 0 linkwordscount : 0
    wordscount : 16 linkwordscount : 2
    wordscount : 17 linkwordscount : 0
    wordscount : 25 linkwordscount : 1
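Note that the word count above is taken from the full text of each line, so it still includes the words inside links. If the link words should be left out, as the question asks, one option is to track whether the parser is currently inside an <a> element. Here is a rough sketch of that idea using only the standard library's html.parser instead of BeautifulSoup; the names LinkAwareWordCounter and count_line are made up for this illustration, and it assumes each line is reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class LinkAwareWordCounter(HTMLParser):
    """Counts whitespace-separated words inside and outside <a> tags."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while the parser is inside an <a> element
        self.words = 0        # words found outside links
        self.link_words = 0   # words found inside links

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # Same naive tokenization as above: split on whitespace.
        count = len(data.split())
        if self.link_depth > 0:
            self.link_words += count
        else:
            self.words += count


def count_line(line):
    """Return (words outside links, words inside links) for one HTML line."""
    parser = LinkAwareWordCounter()
    parser.feed(line)
    parser.close()  # flush any buffered text
    return parser.words, parser.link_words


if __name__ == "__main__":
    sample = '<p>one two <a href="#">three four</a> five</p>'
    print("wordscount : {0} linkwordscount : {1}".format(*count_line(sample)))
```

One thing to be aware of: each text chunk is split separately, so punctuation that directly follows a link (like the comma after the "bintan" link in the example) ends up as its own token and is counted as a word.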

Edit

If you want to read the HTML from a file, use this:

    with open(path_to_html_file, 'r') as f:
        html = f.read()
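On the naive tokenization mentioned earlier: if tokens made up only of punctuation (such as a lone "-") should not count as words, the split result can be filtered before counting. This is just a sketch of one possible rule, not a full tokenizer; count_words is a hypothetical helper name, and "contains at least one letter or digit" is an assumption about what should count as a word.

```python
def count_words(text):
    """Count whitespace-separated tokens that contain a letter or digit."""
    return sum(
        1
        for token in text.split()
        if any(ch.isalnum() for ch in token)  # skip tokens like "-" or ","
    )


if __name__ == "__main__":
    text = "batam.tribunnews.com, bintan - tradisi pedang pora"
    print(len(text.split()))   # naive count, "-" included
    print(count_words(text))   # "-" skipped
```

With this rule batam.tribunnews.com still counts as a single word, since the token is only split on whitespace.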
