Python Forum
How do I get rid of the HTML tags in my output? - Printable Version

How do I get rid of the HTML tags in my output? - glittergirl - Aug-05-2019

How do I remove HTML tags from the output of the following code? This is what I've tried:

import collections
import itertools
import sys
import csv
import glob
import re

def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

for filepath in glob.glob('**/*.html', recursive=True):
    with open(filepath) as f:
        before = collections.deque(maxlen=10)
        for line in f:
            if 'apple' in line:
                sys.stdout.writelines(before)
                sys.stdout.write(line)
                sys.stdout.writelines(itertools.islice(f, 10))
                break
            results = before.append(line)
blah = striphtml(results)
print(blah)

The printed output still has HTML tags in it. I don't have to use regex; whatever is easiest would be fine.


RE: How do I get rid of the HTML tags in my output? - snippsat - Aug-05-2019

Use html2text; look at this post.
Also be aware that the result will not always look good; people who write HTML rarely have plain-text output in mind.
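
For reference, here is a minimal sketch of one way the original loop could be made to strip tags. It is not the html2text approach recommended above: it swaps in the standard library's html.parser so that nothing extra needs to be installed. The names TextExtractor, strip_tags and context_around are hypothetical helpers made up for this illustration. It also sidesteps a likely problem in the question's code: deque.append() returns None, so `results` never holds any text to strip.

```python
import collections
import itertools
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of `html` with the tags dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def context_around(lines, needle, width=10):
    """Return the first line containing `needle`, plus up to `width`
    lines before and after it, joined into one string.
    Returns an empty string when nothing matches."""
    lines = iter(lines)
    before = collections.deque(maxlen=width)
    for line in lines:
        if needle in line:
            after = itertools.islice(lines, width)
            return "".join([*before, line, *after])
        # deque.append() returns None, so its result is not assigned.
        before.append(line)
    return ""
```

Inside the question's `with open(filepath) as f:` block, this would be used as `print(strip_tags(context_around(f, 'apple')))`. Note that this simple extractor also keeps the contents of script and style elements, so pages that contain those may still need html2text or extra filtering.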