Python Forum

How do I get rid of the HTML tags in my output?
How do I remove HTML tags from the output of the following code? This is what I've tried:

import collections
import itertools
import sys
import csv
import glob
import re

def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

for filepath in glob.glob('**/*.html', recursive=True):
    with open(filepath) as f:
        # Rolling buffer holding up to 10 lines that precede the match
        before = collections.deque(maxlen=10)
        for line in f:
            if 'apple' in line:
                # Gather the context: 10 lines before, the match, 10 lines after
                context = list(before) + [line] + list(itertools.islice(f, 10))
                # deque.append() returns None, so strip tags from the collected text
                print(striphtml(''.join(context)))
                break
            before.append(line)
The printed output still has HTML tags in it. I don't have to use a regex; whatever is easiest is fine.
Use html2text; look at this post.
Also beware that the result will not always look good: people who write HTML rarely have plain-text output in mind.
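If installing html2text is not an option, here is a minimal sketch that uses the standard library's html.parser instead. The names TextExtractor and strip_tags are made up for illustration. It only drops tags and the contents of script/style elements, and it makes no attempt at the layout that html2text produces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities like &amp; arrive decoded
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script/style element
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)
```

In the loop from the question, this could stand in for striphtml, for example print(strip_tags(''.join(context))). Unlike the regex, a real parser also copes with tags that span several lines.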