May-13-2019, 01:27 PM
(This post was last modified: May-13-2019, 01:27 PM by michalmonday.)
I think it would be necessary to see some more emails with similar lines to filter it out... But assuming that "image003" is actually not a filename then:
You could get rid of it in various ways, but each would have its drawbacks. Each carries a risk (very small when done right) that some valid text gets cut from the message because it resembles that "jpg_image_big_line".
My suggestion would be to filter it based on:
- the beginning (it starts with image003, so the pattern would have to make sure the line starts with "image" followed by 3 digits)
- the length of the line and whether it contains spaces (this line is very long and has no spaces; checking for that further decreases the risk of some valid text being cut out by this additional regex)
import re

with open('email.txt', 'r') as f:
    text = f.read()

patterns = [
    # HTML comments; non-greedy so separate comments are removed one by one
    re.compile(r'<!--.*?-->', re.DOTALL),
    # lines that are empty or contain only whitespace
    re.compile(r'^\s*$', re.MULTILINE),
    # the long "image003..." line
    re.compile(r'^image\d{3}[^\s]{10,}', re.MULTILINE),
]

for p in patterns:
    text = p.sub('', text)

print(text)

# Details/description of this line:
#   '^image\d{3}[^\s]{10,}'
#
#   ^          - beginning of line
#   image      - the text itself
#   \d{3}      - 3 digits
#   [^\s]{10,} - at least 10 non-whitespace chars following "image003"

Edit: I'm a moron, image003.jpg must be a filename... It could be filtered based on other things, but it would be much better to see more examples of emails (just to avoid writing patterns that end up being ineffective).
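Before running this on real mail, it may be worth checking what the image pattern does and doesn't touch. Below is a minimal sketch of such a check. The sample lines and the strip_image_lines helper are made up for illustration (they are not from the actual email), so treat it as a rough test rather than proof that the pattern is safe:

```python
import re

# Same pattern as in the post: "image", 3 digits, then 10+ non-whitespace chars.
IMAGE_LINE = re.compile(r'^image\d{3}[^\s]{10,}', re.MULTILINE)


def strip_image_lines(text):
    """Blank out the part of any line that looks like the long image blob."""
    return IMAGE_LINE.sub('', text)


# Hypothetical sample text, invented for this check.
sample = "\n".join([
    "Hello,",
    "image003.jpg",                      # short filename: too short to match
    "image003" + "A1b2C3d4E5f6G7h8",     # long run without spaces: should match
    "image of the day: see attachment",  # ordinary sentence: no digits after "image"
])

cleaned = strip_image_lines(sample)
print(cleaned)
```

With this pattern a bare "image003.jpg" line is left alone, because fewer than 10 non-whitespace characters follow "image003". If the real line turns out to be a filename with something long attached to it, the {10,} threshold would probably need adjusting once more example emails are available.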