Python Forum
Regex Pattern
#1
Hi,

I have two patterns in my code: one to remove the first part of the string, which is just garbage, and a second to remove extra newlines from the string. Neither of them is working. Please help.

import re
import string

Email = '''

<!--
/* Font Definitions */
@font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
	{font-family:"Minion Pro";}
@font-face
	{font-family:inherit;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0cm;
	margin-bottom:.0001pt;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;}
h4
	{mso-style-priority:9;
	mso-style-link:"Heading 4 Char";
	mso-margin-top-alt:auto;
	margin-right:0cm;
	mso-margin-bottom-alt:auto;
	margin-left:0cm;
	font-size:12.0pt;
	font-family:"Calibri",sans-serif;}
h5
	{mso-style-priority:9;
	mso-style-link:"Heading 5 Char";
	mso-margin-top-alt:auto;
	margin-right:0cm;
	mso-margin-bottom-alt:auto;
	margin-left:0cm;
	font-size:10.0pt;
	font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:blue;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:purple;
	text-decoration:underline;}
span.Heading4Char
	{mso-style-name:"Heading 4 Char";
	mso-style-priority:9;
	mso-style-link:"Heading 4";
	font-family:"Calibri",sans-serif;
	font-weight:bold;}
span.Heading5Char
	{mso-style-name:"Heading 5 Char";
	mso-style-priority:9;
	mso-style-link:"Heading 5";
	font-family:"Calibri",sans-serif;
	font-weight:bold;}
p.msonormal0, li.msonormal0, div.msonormal0
	{mso-style-name:msonormal;
	mso-margin-top-alt:auto;
	margin-right:0cm;
	mso-margin-bottom-alt:auto;
	margin-left:0cm;
	font-size:12.0pt;
	font-family:"Times New Roman",serif;}
span.3oh-
	{mso-style-name:_3oh-;}
span.EmailStyle21
	{mso-style-type:personal-reply;
	font-family:"Calibri",sans-serif;
	color:#1F497D;}
.MsoChpDefault
	{mso-style-type:export-only;
	font-size:10.0pt;}
@page WordSection1
	{size:612.0pt 792.0pt;
	margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
	{page:WordSection1;}
-->

<!--
/* Font Definitions */
@font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0cm;
	margin-bottom:.0001pt;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;}
h4
	{mso-style-priority:9;
	mso-style-link:"Heading 4 Char";
	mso-margin-top-alt:auto;
	margin-right:0cm;
	mso-margin-bottom-alt:auto;
	margin-left:0cm;
	font-size:12.0pt;
	font-family:"Calibri",sans-serif;}
h5
	{mso-style-priority:9;
	mso-style-link:"Heading 5 Char";
	mso-margin-top-alt:auto;
	margin-right:0cm;
	mso-margin-bottom-alt:auto;
	margin-left:0cm;
	font-size:10.0pt;
	font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:blue;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:#954F72;
	text-decoration:underline;}
span.Heading4Char
	{mso-style-name:"Heading 4 Char";
	mso-style-priority:9;
	mso-style-link:"Heading 4";
	font-family:"Calibri",sans-serif;
	font-weight:bold;}
span.Heading5Char
	{mso-style-name:"Heading 5 Char";
	mso-style-priority:9;
	mso-style-link:"Heading 5";
	font-family:"Calibri",sans-serif;
	font-weight:bold;}
span.3oh-
	{mso-style-name:_3oh-;}
.MsoChpDefault
	{mso-style-type:export-only;}
@page WordSection1
	{size:612.0pt 792.0pt;
	margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
	{page:WordSection1;}
-->

Hi

Will do thanks

Boss King
Content Publisher 

T. 555-666-777/ M. 555-666-777




From: [email protected];
Sent: Monday, January 28, 2019 12:37:13 PM
To: Team
Subject: RE: Let them know




Hi Iron man,

We can confirm the mission had been cancelled on 25/01/2019
Any contacts made there after should be ignored.

You may notify Me they are persisting

Thank you.


Kind Regards/Vriendelike Groete.






























Iron Man




Client Care Media Manager




Client Care/Kliente Diens









T: #43;27 21915 7556
|
 E: [email protected]













Personal
|
Commercial
|
Specialist
 | Agriculture










From: Almene Barends lt;[email protected];

Sent: 28 January 2019 12:04 PM
To: [email protected];
Subject: Mission



Hi team, please can you assist with this service escalation -client been repeatedly emailed us to cancel her policy
I have requested an email address, do we have another one on file as she is overseas

Thanks







THU 9:48 AM














Good day, I have been emailing for three days to try and stop/cancel my policy as I have immigrated and the house has bee sold.
 And every other department responds except for the one that actually NEEDS to do it! 90421050084








FRI 2:05 PM






Have a nice day


Hi Claire, thank you for bringing this to our attention. 







James Bond
Content Publisher 

T. 555-666-777 / M. 555-666-777



The content of this email is confidential and intended for the addressee only. If it was sent to you in error, please notify the sender immediately and delete the email. 



image003.jpgimage/[email protected]:37:13truefalseimage004.jpgimage/[email protected]:37:13truefalseimage005.jpgimage/[email protected]:37:13truefalse396972019-01-28T10:45:35Z2019-01-28T10:45:47ZfalsefalsedigitalSMTPOneOffMarguerite


'''

pattern = re.compile(r'<!--.*-->')
text = pattern.sub('', Email).strip()
remove_white_space = re.sub(r'\n\n]{1,}', '\n\n', text)
print(remove_white_space)
print(text)
Reply
#2
If you want the dot . to match the newline character, you need to compile the pattern with the re.DOTALL flag.
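
To illustrate the point, here is a minimal sketch. The `sample` string is a made-up stand-in for the comment block in the email, not the real data. It compares the same pattern with and without the flag:

```python
import re

# Hypothetical short sample standing in for the email's <!-- ... --> block.
sample = "<!--\nfont stuff\n-->\nHi"

# Without re.DOTALL, "." does not match "\n", so a comment that spans
# several lines is never matched and the text comes back unchanged.
without_flag = re.sub(r'<!--.*-->', '', sample)

# With re.DOTALL, "." also matches "\n", so the whole comment is removed.
with_flag = re.sub(r'<!--.*-->', '', sample, flags=re.DOTALL)

print(repr(without_flag))
print(repr(with_flag))
```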
Reply
#3
Your newline-removal regex is also wrong.
If you want to remove all extra newlines and keep a single newline (\n):
import re

pattern = re.compile(r'<!--.*-->',  re.DOTALL)
text_temp = pattern.sub('', Email).strip()
text = re.sub(r'\n+', '\n', text_temp).strip()
print(text)
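
As for why the original newline pattern did nothing: in r'\n\n]{1,}' the "]" has no opening "[", so it is treated as a literal character and the pattern only matches two newlines followed by at least one "]". A small sketch with a made-up `sample` string (not the real email) shows the difference, plus an alternative that keeps one blank line between paragraphs if that is what you were after:

```python
import re

# Hypothetical sample with runs of newlines, similar in shape to the email body.
sample = "Hi\n\n\n\nWill do thanks\n\nBoss King"

# Original pattern: two newlines followed by one or more literal "]".
# There is no "]" in the text, so nothing is replaced.
broken = re.sub(r'\n\n]{1,}', '\n\n', sample)

# Collapse every run of newlines into a single newline.
single = re.sub(r'\n+', '\n', sample)

# Alternative: collapse runs of three or more newlines into one blank line.
paragraphs = re.sub(r'\n{3,}', '\n\n', sample)

print(repr(broken))
print(repr(single))
print(repr(paragraphs))
```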
Reply
#4
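import re

# Read the whole email into one string.
with open('email.txt', 'r') as f:
    text = f.read()

# Remove the <!-- ... --> blocks; re.DOTALL lets "." match newlines too.
p = re.compile(r'<!--.*-->',re.DOTALL)
text = p.sub('', text)

# Blank out whitespace-only lines. re.MULTILINE makes ^ and $ match at
# every line, and \s* can also span consecutive newlines.
p = re.compile(r'^\s*$', re.MULTILINE)
text = p.sub('', text)

print(text)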
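
One caveat worth keeping in mind: `.*` is greedy, so with re.DOTALL it runs from the first `<!--` to the last `-->`. In the posted email the two comment blocks are separated only by blank lines, so that is harmless there. If real text could sit between two comment blocks, the non-greedy `.*?` might be the safer choice. A sketch with a made-up `sample` string:

```python
import re

# Hypothetical sample: real text sits between two comment blocks.
sample = "<!-- style 1 -->\nkeep me\n<!-- style 2 -->\nHi"

# Greedy: matches from the first "<!--" to the LAST "-->",
# so "keep me" is removed together with both comments.
greedy = re.sub(r'<!--.*-->', '', sample, flags=re.DOTALL)

# Non-greedy: each match stops at the nearest "-->",
# so only the two comments themselves are removed.
lazy = re.sub(r'<!--.*?-->', '', sample, flags=re.DOTALL)

print(repr(greedy))
print(repr(lazy))
```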
Reply
#5
Thank you so much, it does most of the job. However, I still get this in my output:
Output:
image003.jpgimage/[email protected]:37:13truefalseimage004.jpgimage/[email protected]:37:13truefalseimage005.jpgimage/[email protected]:37:13truefalse396972019-01-28T10:45:35Z2019-01-28T10:45:47ZfalsefalsedigitalSMTPOneOffMarguerite
How do I clean it?

(May-08-2019, 11:50 AM)michalmonday Wrote:
import re

with open('email.txt', 'r') as f:
    text = f.read()
    
p = re.compile(r'<!--.*-->',re.DOTALL)
text = p.sub('', text)

p = re.compile(r'^\s*$', re.MULTILINE)
text = p.sub('', text)

print(text)
Reply
#6
I think it would be necessary to see some more emails with similar lines to filter it out reliably. But assuming that "image003" is actually not a filename:

You could get rid of it in various ways, but each has its drawbacks. Each carries a risk (very small when done right) that some valid text gets cut from the message because it resembles that "jpg_image_big_line".

My suggestion would be to filter it based on:
- the beginning: it starts with image003, so the pattern would require the line to start with "image" followed by 3 digits
- the length of the line and whether it contains spaces: this line is very long and has no spaces, which further reduces the risk of valid text being cut out by this additional regex

import re

with open('email.txt', 'r') as f:
    text = f.read()

patterns = [
    re.compile(r'<!--.*-->',re.DOTALL),
    re.compile(r'^\s*$', re.MULTILINE),
    re.compile(r'^image\d{3}[^\s]{10,}', re.MULTILINE)
    ]

for p in patterns:
    text = p.sub('', text)

print(text)


r'''
Details/description of this line: '^image\d{3}[^\s]{10,}'

^ - beginning of line
image - text itself
\d{3} - 3 digits
[^\s]{10,} - at least 10 chars following "image003" not being whitespace
'''
Edit: I'm a moron, image003.jpg must be a filename... It could be filtered based on other things, but it would be much better to see more examples of emails (just to avoid writing patterns that end up being inefficient)
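
As one possible example of filtering on "other things", here is a rough sketch that ignores the "image" prefix and only uses the second criterion: a line that is very long and contains no whitespace at all. The function name `drop_long_unbroken_lines`, the `min_length` threshold of 80 and the `blob` sample are all made up for illustration, and the threshold would need tuning against more real emails. It has the same drawback as before: a very long URL on its own line would be dropped too.

```python
import re

def drop_long_unbroken_lines(text, min_length=80):
    """Remove lines that are at least min_length chars long and contain no whitespace.

    Hypothetical helper; min_length is an assumed threshold, not a tested value.
    """
    # ^ and $ match at every line because of re.MULTILINE; \S{n,} requires the
    # whole line to be non-whitespace; the optional \n removes the line break too.
    pattern = re.compile(r'^\S{%d,}$\n?' % min_length, re.MULTILINE)
    return pattern.sub('', text)

# Made-up stand-in for the attachment metadata line (no spaces, over 80 chars).
blob = "image003.jpgimage/jpeg" + "truefalse" * 10
sample = "Hi team\n" + blob + "\nThanks"

cleaned = drop_long_unbroken_lines(sample)
print(cleaned)
```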
Reply

