Python Forum
Deleting characters between certain characters
#1
Hi,

I have this file here:

Output:
<html><head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="Generator" content="Microsoft Word 15 (filtered medium)"><style> <!-- @font-face {font-family:Wingdings} @font-face {font-family:"Cambria Math"} @font-face {font-family:Calibri} @font-face {font-family:"Century Gothic"} p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0cm; font-size:11.0pt; font-family:"Calibri",sans-serif} a:link, span.MsoHyperlink {color:blue; text-decoration:underline} p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph {margin-top:0cm; margin-right:0cm; margin-bottom:0cm; margin-left:36.0pt; font-size:11.0pt; font-family:"Calibri",sans-serif} span.rsnormal {} span.EmailStyle21 {font-family:"Calibri",sans-serif; color:windowtext} .MsoChpDefault {font-size:10.0pt} @page WordSection1 {margin:72.0pt 72.0pt 72.0pt 72.0pt} div.WordSection1 {} ol {margin-bottom:0cm} ul {margin-bottom:0cm} -->
I want to delete everything between "<!--" and "-->", including those characters themselves.

Please help.
Reply
#2
Could you not simply create a 'new_file', from the 'file' by using string slicing?

# simulate a file read
file = '''<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="Generator" content="Microsoft Word 15 (filtered medium)"><style>
<!--
@font-face
    {font-family:Wingdings}
@font-face
    {font-family:"Cambria Math"}
@font-face
    {font-family:Calibri}
@font-face
    {font-family:"Century Gothic"}
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0cm;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif}
a:link, span.MsoHyperlink
    {color:blue;
    text-decoration:underline}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
    {margin-top:0cm;
    margin-right:0cm;
    margin-bottom:0cm;
    margin-left:36.0pt;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif}
span.rsnormal
    {}
span.EmailStyle21
    {font-family:"Calibri",sans-serif;
    color:windowtext}
.MsoChpDefault
    {font-size:10.0pt}
@page WordSection1
    {margin:72.0pt 72.0pt 72.0pt 72.0pt}
div.WordSection1
    {}
ol
    {margin-bottom:0cm}
ul
    {margin-bottom:0cm}
-->more stuff
that could be here'''

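first = file.find('<!--')
last = file.find('-->', first + 4)  # look for the end marker after the start

# find() returns -1 when a marker is missing; leave the text unchanged then
if first == -1 or last == -1:
    new_file = file
else:
    new_file = file[:first] + file[last + 3:]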

# this is what you would write to a new file
print(new_file)
Output:
<html><head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="Generator" content="Microsoft Word 15 (filtered medium)"><style> more stuff that could be here
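If the file could hold more than one comment block, or none at all, the same slicing idea can be put in a loop. This is only a minimal sketch, assuming the whole file fits in memory; the function name strip_comments and the sample string are made up for illustration.

```python
def strip_comments(text, start='<!--', end='-->'):
    """Return text with every start...end block removed, markers included."""
    pieces = []
    pos = 0
    while True:
        first = text.find(start, pos)
        if first == -1:
            break
        last = text.find(end, first + len(start))
        if last == -1:
            # unterminated block: leave the rest of the text untouched
            break
        pieces.append(text[pos:first])
        pos = last + len(end)
    pieces.append(text[pos:])
    return ''.join(pieces)


sample = 'a<!-- one -->b<!-- two\nlines -->c'
print(strip_comments(sample))  # abc
```

Collecting the kept pieces in a list and joining once avoids rebuilding the string on every pass.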
Sig:
>>> import this

The UNIX philosophy: "Do one thing, and do it well."

"The danger of computers becoming like humans is not as great as the danger of humans becoming like computers." :~ Konrad Zuse

"Everything should be made as simple as possible, but not simpler." :~ Albert Einstein
Reply
#3
Thank you so much. Essentially that's what I need, although it's just the beginning: I need to get rid of all the HTML tags and keep only the normal text. One piece of normal text in that file would be "Any feedback on the email below".

Here's a complete file:

Quote:<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="Generator" content="Microsoft Word 15 (filtered medium)"><style>
<!--
@font-face
{font-family:Wingdings}
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
@font-face
{font-family:"Century Gothic"}
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{margin-top:0cm;
margin-right:0cm;
margin-bottom:0cm;
margin-left:36.0pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
span.rsnormal
{}
span.EmailStyle21
{font-family:"Calibri",sans-serif;
color:windowtext}
.MsoChpDefault
{font-size:10.0pt}
@page WordSection1
{margin:72.0pt 72.0pt 72.0pt 72.0pt}
div.WordSection1
{}
ol
{margin-bottom:0cm}
ul
{margin-bottom:0cm}
-->
</style><meta name="viewport" content="width=device-width, initial-scale=1"></head><body lang="EN-ZA" link="blue" vlink="purple" style="word-wrap:break-word"><div class="layout banner"><table bgcolor="" width="100%" border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="max-width:100%; height:auto"><a href="https://prise.mimecast.com/s/gRluCnZmEoc7Noo4gS9huai?domain=sef.co.za/" data-href=""><img src="cid:[email protected]" class="rocketseed-strip banner" border="0" id="img62308" style="max-width:100%; height:auto"></a></td></tr></tbody></table><br><div class="WordSection1"><p class="MsoNormal"><span style="">&nbsp;</span></p><p class="MsoNormal"><span style="">Good day</span></p><p class="MsoNormal"><span style="">&nbsp;</span></p><p class="MsoNormal"><span style="">Any feedback on the email below.</span></p><p class="MsoNormal"><span style="">&nbsp;</span></p><p class="MsoNormal"><span style="">Thanks</span></p><p class="MsoNormal"><span style="">&nbsp;</span></p><p class="MsoNormal"><span style="">&nbsp;</span></p><div><div style="border:none; padding:3.0pt 0cm 0cm 0cm"><table border="0" cellspacing="0" cellpadding="0"><tbody><tr><td align="left"><table width="500" border="0" cellspacing="0" cellpadding="0" style="font-family:Century Gothic,Arial,Helvetica,sans-serif"><tbody><tr><td style="color:#344355; font-size:18px; font-weight:bold; padding-left:5px"><span class="RSNormal RSName">Talita</span> <span class="RSNormal RSName">Bothma</span><sup><span class="RSNormal"></span></sup></td></tr><tr><td><table border="0" cellspacing="0" cellpadding="0" style="font-family:Century Gothic,Arial,Helvetica,sans-serif"><tbody><tr><td style="color:#5e5c5a; font-size:12px; padding-left:5px"><span><span class="RSNormal">Personal Assistant</span></span></td></tr></tbody></table></td></tr></tbody></table></td></tr><tr><td colspan="2"></td></tr><tr><td valign="top"><table border="0" cellspacing="0" cellpadding="0" style="font-family:Century 
Gothic,Arial,Helvetica,sans-serif; color:#5e5c5a"><tbody><tr></tr><tr><td style="font-size:12px; padding-left:5px"><span style="color:#36435">e:</span> <a href="mailto:[email protected]" style="color:#5e5c5a; text-decoration:none"><span class="RSNormal">[email protected]</span></a></td></tr><tr><td style="font-size:12px; padding-left:5px"><table border="0" cellspacing="0" cellpadding="0" style="font-family:Century Gothic,Arial,Helvetica,sans-serif; color:#5e5c5a"><tbody><tr><td style="font-size:12px"><span style="color:#364356">t: </span><span class="RSNormal">022&nbsp;482 2743</span>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</td><td style="font-size:12px"><span style="color:#364356">f: </span><span class="RSNormal">022 487 5270</span>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</td><td style="font-size:12px"><span class="RSNormal"><a href="https://prise.mimecast.com/s/wDDSCvgxPzT7z66JqS5Xbs3?domain=branding.sef.co.za" data-rel="https://branding.sef.co.za/rs/c00ovgGW" data-name="VCard" class="tracker" style="font-size:10px; text-decoration:none; color:#364356">VCard</a></span></td></tr></tbody></table></td></tr><tr></tr><tr><td style="font-size:12px; padding-left:5px"><span class="RSNormal">29 Hof Street, Malmesbury, 7299</span> - <span class="RSNormal"><a href="https://prise.mimecast.com/s/z1ouCwjyQAhG3115kHxVgvv?domain=branding.sef.co.za" data-rel="https://branding.sef.co.za/rs/c00u_zW_" data-name="Maps" class="tracker" style="font-family:Century Gothic,Arial,Helvetica,sans-serif; font-size:10px; color:#5e5c5a; text-decoration:none">View Map</a></span></td></tr><tr></tr><tr><td style="font-size:12px; padding-top:10px; padding-left:5px">Efficient Financial Services (Pty) Ltd, is an authorised financial services provider, FSP 655</td></tr><tr><td style="font-size:12px; padding-top:10px; padding-left:5px"><span class="RSNormal"><a href="https://prise.mimecast.com/s/emiWCxGzRBi1oAApDfgDWoc?domain=branding.sef.co.za" data-rel="https://branding.sef.co.za/rs/c022f8Lh" 
data-name="Disc" class="tracker" style="font-size:10px; text-decoration:none; color:#344355">View our Disclaimer</a></span>&nbsp;|&nbsp;<span class="RSNormal"><a href="https://prise.mimecast.com/s/0FZECy8AVDur5MMGKuxyBAD?domain=documents.efgroup.co.za" class="tracker" readonly="readonly" style="font-size:10px; text-decoration:none; color:#344355">Privacy Policy</a></span></td></tr></tbody></table></td></tr><tr><td colspan="2"><img src="cid:[email protected]" class="rocketseed-strip banner" id="img44491" usemap="#imgmap20191118838" border="0" style="max-width:100%; height:auto"> <map id="imgmap20191118838" name="imgmap20191118838"><area shape="rect" alt="Website" title="Website" coords="6,4,33,33" href="https://prise.mimecast.com/s/gRluCnZmEoc7Noo4gS9huai?domain=sef.co.za/" target=""><area shape="rect" alt="Instagram" title="Instagram" coords="42,4,69,33" href="https://prise.mimecast.com/s/xkKbCzm4WEuMo22XEH7E0Gw?domain=instagram.com" target=""><area shape="rect" alt="Facebook" title="Facebook" coords="76,5,104,33" href="https://prise.mimecast.com/s/EwB3CAnXv2cNAWWxBtws3sE?domain=ent.com" target=""></map></td></tr></tbody></table><table id="rs-bottom-banner" width="100%" border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="max-width:100%; height:auto"></td></tr></tbody></table><p class="MsoNormal"><b><span lang="EN-US">From:</span></b><span lang="EN-US"> Frank Slier &lt;[email protected]&gt; <br><b>Sent:</b> Friday, June 16, 2023 11:11 AM<br><b>To:</b> CPT Commercial (Outbound) &lt;[email protected]&gt;<br><b>Cc:</b> Horse Brown &lt;[email protected]&gt;<br><b>Subject:</b> 63120368337 - MBD JOINERY(PTY)LTD</span></p></div></div><p class="MsoNormal">&nbsp;</p><div><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" style="width:100.0%"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm"><p class="MsoNormal"><a href="https://prise.mimecast.com/s/gRluCnZmEoc7Noo4gS9huai?domain=sef.co.za/"><span 
style="text-decoration:none">&nbsp;</span></a></p></td></tr></tbody></table><p class="MsoNormal">&nbsp;</p><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">Voeg asb by vanaf vandag :</span></p><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">&nbsp;</span></p><ul type="disc" style="margin-top:0cm"><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">HUISEIENAARS</span></li></ul><ul type="disc" style="margin-top:0cm"><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">WOONHUIS MET LOSSTAANDE WOONSTEL</span></li></ul><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">SPURWING STRAAT 3</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">HEMEL &amp; AARDE ESTATE</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">HERMANUS</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">7200</span></p><ul type="disc" style="margin-top:0cm"><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">STANDAARD DAK/MURE</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">PERMANENT BEWOON DEUR EIENAAR</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">BEDAGS ONBEWOON</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">NIE MEER AS 60 DAE ONBEWOON PER JAAR</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">NIE NADER AS 100 METER VAN SEE,RIVIER,DAM</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">SEKERUTEIT ESTATE / 24 UUR PATROLLIE MET HONDE EN VOERTUIE/GEELEKTRIFISEERDE HEINING/24 UUR WAGTE BY HEK</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" 
style="font-size:12.0pt">VERSEKERDE BEDRAG – R 4&nbsp;500&nbsp;000</span></li></ul><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">&nbsp;</span></p><ul type="disc" style="margin-top:0cm"><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">HUISINHOUD/INGESLUIT LOSSTAANDE WOONSTEL</span></li></ul><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">SPURWING STRAAT 3</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">HEMEL &amp; AARDE ESTATE</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">HERMANUS</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">7200</span></p><ul type="disc" style="margin-top:0cm"><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">SOOS BO GENOEM/SEKERUTEIT</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">VERSEKERDE BEDRAG --- R 750&nbsp;000</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">TOEVALLIGGE SKADE/MEGANIESE/ELEKTRIESE ONKLAARRAKING/KRAGSTUWING ----- R 15&nbsp;000 ELK</span></li></ul><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">&nbsp;</span></p><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">&nbsp;</span></p><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">Groete</span></p><p class="MsoNormal" style="margin-left:18.0pt"><span lang="NL" style="font-size:12.0pt">&nbsp;</span></p><p class="MsoNormal">&nbsp;</p><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm; max-width:100%"><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm"><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="500" style="width:375.0pt"><tbody><tr><td 
style="padding:0cm 0cm 0cm 3.75pt"><p class="MsoNormal"><span class="rsnormal"><b><span style="font-size:13.5pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#19325C">Frank</span></b></span><b><span style="font-size:13.5pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#19325C"> <span class="rsnormal">Slier</span></span></b></p></td></tr><tr><td style="padding:0cm 0cm 0cm 0cm"><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:0cm 0cm 0cm 3.75pt"><p class="MsoNormal"><span class="rsnormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">Registered Financial Advisor</span></span><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A"></span></p></td></tr></tbody></table></td></tr></tbody></table></td></tr><tr><td colspan="2" style="padding:0cm 0cm 0cm 0cm"></td></tr><tr><td valign="top" style="padding:0cm 0cm 0cm 0cm"><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:0cm 0cm 0cm 3.75pt"><p class="MsoNormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">e: <a href="mailto:[email protected]"><span class="rsnormal"><span style="color:#5E5C5A; text-decoration:none">[email protected]</span></span></a></span></p></td></tr><tr><td style="padding:0cm 0cm 0cm 3.75pt"><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm"><p class="MsoNormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#1A7D88">t: </span><span class="rsnormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">022&nbsp;482 2743</span></span><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</span></p></td><td 
style="padding:0cm 0cm 0cm 0cm"><p class="MsoNormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#1A7D88">c: </span><span class="rsnormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">066 345 5450</span></span><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</span></p></td><td style="padding:0cm 0cm 0cm 0cm"><p class="MsoNormal"><span class="rsnormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A"><a href="https://prise.mimecast.com/s/7X5UCoYnG0hr3zzA5uzB3zz?domain=branding.sef.co.za"><span style="font-size:7.5pt; color:#364356; text-decoration:none">VCard</span></a></span></span><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A"></span></p></td></tr></tbody></table></td></tr><tr><td style="padding:0cm 0cm 0cm 3.75pt"><p class="MsoNormal"><span class="rsnormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">44 Crest Road, Pearly Beach, 7220</span></span><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A"> - <span class="rsnormal"><a href="https://prise.mimecast.com/s/-LKNCpgoJqTn388gqFYx5x7?domain=branding.sef.co.za"><span style="font-size:7.5pt; color:#5E5C5A; text-decoration:none">View Map</span></a></span></span></p></td></tr><tr><td style="padding:0cm 0cm 0cm 0cm"></td></tr><tr><td style="padding:7.5pt 0cm 0cm 3.75pt"><p class="MsoNormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">Efficient Financial Services (Pty) Ltd, is an authorised financial services provider, FSP 655</span></p></td></tr><tr><td style="padding:7.5pt 0cm 0cm 3.75pt"><p class="MsoNormal"><span class="rsnormal"><span style="font-size:7.5pt; font-family:&quot;Century 
Gothic&quot;,sans-serif; color:#344355"><a href="https://prise.mimecast.com/s/FAJQCqjpKrh8j55gnUE2XEw?domain=efgroup.co.za/">View our Disclaimer</a></span></span><span style="font-size:7.5pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#344355">&nbsp;|&nbsp;<span class="rsnormal"><a href="https://prise.mimecast.com/s/U_JzCr0qLvc8lvvy7UjTrg-?domain=branding.sef.co.za"><span style="color:#344355; text-decoration:none">Privacy Policy</span></a></span></span></p></td></tr></tbody></table></td><td style="padding:0cm 0cm 0cm 0cm"></td></tr><tr><td colspan="2" style="padding:0cm 0cm 0cm 0cm"><p class="MsoNormal">&nbsp;</p></td></tr></tbody></table></td></tr></tbody></table><p class="MsoNormal"><span style="display:none">&nbsp;</span></p><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" id="rs-bottom-banner" style="width:100.0%"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm; max-width:100%"></td></tr></tbody></table><p class="MsoNormal"><span style="display:none">&nbsp;</span></p><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" id="rs-disclaimer" style="width:100.0%"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm; max-width:100%"></td></tr></tbody></table><p class="MsoNormal">&nbsp;</p></div></div><br><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="max-width:100%; height:auto"></td></tr></tbody></table><table id="rs-disclaimer" width="100%" border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="max-width:100%; height:auto"><br></td></tr></tbody></table></div></body></html>
Reply
#4
So, why can't that file be parsed, as the HTML file that it is and then 'scraped'?
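As a rough sketch of what parsing could look like, the standard library's html.parser can collect the text and skip the <style> block. The names TextExtractor and html_to_text and the sample string are hypothetical, and this assumes you only want the visible text, one piece per line.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <style> and <script> content."""

    SKIP = {'style', 'script'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # '\xa0' is what &nbsp; decodes to; treat it as an ordinary space
        text = data.replace('\xa0', ' ').strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.parts)


sample = ('<html><head><style><!-- p {margin:0cm} --></style></head>'
          '<body><p><span>Good day</span></p>'
          '<p><span>&nbsp;</span></p>'
          '<p><span>Any feedback on the email below.</span></p></body></html>')
print(html_to_text(sample))
```

A third-party parser would probably cope better with messy real-world mail, but the idea is the same: let a parser decide what is a tag and what is text.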
Reply
#5
import re

# Read in the file
with open("filename.html", "r") as f:
    content = f.read()

# Remove everything between "<!--" and "-->", including the characters themselves
pattern = re.compile("<!--(.*?)-->", re.DOTALL)
content = re.sub(pattern, "", content)

# Write the modified content back to the file
with open("filename.html", "w") as f:
    f.write(content)
This code reads in the HTML file, removes everything between "<!--" and "-->", and then writes the modified content back to the file. The regular expression <!--(.*?)--> matches any text between "<!--" and "-->", and the re.DOTALL flag tells Python to match newlines as well.
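To see what the "?" and re.DOTALL each contribute, here is a small sketch on a made-up string (the sample text and variable names are only for illustration):

```python
import re

sample = "keep1<!-- first\ncomment -->keep2<!-- second -->keep3"

# non-greedy with DOTALL: each comment is removed on its own
lazy = re.sub(r"<!--.*?-->", "", sample, flags=re.DOTALL)

# greedy with DOTALL: one match runs from the first "<!--" to the last "-->"
greedy = re.sub(r"<!--.*-->", "", sample, flags=re.DOTALL)

# without DOTALL, "." stops at the newline, so the first comment survives
no_dotall = re.sub(r"<!--.*?-->", "", sample)

print(lazy)             # keep1keep2keep3
print(greedy)           # keep1keep3
print(repr(no_dotall))  # 'keep1<!-- first\ncomment -->keep2keep3'
```

With a single comment block, as in the file above, greedy and non-greedy give the same result; the difference only shows once there are two or more blocks.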
Gribouillis wrote Jun-29-2023, 08:10 AM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.

Promotion link removed, please read What to NOT include in a post
buran wrote Jun-29-2023, 08:08 AM:
Please use proper tags when posting code, tracebacks, output, etc. This time I have added tags for you.
See the BBCode help for more info.
Reply
#6
(Jun-29-2023, 08:02 AM)createcarved Wrote:
import re

# Read in the file
with open("filename.html", "r") as f:
    content = f.read()

# Remove everything between "<!--" and "-->", including the characters themselves
pattern = re.compile("<!--(.*?)-->", re.DOTALL)
content = re.sub(pattern, "", content)

# Write the modified content back to the file
with open("filename.html", "w") as f:
    f.write(content)
This code reads in the HTML file, removes everything between "<!--" and "-->", and then writes the modified content back to the file. The regular expression <!--(.*?)--> matches any text between "<!--" and "-->", and the re.DOTALL flag tells Python to match newlines as well.

Thank you, this is moving in the right direction. However, after writing to the file again, this is what my file looks like:
Quote:<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="Generator" content="Microsoft Word 15 (filtered medium)"><style>

</style><meta name="viewport" content="width=device-width, initial-scale=1"></head><body lang="EN-ZA" link="blue" vlink="purple" style="word-wrap:break-word"><div class="layout banner"><table bgcolor="" width="100%" border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="max-width:100%; height:auto"><a href="https://prise.mimecast.com/s/gRluCnZmEoc7Noo4gS9huai?domain=sef.co.za/" data-href=""><img src="cid:[email protected]" class="rocketseed-strip banner" border="0" id="img62308" style="max-width:100%; height:auto"></a></td></tr></tbody></table><br><div class="WordSection1"><p class="MsoNormal"><span style="">&nbsp;</span></p><p class="MsoNormal"><span style="">Good day</span></p><p class="MsoNormal"><span style="">&nbsp;</span></p><p class="MsoNormal"><span style="">Any feedback on the email below.</span></p><p class="MsoNormal"><span style="">&nbsp;</span></p><p class="MsoNormal"><span style="">Thanks</span></p><p class="MsoNormal"><span style="">&nbsp;</span></p><p class="MsoNormal"><span style="">&nbsp;</span></p><div><div style="border:none; padding:3.0pt 0cm 0cm 0cm"><table border="0" cellspacing="0" cellpadding="0"><tbody><tr><td align="left"><table width="500" border="0" cellspacing="0" cellpadding="0" style="font-family:Century Gothic,Arial,Helvetica,sans-serif"><tbody><tr><td style="color:#344355; font-size:18px; font-weight:bold; padding-left:5px"><span class="RSNormal RSName">Horse</span> <span class="RSNormal RSName">Brown</span><sup><span class="RSNormal"></span></sup></td></tr><tr><td><table border="0" cellspacing="0" cellpadding="0" style="font-family:Century Gothic,Arial,Helvetica,sans-serif"><tbody><tr><td style="color:#5e5c5a; font-size:12px; padding-left:5px"><span><span class="RSNormal">Personal Assistant</span></span></td></tr></tbody></table></td></tr></tbody></table></td></tr><tr><td colspan="2"></td></tr><tr><td valign="top"><table border="0" cellspacing="0" cellpadding="0" style="font-family:Century 
Gothic,Arial,Helvetica,sans-serif; color:#5e5c5a"><tbody><tr></tr><tr><td style="font-size:12px; padding-left:5px"><span style="color:#36435">e:</span> <a href="mailto:[email protected]" style="color:#5e5c5a; text-decoration:none"><span class="RSNormal">[email protected]</span></a></td></tr><tr><td style="font-size:12px; padding-left:5px"><table border="0" cellspacing="0" cellpadding="0" style="font-family:Century Gothic,Arial,Helvetica,sans-serif; color:#5e5c5a"><tbody><tr><td style="font-size:12px"><span style="color:#364356">t: </span><span class="RSNormal">000&nbsp;000 0000</span>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</td><td style="font-size:12px"><span style="color:#364356">f: </span><span class="RSNormal">000 000 0000</span>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</td><td style="font-size:12px"><span class="RSNormal"><a href="https://prise.mimecast.com/s/wDDSCvgxPzT7z66JqS5Xbs3?domain=branding.sef.co.za" data-rel="https://branding.sef.co.za/rs/c00ovgGW" data-name="VCard" class="tracker" style="font-size:10px; text-decoration:none; color:#364356">VCard</a></span></td></tr></tbody></table></td></tr><tr></tr><tr><td style="font-size:12px; padding-left:5px"><span class="RSNormal">00 My Street, Place, 0000</span> - <span class="RSNormal"><a href="https://prise.mimecast.com/s/z1ouCwjyQAhG3115kHxVgvv?domain=branding.sef.co.za" data-rel="https://branding.sef.co.za/rs/c00u_zW_" data-name="Maps" class="tracker" style="font-family:Century Gothic,Arial,Helvetica,sans-serif; font-size:10px; color:#5e5c5a; text-decoration:none">View Map</a></span></td></tr><tr></tr><tr><td style="font-size:12px; padding-top:10px; padding-left:5px"></td></tr><tr><td style="font-size:12px; padding-top:10px; padding-left:5px"><span class="RSNormal"><a href="https://prise.mimecast.com/s/emiWCxGzRBi1oAApDfgDWoc?domain=branding.sef.co.za" data-rel="https://branding.sef.co.za/rs/c022f8Lh" data-name="Disc" class="tracker" style="font-size:10px; text-decoration:none; color:#344355">View our 
Disclaimer</a></span>&nbsp;|&nbsp;<span class="RSNormal"><a href="https://prise.mimecast.com/s/0FZECy8AVDur5MMGKuxyBAD?domain=documents.efgroup.co.za" class="tracker" readonly="readonly" style="font-size:10px; text-decoration:none; color:#344355"></a></span></td></tr></tbody></table></td></tr><tr><td colspan="2"><img src="cid:[email protected]" class="rocketseed-strip banner" id="img44491" usemap="#imgmap20191118838" border="0" style="max-width:100%; height:auto"> <map id="imgmap20191118838" name="imgmap20191118838"><area shape="rect" alt="Website" title="Website" coords="6,4,33,33" href="https://prise.mimecast.com/s/gRluCnZmEoc7Noo4gS9huai?domain=sef.co.za/" target=""><area shape="rect" alt="Instagram" title="Instagram" coords="42,4,69,33" href="https://prise.mimecast.com/s/xkKbCzm4WEuMo22XEH7E0Gw?domain=instagram.com" target=""><area shape="rect" alt="Facebook" title="Facebook" coords="76,5,104,33" href="https://prise.mimecast.com/s/EwB3CAnXv2cNAWWxBtws3sE?domain=ent.com" target=""></map></td></tr></tbody></table><table id="rs-bottom-banner" width="100%" border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="max-width:100%; height:auto"></td></tr></tbody></table><p class="MsoNormal"><b><span lang="EN-US">From:</span></b><span lang="EN-US"> New User &lt;[email protected]&gt; <br><b>Sent:</b> Friday, June 16, 2023 11:11 AM<br><b>To:</b> TTT Commercial (Outbound) &lt;[email protected]&gt;<br><b>Cc:</b> Horse Brown &lt;[email protected]&gt;<br><b>Subject:</b> 00000 - MY COMPANY(PTY)LTD</span></p></div></div><p class="MsoNormal">&nbsp;</p><div><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" style="width:100.0%"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm"><p class="MsoNormal"><a href="https://prise.mimecast.com/s/gRluCnZmEoc7Noo4gS9huai?domain=sef.co.za/"><span style="text-decoration:none">&nbsp;</span></a></p></td></tr></tbody></table><p class="MsoNormal">&nbsp;</p><p class="MsoNormal"><span lang="NL" 
style="font-size:12.0pt">Voeg asb by vanaf vandag :</span></p><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">&nbsp;</span></p><ul type="disc" style="margin-top:0cm"><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">HUISEIENAARS</span></li></ul><ul type="disc" style="margin-top:0cm"><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">WOONHUIS MET LOSSTAANDE WOONSTEL</span></li></ul><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">NUWE STRAAT 3</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">HEL &amp; NEW ESTATE</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">PLEEK</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">0000</span></p><ul type="disc" style="margin-top:0cm"><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">STANDAARD DAK/MURE</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">PERMANENT BEWOON DEUR EIENAAR</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">BEDAGS ONBEWOON</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">NIE MEER AS 60 DAE ONBEWOON PER JAAR</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">NIE NADER AS 100 METER VAN SEE,RIVIER,DAM</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">ANOTHER ESTATE / 24 UUR PATROLLIE MET HONDE EN VOERTUIE/GEELEKTRIFISEERDE HEINING/24 UUR WAGTE BY HEK</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">VERSEKERDE BEDRAG – R 4&nbsp;500&nbsp;000</span></li></ul><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">&nbsp;</span></p><ul 
type="disc" style="margin-top:0cm"><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">HUISINHOUD/INGESLUIT LOSSTAANDE WOONSTEL</span></li></ul><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">NUWE STRAAT 3</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">HEL &amp; NUWE ESTATE</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">PLEEK</span></p><p class="MsoListParagraph"><span lang="NL" style="font-size:12.0pt">0000</span></p><ul type="disc" style="margin-top:0cm"><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">SOOS BO GENOEM/SEKERUTEIT</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">VERSEKERDE BEDRAG --- R 750&nbsp;000</span></li><li class="MsoListParagraph" style="margin-left:0cm"><span lang="NL" style="font-size:12.0pt">TOEVALLIGGE SKADE/MEGANIESE/ELEKTRIESE ONKLAARRAKING/KRAGSTUWING ----- R 15&nbsp;000 ELK</span></li></ul><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">&nbsp;</span></p><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">&nbsp;</span></p><p class="MsoNormal"><span lang="NL" style="font-size:12.0pt">Groete</span></p><p class="MsoNormal" style="margin-left:18.0pt"><span lang="NL" style="font-size:12.0pt">&nbsp;</span></p><p class="MsoNormal">&nbsp;</p><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm; max-width:100%"><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm"><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="500" style="width:375.0pt"><tbody><tr><td style="padding:0cm 0cm 0cm 3.75pt"><p class="MsoNormal"><span class="rsnormal"><b><span style="font-size:13.5pt; font-family:&quot;Century Gothic&quot;,sans-serif; 
color:#19325C">User</span></b></span><b><span style="font-size:13.5pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#19325C"> <span class="rsnormal">New</span></span></b></p></td></tr><tr><td style="padding:0cm 0cm 0cm 0cm"><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:0cm 0cm 0cm 3.75pt"><p class="MsoNormal"><span class="rsnormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">Registered User</span></span><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A"></span></p></td></tr></tbody></table></td></tr></tbody></table></td></tr><tr><td colspan="2" style="padding:0cm 0cm 0cm 0cm"></td></tr><tr><td valign="top" style="padding:0cm 0cm 0cm 0cm"><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:0cm 0cm 0cm 3.75pt"><p class="MsoNormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">e: <a href="mailto:[email protected]"><span class="rsnormal"><span style="color:#5E5C5A; text-decoration:none">[email protected]</span></span></a></span></p></td></tr><tr><td style="padding:0cm 0cm 0cm 3.75pt"><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm"><p class="MsoNormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#1A7D88">t: </span><span class="rsnormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">000&nbsp;000 0002</span></span><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</span></p></td><td style="padding:0cm 0cm 0cm 0cm"><p class="MsoNormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#1A7D88">c: </span><span class="rsnormal"><span 
style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">000 000 0001</span></span><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;</span></p></td><td style="padding:0cm 0cm 0cm 0cm"><p class="MsoNormal"><span class="rsnormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A"><a href="https://prise.mimecast.com/s/7X5UCoYnG0hr3zzA5uzB3zz?domain=branding.sef.co.za"><span style="font-size:7.5pt; color:#364356; text-decoration:none">VCard</span></a></span></span><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A"></span></p></td></tr></tbody></table></td></tr><tr><td style="padding:0cm 0cm 0cm 3.75pt"><p class="MsoNormal"><span class="rsnormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">44 Ave Road, Address, 0000</span></span><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A"> - <span class="rsnormal"><a href="https://prise.mimecast.com/s/-LKNCpgoJqTn388gqFYx5x7?domain=branding.sef.co.za"><span style="font-size:7.5pt; color:#5E5C5A; text-decoration:none">View Map</span></a></span></span></p></td></tr><tr><td style="padding:0cm 0cm 0cm 0cm"></td></tr><tr><td style="padding:7.5pt 0cm 0cm 3.75pt"><p class="MsoNormal"><span style="font-size:9.0pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#5E5C5A">New Company (Pty) Ltd, is an authorised financial services provider, </span></p></td></tr><tr><td style="padding:7.5pt 0cm 0cm 3.75pt"><p class="MsoNormal"><span class="rsnormal"><span style="font-size:7.5pt; font-family:&quot;Century Gothic&quot;,sans-serif; color:#344355"><a href="https://prise.mimecast.com/s/FAJQCqjpKrh8j55gnUE2XEw?domain=efgroup.co.za/">View our Disclaimer</a></span></span><span style="font-size:7.5pt; font-family:&quot;Century 
Gothic&quot;,sans-serif; color:#344355">&nbsp;|&nbsp;<span class="rsnormal"><a href="https://prise.mimecast.com/s/U_JzCr0qLvc8lvvy7UjTrg-?domain=branding.sef.co.za"><span style="color:#344355; text-decoration:none"></span></a></span></span></p></td></tr></tbody></table></td><td style="padding:0cm 0cm 0cm 0cm"></td></tr><tr><td colspan="2" style="padding:0cm 0cm 0cm 0cm"><p class="MsoNormal">&nbsp;</p></td></tr></tbody></table></td></tr></tbody></table><p class="MsoNormal"><span style="display:none">&nbsp;</span></p><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" id="rs-bottom-banner" style="width:100.0%"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm; max-width:100%"></td></tr></tbody></table><p class="MsoNormal"><span style="display:none">&nbsp;</span></p><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="100%" id="rs-disclaimer" style="width:100.0%"><tbody><tr><td style="padding:0cm 0cm 0cm 0cm; max-width:100%"></td></tr></tbody></table><p class="MsoNormal">&nbsp;</p></div></div><br><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="max-width:100%; height:auto"></td></tr></tbody></table><table id="rs-disclaimer" width="100%" border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="max-width:100%; height:auto"><br></td></tr></tbody></table></div></body></html>

But I only want to extract the normal text, which is:
Quote:Good day

Any feedback on the email below.

Thanks
from the first email,

then the second email:
Quote:Voeg asb by vanaf vandag :


HUISEIENAARS
WOONHUIS MET LOSSTAANDE WOONSTEL
NUWE STRAAT 3

HEL & NEW ESTATE

PLEEK

7200

STANDAARD DAK/MURE
PERMANENT BEWOON DEUR EIENAAR
BEDAGS ONBEWOON
NIE MEER AS 60 DAE ONBEWOON PER JAAR
NIE NADER AS 100 METER VAN SEE,RIVIER,DAM
ANOTHER ESTATE / 24 UUR PATROLLIE MET HONDE EN VOERTUIE/GEELEKTRIFISEERDE HEINING/24 UUR WAGTE BY HEK
VERSEKERDE BEDRAG – R 4 500 000


HUISINHOUD/INGESLUIT LOSSTAANDE WOONSTEL
SPURWING STRAAT 3

HEL & NUWE ESTATE

PLEEK

7200

SOOS BO GENOEM/SEKERUTEIT
VERSEKERDE BEDRAG --- R 750 000
TOEVALLIGGE SKADE/MEGANIESE/ELEKTRIESE ONKLAARRAKING/KRAGSTUWING ----- R 15 000 ELK

Groete

The rest of the "Normal Text" is part of the signature, an email address, which I don't need. I only need the body of the email.
Reply
#7
I managed to clean the file to this point:

Quote: Good day Any feedback on the email below. Thanks Horse BrownPersonal Assistante: [email protected]: 000 000 0000 f: 000 000 0000 VCard00 My Street, Place, 0000 View MapView our Disclaimer From: New User & [email protected] Sent: Friday, June 16, 2023 11:11 AMTo: TTT Commercial (Outbound) & [email protected] Cc: Horse Brown & [email protected] Subject: 00000 MY COMPANY(PTY)LTD Voeg asb by vanaf vandag : HUISEIENAARSWOONHUIS MET LOSSTAANDE WOONSTELNUWE STRAAT 3HEL NEW ESTATEPLEEK0000STANDAARD DAK/MUREPERMANENT BEWOON DEUR EIENAARBEDAGS ONBEWOONNIE MEER AS 60 DAE ONBEWOON PER JAARNIE NADER AS 100 METER VAN SEE,RIVIER,DAMANOTHER ESTATE / 24 UUR PATROLLIE MET HONDE EN VOERTUIE/GEELEKTRIFISEERDE HEINING/24 UUR WAGTE BY HEKVERSEKERDE BEDRAG – R 4 500 000 HUISINHOUD/INGESLUIT LOSSTAANDE WOONSTELNUWE STRAAT 3HEL NUWE ESTATEPLEEK0000SOOS BO GENOEM/SEKERUTEITVERSEKERDE BEDRAG R 750 000TOEVALLIGGE SKADE/MEGANIESE/ELEKTRIESE ONKLAARRAKING/KRAGSTUWING R 15 000 ELK Groete User NewRegistered Usere: [email protected]: 000 000 0002 c: 000 000 0001 VCard44 Ave Road, Address, 0000 View MapNew Company (Pty) Ltd, is an authorised financial services provider, View our Disclaimer

I am now struggling to remove the signatures and email addresses.
Reply
#8
You should ask the experts here, but I'm pretty sure getting text from HTML is exactly what BeautifulSoup does!

I have only tried web scraping a couple of times, so I am not an expert.

import requests
from bs4 import BeautifulSoup as bs

# Request the page.
# Note: this site is protected by Cloudflare, so you will probably
# get a Cloudflare response back instead of the actual page.
page = requests.get('https://www.fundera.com/blog/business-finance-terms-and-definitions')

# Parse the page
soup = bs(page.text, 'html.parser')

# find_all('p') returns a list of all <p> tags and everything within them.
# Pass a different tag name to find other tags.
p_tags = soup.find_all('p')

# Collect the text of each <p> tag
tmp = [tag.text for tag in p_tags]

for t in tmp:
    print(t)
For getting text from emails check this link.
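For the email file in this thread, here is a rough sketch of the same idea without the web request. I have swapped in the standard library's html.parser instead of BeautifulSoup so it runs without installing anything. The names TextExtractor, email_body and SIGN_OFFS are made up for this example, and it assumes the body of each email ends with a closing word such as "Thanks" or "Groete" in its own tag, as in the samples posted above. Keeping every piece of text as a separate line also avoids words running together (HUISEIENAARSWOONHUIS ...).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, one entry per text node, skipping <style>/<script>."""

    def __init__(self):
        super().__init__()  # character references such as &nbsp; are converted
        self.chunks = []
        self._skip = 0  # greater than 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()  # also strips non-breaking spaces
        if text and not self._skip:
            self.chunks.append(text)


# Assumption: these closing words mark the end of the body
SIGN_OFFS = {"Thanks", "Groete"}


def email_body(html):
    """Return the text up to and including the first sign-off line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    body = []
    for line in parser.chunks:
        body.append(line)
        if line in SIGN_OFFS:
            break  # everything after this is treated as signature
    return "\n".join(body)


# Small made-up sample in the same shape as the email above
sample = (
    '<html><head><style><!-- p.MsoNormal {margin:0cm} --></style></head><body>'
    '<p class="MsoNormal"><span>Voeg asb by vanaf vandag :</span></p>'
    '<ul><li><span>HUISEIENAARS</span></li></ul>'
    '<ul><li><span>WOONHUIS MET LOSSTAANDE WOONSTEL</span></li></ul>'
    '<p class="MsoNormal"><span>&nbsp;</span></p>'
    '<p class="MsoNormal"><span>Groete</span></p>'
    '<p class="MsoNormal"><b><span>User</span></b> <span>New</span></p>'
    '</body></html>'
)
print(email_body(sample))
```

This stops at the first sign-off it finds, so a file holding a whole reply chain would need to be split into separate emails first (for example on the "From:" header lines). It will also cut too early if a body line consists only of one of the sign-off words.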
Reply

