Reading a Regex pattern - Printable Version
+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Reading a Regex pattern (/thread-17792.html)
RE: Reading a Regex pattern - stahorse - Apr-25-2019

I did go through them, but they don't really give me what I want. I'll show you my code and output. I can get what I want, and I'll show you the code for that too, but I'm looking for a better way to do it; I think there is one.

First code:

from bs4 import BeautifulSoup

html = '''ZZ~<ResponseCode>NoError</ResponseCode><Items><Message xmlns="http://schemas.micros~<?xml version="1.0" encoding="UTF-8"?><GetItemResponse xmlns="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns:m="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns~[EXTERNAL] FW: Notification of Creation of SR: 0023TKLB: Reinstate Policy, PolicyNumber - 63321045153, Policyholder - VAN ECK M EN JF VAN ECK BOERDERY CRM:0912353
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face {font-family:Tahoma; panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0cm; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;}
span.EmailStyle18 {mso-style-type:personal-reply; font-family:"Calibri",sans-serif; color:#1F497D;}
.MsoChpDefault {mso-style-type:export-only; font-size:10.0pt;}
@page WordSection1 {size:612.0pt 792.0pt; margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1 {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
</div>
</body>
'''

soup = BeautifulSoup(html, 'lxml')
text = soup.text.strip()
print(text)

Output:

As you can see, this code returns the header details as well. To remove the header, this is what I do:

soup = BeautifulSoup(html, 'lxml')
text = soup.text.strip().split('\n-->')[1].strip()
print(text)

My output:

So, I was wondering whether there is something that will remove the header for me.
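An alternative to splitting the text on '\n-->' (a sketch, not from the thread): take the text from the <body> tag only, so the <head> content never appears in the first place. Assuming a trimmed-down version of the HTML above:

```python
from bs4 import BeautifulSoup

html = """<head><style>p {color: red;}</style></head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1</a>
</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")
# Pull text only from <body>, so the <head> (styles, meta tags) is skipped
body_text = soup.body.get_text(strip=True)
print(body_text)
```

This avoids relying on a marker string like '\n-->' being present in the parsed text.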
RE: Reading a Regex pattern - snippsat - Apr-25-2019

(Apr-25-2019, 08:44 AM)stahorse Wrote: So, I was wondering whether there is something that will remove the header for me.

You are thinking about this the wrong way, as removing stuff. A web page can have thousands of lines, so you get the parts you need with the parser and leave the rest; you don't remove all the unwanted stuff, which can be a lot. So, to get the image info from the code above:

>>> img_tag = soup.find('div', id='images')
>>> img_tag
<div id="images">
<a href="image1.html">Name: My image 1 <br/><img src="image1_thumb.jpg"/></a>
<a href="image2.html">Name: My image 2 <br/><img src="image2_thumb.jpg"/></a>
<a href="image3.html">Name: My image 3 <br/><img src="image3_thumb.jpg"/></a>
</div>
>>> for img in img_tag.find_all('a'):
...     print(img.get('href'))
...
image1.html
image2.html
image3.html
>>> for img in img_tag.find_all('img'):
...     print(img.get('src'))
...
image1_thumb.jpg
image2_thumb.jpg
image3_thumb.jpg

RE: Reading a Regex pattern - NewBeie - Apr-25-2019

Thank you Snipp.
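For completeness: if the goal really is to delete the header from the parse tree rather than select around it, Beautiful Soup's decompose() removes a tag and its contents in place. A minimal sketch with a simplified document (not code from the thread):

```python
from bs4 import BeautifulSoup

html = "<head><style>h1 {font-weight: bold;}</style></head><body><p>Hello</p></body>"
soup = BeautifulSoup(html, "html.parser")

# decompose() destroys the <head> tag and everything inside it
if soup.head is not None:
    soup.head.decompose()

# Only the <body> content remains in the tree
print(soup.get_text(strip=True))
```

After the call, soup.get_text() returns only "Hello"; the style rules in <head> are gone from the tree entirely.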