Python Forum
Reading a Regex pattern
#11
I did go through them, but they don't really give me what I want. I'll show you my code and output. I can get what I want, and I'll show you the code for that too, but I think there is a better way to do it.

First code:
from bs4 import BeautifulSoup

html = '''ZZ~<ResponseCode>NoError</ResponseCode><Items><Message xmlns="http://schemas.micros~<?xml version="1.0" encoding="UTF-8"?><GetItemResponse xmlns="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns:m="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns~[EXTERNAL] FW: Notification of Creation of SR: 0023TKLB: Reinstate Policy, PolicyNumber - 63321045153, Policyholder - VAN ECK M EN JF VAN ECK BOERDERY CRM:0912353
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
   {font-family:Calibri;
   panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
   {font-family:Tahoma;
   panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
   {margin:0cm;
   margin-bottom:.0001pt;
   font-size:12.0pt;
   font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
   {mso-style-priority:99;
   color:blue;
   text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
   {mso-style-priority:99;
   color:purple;
   text-decoration:underline;}
span.EmailStyle18
   {mso-style-type:personal-reply;
   font-family:"Calibri",sans-serif;
   color:#1F497D;}
.MsoChpDefault
   {mso-style-type:export-only;
   font-size:10.0pt;}
@page WordSection1
   {size:612.0pt 792.0pt;
   margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
   {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body>
  <div id='images'>
    <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
    <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
    <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
  </div>
</body>
'''
soup = BeautifulSoup(html, 'lxml')  # parse with the lxml parser
text = soup.text.strip()
print(text)
Output:
ZZ~NoError <!-- /* Font Definitions */ @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} @font-face {font-family:Tahoma; panose-1:2 11 6 4 3 5 4 4 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0cm; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman",serif;} a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;} span.EmailStyle18 {mso-style-type:personal-reply; font-family:"Calibri",sans-serif; color:#1F497D;} .MsoChpDefault {mso-style-type:export-only; font-size:10.0pt;} @page WordSection1 {size:612.0pt 792.0pt; margin:72.0pt 72.0pt 72.0pt 72.0pt;} div.WordSection1 {page:WordSection1;} --> Name: My image 1 Name: My image 2 Name: My image 3
As you can see, this code returns the header details as well. To remove the header, this is what I do:

soup = BeautifulSoup(html, 'lxml')
text = soup.text.strip().split('\n-->')[1].strip()
print(text)
My output:
Name: My image 1 Name: My image 2 Name: My image 3
So I was wondering whether there is something that will remove the header for me.
#12
(Apr-25-2019, 08:44 AM)stahorse Wrote: So I was wondering whether there is something that will remove the header for me.
You are thinking about this the wrong way, as removing stuff.
A web page can have thousands of lines, so you use the parser to get the stuff you need and leave the rest,
rather than removing all the unwanted stuff, which can be a lot.

So, to get the image info from the code above:
>>> img_tag = soup.find('div', id='images')
>>> img_tag
<div id="images">
<a href="image1.html">Name: My image 1 <br/><img src="image1_thumb.jpg"/></a>
<a href="image2.html">Name: My image 2 <br/><img src="image2_thumb.jpg"/></a>
<a href="image3.html">Name: My image 3 <br/><img src="image3_thumb.jpg"/></a>
</div>
>>> 
>>> for img in img_tag.find_all('a'):
...     print(img.get('href'))
...     
image1.html
image2.html
image3.html
>>> for img in img_tag.find_all('img'):
...     print(img.get('src'))     
...     
image1_thumb.jpg
image2_thumb.jpg
image3_thumb.jpg
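If what you are after is the "Name: My image N" text rather than the href or src attributes, the same idea applies: select the div and read the text of each link. This is only a minimal sketch. The html string is a trimmed-down stand-in for the one in the first post (an assumption, not the full e-mail), and 'html.parser' is used so that nothing besides bs4 is needed.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the HTML in the first post (assumption: the real
# document contains the same <div id='images'> block).
html = '''
<head><style><!-- p.MsoNormal {margin:0cm;} --></style></head>
<body>
  <div id='images'>
    <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
    <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
    <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
  </div>
</body>
'''

soup = BeautifulSoup(html, 'html.parser')
img_tag = soup.find('div', id='images')

# Only the text inside each <a> is read, so the <style> content in <head>
# never shows up and nothing has to be split off afterwards.
names = [a.get_text(strip=True) for a in img_tag.find_all('a')]
print(' '.join(names))
```

This should print the same line as the split('\n-->') version, without depending on where the style comment happens to end.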
#13
Thank you Snipp.