Reading a Regex pattern - Printable Version
+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Reading a Regex pattern (/thread-17792.html)
RE: Reading a Regex pattern - stahorse - Apr-25-2019

I did go through them, but they don't really give me what I want. I'll show you my code and output. I can get what I want, and I'll show you the code for that too, but I'm looking for a better way to do it; I think there is one.

First code:

from bs4 import BeautifulSoup

html = '''ZZ~<ResponseCode>NoError</ResponseCode><Items><Message xmlns="http://schemas.micros~<?xml version="1.0" encoding="UTF-8"?><GetItemResponse xmlns="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns:m="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns~[EXTERNAL] FW: Notification of Creation of SR: 0023TKLB: Reinstate Policy, PolicyNumber - 63321045153, Policyholder - VAN ECK M EN JF VAN ECK BOERDERY CRM:0912353
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face {font-family:Tahoma; panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0cm; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;}
span.EmailStyle18 {mso-style-type:personal-reply; font-family:"Calibri",sans-serif; color:#1F497D;}
.MsoChpDefault {mso-style-type:export-only; font-size:10.0pt;}
@page WordSection1 {size:612.0pt 792.0pt; margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1 {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
</div>
</body>
'''

soup = BeautifulSoup(html, 'lxml')
text = soup.text.strip()
print(text)

Output:

As you can see, this code returns the header details as well. To remove the header, this is what I do:

soup = BeautifulSoup(html, 'lxml')
text = soup.text.strip().split('\n-->')[1].strip()
print(text)

My output:

So, I was wondering whether there is something that will remove the header for me.
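An alternative to splitting the text on '\n-->' (a sketch, not from the thread): take the text from the <body> tag only, so the <head> content never appears in the first place. Assuming a trimmed-down version of the HTML above:

```python
from bs4 import BeautifulSoup

html = """<head><style>p {color: red;}</style></head>
<body>
<div id='images'>
<a href='image1.html'>Name: My image 1</a>
</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")
# Pull text only from <body>, so the <head> (styles, meta tags) is skipped
body_text = soup.body.get_text(strip=True)
print(body_text)
```

This avoids relying on a marker string like '\n-->' being present in the parsed text.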
RE: Reading a Regex pattern - snippsat - Apr-25-2019

(Apr-25-2019, 08:44 AM)stahorse Wrote: So, I was wondering whether there is something that will remove the header for me.

You are thinking about this the wrong way, as removing stuff. A web page can have thousands of lines, so you get the parts you need with the parser and leave the rest; you don't remove all the unwanted stuff, which can be a lot. So, to get the image info from the code above:

>>> img_tag = soup.find('div', id='images')
>>> img_tag
<div id="images">
<a href="image1.html">Name: My image 1 <br/><img src="image1_thumb.jpg"/></a>
<a href="image2.html">Name: My image 2 <br/><img src="image2_thumb.jpg"/></a>
<a href="image3.html">Name: My image 3 <br/><img src="image3_thumb.jpg"/></a>
</div>
>>> for img in img_tag.find_all('a'):
...     print(img.get('href'))
...
image1.html
image2.html
image3.html
>>> for img in img_tag.find_all('img'):
...     print(img.get('src'))
...
image1_thumb.jpg
image2_thumb.jpg
image3_thumb.jpg

RE: Reading a Regex pattern - NewBeie - Apr-25-2019

Thank you Snipp.
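For completeness: if the goal really is to delete the header from the parse tree rather than select around it, Beautiful Soup's decompose() removes a tag and its contents in place. A minimal sketch with a simplified document (not code from the thread):

```python
from bs4 import BeautifulSoup

html = "<head><style>h1 {font-weight: bold;}</style></head><body><p>Hello</p></body>"
soup = BeautifulSoup(html, "html.parser")

# decompose() destroys the <head> tag and everything inside it
if soup.head is not None:
    soup.head.decompose()

# Only the <body> content remains in the tree
print(soup.get_text(strip=True))
```

After the call, soup.get_text() returns only "Hello"; the style rules in <head> are gone from the tree entirely.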