Python Forum
Extracting An Object
#1
Hi all,

I'm still pretty new to web scraping, and in trying to challenge myself with different scenarios, I've come across one where there are two sets of text within the one object.

I'm not sure if I've explained that correctly, so let me first show you the HTML (please note: I've copied this snippet of the site's HTML into a local file to practice extracting various elements).

So here's the practice HTML that I've placed in a text file:

Output:
<div class="links" id="links642584"><ul class="links"><li><i class="fa fa-comment"></i> 40</li><li><span class="tag"><i class="fa fa-tag"></i> <a href="/cat/electrical-electronics">Electrical &amp; Electronics</a></span></li><li><span class="nodeexpiry"><i class="fa fa-calendar"></i> 12 Aug <span class="marker">6 days left</span> </span></li></ul></div>


If I search for the text of the span with class "nodeexpiry":
from bs4 import BeautifulSoup

with open("C:/Users/test_html_data.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

litest = soup('li')[2]
test1 = litest.find('span', {'class': 'nodeexpiry'}).text
I get:
Output:
12 Aug 6 days left
I know that if I searched for the span with class "marker", I could get just the "6 days left".

So my question is, how would I go about only extracting the 12 Aug please?
Reply
#2
(Aug-07-2021, 01:05 AM)knight2000 Wrote: So my question is, how would I go about only extracting the 12 Aug please?
After using .text the parser has done its job and can't do any more.
You could try to exclude/remove the span (with "6 days left") first,
but it's simpler to just use a regex to get the text you want once the parser has done its job.
>>> import re
>>> 
>>> tag = soup.find('span', class_="nodeexpiry")
>>> tag
<span class="nodeexpiry"><i class="fa fa-calendar"></i> 12 Aug <span class="marker">6 days left</span> </span>
>>> tag.text.strip()
'12 Aug 6 days left'
>>> 
>>> r = re.search(r'(\d+\s\w+)', tag.text.strip())
>>> r.group(1)
'12 Aug'
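To show the exclude/remove approach as well: the inner marker span can be removed with decompose() before reading .text. A quick sketch, using the snippet from the first post inline instead of the local file:

```python
from bs4 import BeautifulSoup

html = ('<span class="nodeexpiry"><i class="fa fa-calendar"></i> 12 Aug '
        '<span class="marker">6 days left</span> </span>')
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('span', class_='nodeexpiry')
# Remove the inner <span class="marker"> so only the date text remains
tag.find('span', class_='marker').decompose()
print(tag.text.strip())  # → 12 Aug
```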
Reply
#3
Can you not just ask for the strings? https://www.crummy.com/software/Beautifu...ed-strings.
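For reference, asking for the strings would look something like this with .stripped_strings (a sketch, using the snippet from the first post inline):

```python
from bs4 import BeautifulSoup

html = ('<span class="nodeexpiry"><i class="fa fa-calendar"></i> 12 Aug '
        '<span class="marker">6 days left</span> </span>')
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('span', class_='nodeexpiry')
# Each string inside the tag, whitespace stripped, blanks skipped
print(list(tag.stripped_strings))  # → ['12 Aug', '6 days left']
```

The date is then simply the first item of that list.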
Reply
#4
Hi Snippsat,

Thanks a lot for your help with this - that's perfect. After your help here I've watched more videos about regex today and picked some random samples to play with to get more familiar with regex - very cool.

Have a great day.



Reply
#5
(Aug-07-2021, 11:33 AM)ndc85430 Wrote: Can you not just ask for the strings? https://www.crummy.com/software/Beautifu...ed-strings.

Hi ndc85430,

I did have a look at that (thank you for the link) and after reading it, to be honest, I'm not sure how to apply it. I'll try looking at it again during the week to try to understand how it could be applied.

Thank you for your time and for chiming in- much appreciated.
Reply
#6
In your case here the link from ndc85430 may not work as-is,
but you can easily do it manually with just standard Python.
>>> s = '12 Aug 6 days left'
>>> s.split()
['12', 'Aug', '6', 'days', 'left']
>>> s.split()[:2]
['12', 'Aug']
>>> ' '.join(s.split()[:2])
'12 Aug'
>>> ' '.join(s.split()[2:])
'6 days left'
So regex can be useful if the text (gotten from the parser) is longer and you need some specific part of it.
Anyway, learning some basic regex is useful in many cases; there are many online tools that can help, e.g. regex101.
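As an example of pulling out both parts at once, a pattern with named groups could look like this (a sketch; the group names are made up for illustration):

```python
import re

s = '12 Aug 6 days left'
# Named groups: the date part first, the remainder second
m = re.search(r'(?P<date>\d+\s\w+)\s(?P<marker>.+)', s)
print(m.group('date'))    # → 12 Aug
print(m.group('marker'))  # → 6 days left
```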
Reply

