Python Forum
Beutifulsoup: how to pick text that's not in HTML tags?
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Beutifulsoup: how to pick text that's not in HTML tags?
#1
Hello guys,

I'm building a web scraper and everything went smooth so far until I came across such situation:

There is a <p> tag that contains the information that I need to pick.
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>

The problem is that I need to pick the date (2019.10.10) and the number of nights (7 nights) only.

travel_date = inner_page_soup.find('strong', text='Travel date:').next_sibling
It works until there is no such "sibling" and I get such error:
AttributeError: 'NoneType' object has no attribute 'next_sibling'

I've added a line to check if the variable is None and find another info.
  if travel_date is None:
travel_date = inner_page_soup.find('div', {"class":"info"}).span.text
Do you have any ideas why it's not working?
Reply
#2
If I knew the url of your site, I would have used it for example,
for this, I use https://www.weather.gov/

load the web site into chrome or firefox.
highlight the text you are interested in and right click
choose inspect element, move cursor in inspect over text node:
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
right click --> copy --> XPath
paste into code like (your xpath will be dfferent):
xpath = '/html/body/div[5]/div/div[4]/p/a[2]'
Now run code like:
from lxml import html
import requests
import sys


def get_stuff():
    page = None
    response = requests.get('https://www.weather.gov/')
    if response.status_code == 200:
        page = response.content
    else:
        print("c'ant load page")
        sys.exit(-1)
    
    tree = tree = html.fromstring((page))
    # replace with your xpath
    node = tree.xpath('/html/body/div[4]/div[2]/div[1]/div[2]/div/div[2]/p')
    text = node[0].text.strip()
    print(text)


if __name__ == '__main__':
    get_stuff()
results:
Output:
A slow moving storm system will bring a continued threat for heavy snow over the Rockies, heavy rain, flooding,and severe weather over the Plains into midweek. Over the Gulf of Mexico, Tropical Storm Michael is expected tostrengthen into a hurricane and cause direct impacts to the northeast Gulf Coast by midweek. Heavy rain from Michael could once again impact the Carolinas late week.
Reply
#3
Larz60+,

thank you for your quick answer! Unfortunately, I can't share the web URL publicly.

I do managed to make it work according to your example, however, I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...
Reply
#4
It will be in text element of p tag.
Have to do some clean up.
from bs4 import BeautifulSoup

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

soup = BeautifulSoup(html, 'lxml')
>>> s = soup.find('p').text
>>> s
'\nTravel date:\xa02019.10.10\nTravel duration:\xa07 nights\n'
>>> s = s.strip().replace('\xa0', ' ').split('\n')
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
Can quick also make a dictionary.
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
>>> d = dict([i.split(': ') for i in s])
>>> d
{'Travel date': '2019.10.10', 'Travel duration': '7 nights'}
>>> d['Travel date']
'2019.10.10' 
Quote: I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...
BS has support for CSS selector,lxml support both XPath and CSS selector.
I find CSS selector fine to use in BS.
There are example of use of CSS selector/XPath in BS and lxml in Web-Scraping part-1.
Reply
#5
Awesome! Thank you, snippsat
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
Question Python Obstacles | Jeet-Kune-Do | BS4 (Tags > MariaDB) [URL/Local HTML] BrandonKastning 0 1,388 Feb-08-2022, 08:55 PM
Last Post: BrandonKastning
  HTML multi select HTML listbox with Flask/Python rfeyer 0 4,481 Mar-14-2021, 12:23 PM
Last Post: rfeyer
  Any way to remove HTML tags from scraped data? (I want text only) SeBz2020uk 1 3,388 Nov-02-2020, 08:12 PM
Last Post: Larz60+
  Easy HTML Parser: Validating trs by attributes several tags deep? runswithascript 7 3,452 Aug-14-2020, 10:58 PM
Last Post: runswithascript
  Jinja2 HTML <a> tags not rendering properly ChaitanyaPy 4 3,146 Jun-28-2020, 06:12 PM
Last Post: ChaitanyaPy
  Python3 + BeautifulSoup4 + lxml (HTML -> CSV) - How to loop to next HTML/new CSV Row BrandonKastning 0 2,316 Mar-22-2020, 06:10 AM
Last Post: BrandonKastning
  Web crawler extracting specific text from HTML lewdow 1 3,329 Jan-03-2020, 11:21 PM
Last Post: snippsat
  Help on parsing simple text on HTML amaumox 5 3,366 Jan-03-2020, 05:50 PM
Last Post: amaumox
  Extract text between bold headlines from HTML CostasG 1 2,247 Aug-31-2019, 10:53 AM
Last Post: snippsat
  How do I get rid of the HTML tags in my output? glittergirl 1 3,679 Aug-05-2019, 08:30 PM
Last Post: snippsat

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020