BeautifulSoup: how to pick text that's not in HTML tags?

pitonas · (This post was last modified: Oct-08-2018, 10:33 AM by pitonas.)

Hello guys,

I'm building a web scraper and everything went smoothly until I came across this situation:

There is a tag that contains the information that I need to pick.

Travel date: 2019.10.10 
Travel duration: 7 nights


The problem is that I need to pick the date (2019.10.10) and the number of nights (7 nights) only.

travel_date = inner_page_soup.find('strong', text='Travel date:').next_sibling

It works until there is no such "sibling", and then I get this error:
AttributeError: 'NoneType' object has no attribute 'next_sibling'

I've added a check to see whether the variable is None and, if so, to pick up the info from another element.

if travel_date is None:
    travel_date = inner_page_soup.find('div', {"class":"info"}).span.text

Do you have any ideas why it's not working?

**Larz60+** · (This post was last modified: Oct-08-2018, 11:08 AM by Larz60+.)

If I knew the URL of your site, I would have used it as the example.
Instead, I use https://www.weather.gov/

Load the web site in Chrome or Firefox.
Highlight the text you are interested in and right click.
Choose Inspect Element, then move the cursor in the inspector over the text node:

<strong>Travel date:</strong>&nbsp;2019.10.10<br>

Right click --> Copy --> XPath
Paste it into your code like this (your XPath will be different):

xpath = '/html/body/div[5]/div/div[4]/p/a[2]'

Now run code like:

from lxml import html
import requests
import sys


def get_stuff():
    page = None
    response = requests.get('https://www.weather.gov/')
    if response.status_code == 200:
        page = response.content
    else:
        print("can't load page")
        sys.exit(-1)
    
    tree = html.fromstring(page)
    # replace with your xpath
    node = tree.xpath('/html/body/div[4]/div[2]/div[1]/div[2]/div/div[2]/p')
    text = node[0].text.strip()
    print(text)


if __name__ == '__main__':
    get_stuff()

Output:
A slow moving storm system will bring a continued threat for heavy snow over the Rockies, heavy rain, flooding, and severe weather over the Plains into midweek. Over the Gulf of Mexico, Tropical Storm Michael is expected to strengthen into a hurricane and cause direct impacts to the northeast Gulf Coast by midweek. Heavy rain from Michael could once again impact the Carolinas late week.
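
In the markup from the question, the values are bare text nodes that follow each <strong>, so an XPath copied from the inspector that ends on an element will not return them directly. A minimal sketch of one way to reach them with lxml, assuming the page looks like the snippet shown above (the real URL was not shared); text_after_label is a hypothetical helper name, not part of lxml:

```python
from lxml import html

# Sample markup copied from the question; the real page is an assumption.
snippet = """
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

tree = html.fromstring(snippet)


def text_after_label(tree, label):
    """Return the text node right after <strong>label</strong>, or None."""
    # $label is an XPath variable that lxml fills from the keyword argument.
    nodes = tree.xpath(
        '//strong[text()=$label]/following-sibling::text()[1]', label=label)
    if not nodes:
        return None  # label not present on this page
    # &nbsp; arrives as '\xa0', so normalise it before stripping.
    return nodes[0].replace('\xa0', ' ').strip()


print(text_after_label(tree, 'Travel date:'))      # 2019.10.10
print(text_after_label(tree, 'Travel duration:'))  # 7 nights
print(text_after_label(tree, 'Price:'))            # None
```

Returning None for a missing label lets the caller fall back to another element instead of crashing.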

pitonas · Oct-08-2018, 12:07 PM

Larz60+,

thank you for your quick answer! Unfortunately, I can't share the web URL publicly.

I did manage to make it work following your example. However, I've never used the lxml parser before, and I would need to rewrite my whole code. It looks so much easier to use XPaths...

***snippsat*** · (This post was last modified: Oct-08-2018, 12:47 PM by snippsat.)

It will be in the text of the p tag.
It just needs some cleanup.

from bs4 import BeautifulSoup

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

soup = BeautifulSoup(html, 'lxml')

>>> s = soup.find('p').text
>>> s
'\nTravel date:\xa02019.10.10\nTravel duration:\xa07 nights\n'
>>> s = s.strip().replace('\xa0', ' ').split('\n')
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']

You can also quickly make a dictionary.

>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
>>> d = dict([i.split(': ') for i in s])
>>> d
{'Travel date': '2019.10.10', 'Travel duration': '7 nights'}
>>> d['Travel date']
'2019.10.10'
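
As for the original AttributeError: find() returns None when no matching <strong> exists, and .next_sibling is then evaluated on that None, so the error is raised before any later "is None" check can run. A minimal sketch of checking the tag first, using the same sample markup; value_after is a hypothetical helper name, and string= is the newer spelling of the text= argument used in the question:

```python
from bs4 import BeautifulSoup

# Same sample markup as above; the real page is an assumption.
html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

soup = BeautifulSoup(html, 'html.parser')


def value_after(soup, label):
    """Return the text that follows <strong>label</strong>, or None."""
    tag = soup.find('strong', string=label)
    if tag is None:
        return None  # no such label, so there is nothing to take a sibling of
    sibling = tag.next_sibling
    if not isinstance(sibling, str):
        return None  # the label is followed by a tag or by nothing at all
    return sibling.replace('\xa0', ' ').strip()


travel_date = value_after(soup, 'Travel date:')
if travel_date is None:
    # Fall back to another element here, as in the original post.
    pass
print(travel_date)                            # 2019.10.10
print(value_after(soup, 'Travel duration:'))  # 7 nights
```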

Quote: I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...

BS has support for CSS selectors; lxml supports both XPath and CSS selectors.
I find CSS selectors fine to use in BS.
There are examples of using CSS selectors and XPath in BS and lxml in Web-Scraping part-1.
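
A rough sketch of the CSS selector route in BS for this markup, so the existing BeautifulSoup code would not need to move to lxml. It assumes every label is a <strong> directly inside the <p>, as in the sample; the selector is an illustration, not taken from the real page:

```python
from bs4 import BeautifulSoup

# Same sample markup as above; the real page is an assumption.
html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

soup = BeautifulSoup(html, 'html.parser')

d = {}
# select() takes a CSS selector and returns a list of matching tags.
for strong in soup.select('p > strong'):
    key = strong.get_text(strip=True).rstrip(':')
    value = strong.next_sibling
    # Keep only labels that are followed by a plain text node.
    if isinstance(value, str):
        d[key] = value.replace('\xa0', ' ').strip()

print(d)  # {'Travel date': '2019.10.10', 'Travel duration': '7 nights'}
```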

pitonas · Oct-08-2018, 01:43 PM

Awesome! Thank you, snippsat
