Python Forum
Beutifulsoup: how to pick text that's not in HTML tags? - Printable Version


Beutifulsoup: how to pick text that's not in HTML tags? - pitonas - Oct-08-2018

Hello guys,

I'm building a web scraper, and everything went smoothly until I came across this situation:

There is a <p> tag that contains the information that I need to pick.
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>

The problem is that I need to pick the date (2019.10.10) and the number of nights (7 nights) only.

travel_date = inner_page_soup.find('strong', text='Travel date:').next_sibling
It works until there is no such "sibling", and then I get this error:
AttributeError: 'NoneType' object has no attribute 'next_sibling'

I've added a line to check whether the variable is None and, if so, pick up the information from another element:
if travel_date is None:
    travel_date = inner_page_soup.find('div', {"class":"info"}).span.text
Do you have any ideas why it's not working?


RE: Beutifulsoup: how to pick text that's not in HTML tags? - Larz60+ - Oct-08-2018

If I knew the URL of your site, I would have used it as the example.
Instead, I'll use https://www.weather.gov/

Load the web site in Chrome or Firefox.
Highlight the text you are interested in and right click.
Choose Inspect Element, then move the cursor in the inspector over the text node:
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
Right click --> Copy --> XPath
Paste it into your code like this (your XPath will be different):
xpath = '/html/body/div[5]/div/div[4]/p/a[2]'
Now run code like:
from lxml import html
import requests
import sys


def get_stuff():
    page = None
    # Fetch the page and stop if the request did not succeed
    response = requests.get('https://www.weather.gov/')
    if response.status_code == 200:
        page = response.content
    else:
        print("can't load page")
        sys.exit(-1)

    # Parse the raw bytes into an lxml element tree
    tree = html.fromstring(page)
    # replace with your xpath
    node = tree.xpath('/html/body/div[4]/div[2]/div[1]/div[2]/div/div[2]/p')
    # xpath() returns a list; take the first match and its leading text
    text = node[0].text.strip()
    print(text)


if __name__ == '__main__':
    get_stuff()
results:
Output:
A slow moving storm system will bring a continued threat for heavy snow over the Rockies, heavy rain, flooding, and severe weather over the Plains into midweek. Over the Gulf of Mexico, Tropical Storm Michael is expected to strengthen into a hurricane and cause direct impacts to the northeast Gulf Coast by midweek. Heavy rain from Michael could once again impact the Carolinas late week.
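
Applied to the markup from the question, the same idea might look like the sketch below. It assumes the sample HTML exactly as posted, and it uses a relative XPath with following-sibling::text() instead of a copied absolute path; the names SAMPLE and text_after are only illustrative.

```python
from lxml import html

# Sample markup copied from the question
SAMPLE = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

tree = html.fromstring(SAMPLE)


def text_after(label):
    """Return the first text node following <strong>label</strong>, or None."""
    nodes = tree.xpath(
        './/strong[text()=$label]/following-sibling::text()[1]', label=label
    )
    # xpath() returns an empty list when nothing matches
    return nodes[0].strip() if nodes else None


travel_date = text_after('Travel date:')
duration = text_after('Travel duration:')
print(travel_date, duration)
```

Because the expression is anchored on the label text rather than on the page layout, it should keep working if the surrounding div structure changes.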



RE: Beutifulsoup: how to pick text that's not in HTML tags? - pitonas - Oct-08-2018

Larz60+,

thank you for your quick answer! Unfortunately, I can't share the web URL publicly.

I did manage to make it work following your example; however, I've never used the lxml parser before, and I would need to rewrite my whole code. It looks so much easier to use XPaths...


RE: Beutifulsoup: how to pick text that's not in HTML tags? - snippsat - Oct-08-2018

The values are in the text of the p tag.
It just needs some cleanup.
from bs4 import BeautifulSoup

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

soup = BeautifulSoup(html, 'lxml')
>>> s = soup.find('p').text
>>> s
'\nTravel date:\xa02019.10.10\nTravel duration:\xa07 nights\n'
>>> s = s.strip().replace('\xa0', ' ').split('\n')
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
You can also quickly make a dictionary.
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
>>> d = dict([i.split(': ') for i in s])
>>> d
{'Travel date': '2019.10.10', 'Travel duration': '7 nights'}
>>> d['Travel date']
'2019.10.10' 
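
As for the AttributeError in the original approach: find() returns None when no matching <strong> exists, so .next_sibling fails on that same line, before a later None check can run. Below is a minimal sketch of checking the tag first, using the sample markup from the question. The helper name text_after_label is made up for illustration, and 'html.parser' is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question
SAMPLE = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""


def text_after_label(soup, label):
    """Return the text right after <strong>label</strong>, or None."""
    tag = soup.find('strong', string=label)
    if tag is None:
        # No such label on this page
        return None
    sibling = tag.next_sibling
    if not isinstance(sibling, str):
        # The next node is missing or is another tag, not text
        return None
    return sibling.replace('\xa0', ' ').strip()


soup = BeautifulSoup(SAMPLE, 'html.parser')
travel_date = text_after_label(soup, 'Travel date:')
nights = text_after_label(soup, 'Travel duration:')
missing = text_after_label(soup, 'Price:')
print(travel_date, nights, missing)
```

When the helper returns None, the caller can fall back to another element, such as the div with class "info" mentioned in the question.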
Quote: I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...
BS supports CSS selectors; lxml supports both XPath and CSS selectors.
I find CSS selectors work fine in BS.
There are examples of using CSS selectors and XPath in BS and lxml in Web-Scraping part-1.
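
For the markup in this thread, a CSS selector version in BS could look roughly like this. It is a sketch that assumes the sample HTML as posted and that every <strong> label is directly followed by its value as plain text; the dictionary name fields is illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question
SAMPLE = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

soup = BeautifulSoup(SAMPLE, 'html.parser')

fields = {}
# 'p > strong' selects each <strong> that is a direct child of a <p>
for label in soup.select('p > strong'):
    value = label.next_sibling
    # Keep the pair only when the next node is a text node
    if isinstance(value, str):
        key = label.get_text(strip=True).rstrip(':')
        fields[key] = value.strip()

print(fields)
```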


RE: Beutifulsoup: how to pick text that's not in HTML tags? - pitonas - Oct-08-2018

Awesome! Thank you, snippsat