BeautifulSoup: how to pick text that's not in HTML tags?
Hello guys,

I'm building a web scraper and everything went smoothly until I came across this situation:

There is a <p> tag that contains the information that I need to pick.
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights

The problem is that I need to pick the date (2019.10.10) and the number of nights (7 nights) only.

travel_date = inner_page_soup.find('strong', text='Travel date:').next_sibling
It works until there is no such "sibling", and then I get this error:
AttributeError: 'NoneType' object has no attribute 'next_sibling'

I've added a check for whether the variable is None, to fall back to another piece of info:
if travel_date is None:
    travel_date = inner_page_soup.find('div', {"class":"info"}).span.text
Do you have any ideas why it's not working?
If I knew the URL of your site, I would have used it as the example.
For this, I do the following:

Load the web site into Chrome or Firefox.
Highlight the text you are interested in and right-click.
Choose Inspect Element, then move the cursor in the inspector over the text node:
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
Right-click --> Copy --> XPath.
Paste it into code like this (your XPath will be different):
xpath = '/html/body/div[5]/div/div[4]/p/a[2]'
Now run code like:
from lxml import html
import requests
import sys

def get_stuff():
    # the URL was left blank in the original post; put your own here
    response = requests.get('')
    if response.status_code != 200:
        # stop here, there is nothing to parse
        print("can't load page")
        sys.exit(1)
    tree = html.fromstring(response.content)
    # replace with your xpath
    node = tree.xpath('/html/body/div[4]/div[2]/div[1]/div[2]/div/div[2]/p')
    text = node[0].text.strip()
    print(text)

if __name__ == '__main__':
    get_stuff()
A slow moving storm system will bring a continued threat for heavy snow over the Rockies, heavy rain, flooding, and severe weather over the Plains into midweek. Over the Gulf of Mexico, Tropical Storm Michael is expected to strengthen into a hurricane and cause direct impacts to the northeast Gulf Coast by midweek. Heavy rain from Michael could once again impact the Carolinas late week.
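
For the snippet in the question, a relative XPath may be easier to live with than an absolute one copied from the browser. This is only a sketch run against the two lines posted above, not the real page: the wrapping <p>, the SNIPPET name and the text_after_label helper are my assumptions. It returns None instead of raising when the label is not on the page.

```python
from lxml import html

# Assumed markup: the two lines from the question, wrapped in a <p>.
SNIPPET = """
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>
"""

def text_after_label(tree, label):
    """Return the text node right after <strong>label</strong>, or None."""
    # $label is an XPath variable; lxml fills it from the keyword argument.
    nodes = tree.xpath(
        './/strong[text()=$label]/following-sibling::text()[1]',
        label=label,
    )
    if not nodes:
        return None
    # &nbsp; arrives as '\xa0'; turn it into a space before stripping.
    return nodes[0].replace('\xa0', ' ').strip()

tree = html.fromstring(SNIPPET)
travel_date = text_after_label(tree, 'Travel date:')
travel_duration = text_after_label(tree, 'Travel duration:')
missing = text_after_label(tree, 'No such label:')
```

Matching on the label text means the expression should keep working if the page layout around the paragraph changes, which an absolute /html/body/... path usually does not.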

Thank you for your quick answer! Unfortunately, I can't share the web URL publicly.

I did manage to make it work according to your example; however, I've never used the lxml parser before and I would need to remake my whole code. It looks so much easier to use XPaths...
It will be in the text of the p tag.
It just needs some clean-up.
from bs4 import BeautifulSoup

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

soup = BeautifulSoup(html, 'lxml')
>>> s = soup.find('p').text
>>> s
'\nTravel date:\xa02019.10.10\nTravel duration:\xa07 nights\n'
>>> s = s.strip().replace('\xa0', ' ').split('\n')
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
You can also quickly make a dictionary.
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
>>> d = dict([i.split(': ') for i in s])
>>> d
{'Travel date': '2019.10.10', 'Travel duration': '7 nights'}
>>> d['Travel date']
'2019.10.10'
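
The clean-up steps above could be wrapped in a small function, roughly like this. It is a sketch only: paragraph_to_dict and SNIPPET are names I made up, the <p> wrapper is assumed, and I use the built-in 'html.parser' here so it runs without lxml installed. It splits on the first colon with partition(), so a line without a colon is skipped rather than breaking dict().

```python
from bs4 import BeautifulSoup

# Assumed markup: the two lines from the question, wrapped in a <p>.
SNIPPET = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

def paragraph_to_dict(html_text):
    """Turn 'Label: value' lines of the first <p> into a dict."""
    soup = BeautifulSoup(html_text, 'html.parser')
    p = soup.find('p')
    if p is None:
        # No paragraph at all: return an empty dict instead of raising.
        return {}
    result = {}
    for line in p.get_text().replace('\xa0', ' ').splitlines():
        # partition() splits on the first colon only and never raises.
        key, sep, value = line.partition(':')
        if sep:
            result[key.strip()] = value.strip()
    return result

info = paragraph_to_dict(SNIPPET)
```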
Quote: I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...
BS supports CSS selectors; lxml supports both XPath and CSS selectors.
I find CSS selectors fine to use in BS.
There are examples of using CSS selectors/XPath in BS and lxml in Web-Scraping part-1.
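
About the AttributeError in the first post: it is raised because find() itself returned None, so the None check has to happen before .next_sibling is touched, not after. Below is a rough sketch of that with a CSS selector in BS. The value_after_label helper, the SNIPPET markup and the 'p > strong' selector are assumptions based on the two lines posted, so adjust them to the real page.

```python
from bs4 import BeautifulSoup

# Assumed markup: the two lines from the question, wrapped in a <p>.
SNIPPET = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

def value_after_label(soup, label):
    """Return the bare text following <strong>label</strong>, or None."""
    for strong in soup.select('p > strong'):
        if strong.get_text(strip=True) != label:
            continue
        sibling = strong.next_sibling
        # NavigableString subclasses str; a Tag or None is skipped.
        if isinstance(sibling, str):
            return sibling.replace('\xa0', ' ').strip()
    # Label missing, or nothing usable after it.
    return None

soup = BeautifulSoup(SNIPPET, 'html.parser')
travel_date = value_after_label(soup, 'Travel date:')
travel_duration = value_after_label(soup, 'Travel duration:')
missing = value_after_label(soup, 'Price:')
```

When the result is None, the caller can fall back to another element, such as the div with class "info" from the first post.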
Awesome! Thank you, snippsat
