BeautifulSoup: how to pick text that's not in HTML tags?
Hello guys,

I'm building a web scraper and everything went smoothly until I came across this situation:

There is a <p> tag that contains the information that I need to pick.
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights

The problem is that I need to pick the date (2019.10.10) and the number of nights (7 nights) only.

travel_date = inner_page_soup.find('strong', text='Travel date:').next_sibling
It works until a page has no such element, and then I get this error:
AttributeError: 'NoneType' object has no attribute 'next_sibling'

I've added a check for None so that I can fall back to another element:
if travel_date is None:
    travel_date = inner_page_soup.find('div', {"class":"info"}).span.text
Do you have any idea why it's not working?
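A likely reason the check never helps: when `find()` matches nothing it returns None, so `.next_sibling` raises on the very first line, before the `if` is reached. Below is a minimal sketch of testing the result of `find()` first. It assumes the markup is exactly the fragment quoted above; `text_after_label` is a hypothetical helper name, and `string=` is the newer spelling of the `text=` argument.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question (assumed to be representative).
SAMPLE = """
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>
"""

def text_after_label(soup, label):
    """Return the bare text that follows <strong>label</strong>, or None.

    Hypothetical helper: it checks the result of find() before touching
    .next_sibling, which is the step the original one-liner skipped.
    """
    tag = soup.find('strong', string=label)
    if tag is None:
        return None  # label not present on this page
    sibling = tag.next_sibling
    if not isinstance(sibling, NavigableString):
        return None  # nothing follows, or the next node is another tag
    # &nbsp; is parsed as '\xa0'; normalise it, then trim whitespace
    return str(sibling).replace('\xa0', ' ').strip()

soup = BeautifulSoup(SAMPLE, 'html.parser')
travel_date = text_after_label(soup, 'Travel date:')
travel_duration = text_after_label(soup, 'Travel duration:')
missing = text_after_label(soup, 'Price:')  # label that is not in the page

print(travel_date, travel_duration, missing)
```

When the helper returns None, the fallback lookup on the `info` div from the post above can run safely.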
If I knew the URL of your site, I would have used it as the example.
For this kind of task I do the following:

Load the web site in Chrome or Firefox.
Highlight the text you are interested in and right click.
Choose Inspect Element, then move the cursor in the inspector over the text node:
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
Right click --> Copy --> XPath
Paste it into your code like this (your XPath will be different):
xpath = '/html/body/div[5]/div/div[4]/p/a[2]'
Now run code like:
from lxml import html
import requests
import sys

def get_stuff():
    # the URL was left blank in this post; put your page's URL here
    response = requests.get('')
    if response.status_code != 200:
        print("can't load page")
        sys.exit(-1)
    tree = html.fromstring(response.content)
    # replace with your xpath
    node = tree.xpath('/html/body/div[4]/div[2]/div[1]/div[2]/div/div[2]/p')
    text = node[0].text.strip()
    print(text)

if __name__ == '__main__':
    get_stuff()
Output:
A slow moving storm system will bring a continued threat for heavy snow over the Rockies, heavy rain, flooding, and severe weather over the Plains into midweek. Over the Gulf of Mexico, Tropical Storm Michael is expected to strengthen into a hurricane and cause direct impacts to the northeast Gulf Coast by midweek. Heavy rain from Michael could once again impact the Carolinas late week.
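An absolute XPath copied from the browser tends to break when the page layout shifts. As an alternative, here is a sketch of a relative XPath that anchors on the label text and takes the text node right after it. It assumes the markup matches the fragment in the question; `value_after` is a hypothetical helper name.

```python
from lxml import html

# Sample markup copied from the question (assumed to be representative).
SAMPLE = """
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>
"""

def value_after(tree, label):
    """Return the text node following <strong>label</strong>, or None.

    Hypothetical helper. $label is an XPath variable that lxml fills
    in from the keyword argument of the same name.
    """
    hits = tree.xpath(
        './/strong[text()=$label]/following-sibling::text()[1]',
        label=label,
    )
    # str.strip() also removes the leading non-breaking space ('\xa0')
    return hits[0].strip() if hits else None

tree = html.fromstring(SAMPLE)
travel_date = value_after(tree, 'Travel date:')
travel_duration = value_after(tree, 'Travel duration:')
no_match = value_after(tree, 'Price:')  # label that is not in the page

print(travel_date, travel_duration, no_match)
```

Because `xpath()` returns an empty list when nothing matches, a missing label comes back as None instead of raising.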

Thank you for your quick answer! Unfortunately, I can't share the web URL publicly.

I did manage to make it work following your example. However, I've never used the lxml parser before and I would need to rewrite my whole code. XPaths look so much easier to use...
The values are in the text of the p tag.
You just have to do some cleanup.
from bs4 import BeautifulSoup

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

soup = BeautifulSoup(html, 'lxml')
>>> s = soup.find('p').text
>>> s
'\nTravel date:\xa02019.10.10\nTravel duration:\xa07 nights\n'
>>> s = s.strip().replace('\xa0', ' ').split('\n')
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
You can also quickly make a dictionary.
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
>>> d = dict([i.split(': ') for i in s])
>>> d
{'Travel date': '2019.10.10', 'Travel duration': '7 nights'}
>>> d['Travel date']
'2019.10.10'
Quote: I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...
BS supports CSS selectors; lxml supports both XPath and CSS selectors.
I find CSS selectors fine to use in BS.
There are examples of using CSS selectors and XPath in BS and lxml in Web-Scraping part-1.
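To illustrate the CSS selector route without leaving BS, here is a small sketch that uses `select()` on the same fragment. It assumes the labels are always `<strong>` children of the `<p>` and that each value is the text node directly after its label; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed to be representative).
SAMPLE = """
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')

fields = {}
# 'p > strong' is a CSS selector: every <strong> that is a direct child of a <p>
for strong in soup.select('p > strong'):
    label = strong.get_text(strip=True).rstrip(':')
    value = strong.next_sibling  # the bare text right after the label, if any
    if value is not None:
        # &nbsp; is parsed as '\xa0'; normalise it, then trim whitespace
        fields[label] = str(value).replace('\xa0', ' ').strip()

print(fields)
```

This builds the same kind of dictionary as the split-based cleanup above, but it keys off the tags instead of the line breaks in the text.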
Awesome! Thank you, snippsat
