BeautifulSoup: how to pick text that's not in HTML tags?
#1
Hello guys,

I'm building a web scraper, and everything went smoothly until I came across this situation:

There is a <p> tag that contains the information that I need to pick.
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>

The problem is that I need to pick the date (2019.10.10) and the number of nights (7 nights) only.

travel_date = inner_page_soup.find('strong', text='Travel date:').next_sibling
It works until there is no such "sibling", and then I get this error:
AttributeError: 'NoneType' object has no attribute 'next_sibling'

I've added a line to check whether the variable is None and, if so, pick up the info from another place:
if travel_date is None:
    travel_date = inner_page_soup.find('div', {"class":"info"}).span.text
Do you have any ideas why it's not working?
#2
If I knew the URL of your site, I would have used it for the example.
Instead, I'll use https://www.weather.gov/

Load the website in Chrome or Firefox.
Highlight the text you are interested in and right-click.
Choose Inspect Element, then move the cursor in the inspector over the text node:
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
Right-click --> Copy --> XPath
Paste it into your code like this (your XPath will be different):
xpath = '/html/body/div[5]/div/div[4]/p/a[2]'
Now run code like:
from lxml import html
import requests
import sys


def get_stuff():
    response = requests.get('https://www.weather.gov/')
    if response.status_code != 200:
        print("can't load page")
        sys.exit(-1)

    # parse the raw page bytes into an lxml element tree
    tree = html.fromstring(response.content)
    # replace with your xpath
    node = tree.xpath('/html/body/div[4]/div[2]/div[1]/div[2]/div/div[2]/p')
    text = node[0].text.strip()
    print(text)


if __name__ == '__main__':
    get_stuff()
results:
Output:
A slow moving storm system will bring a continued threat for heavy snow over the Rockies, heavy rain, flooding, and severe weather over the Plains into midweek. Over the Gulf of Mexico, Tropical Storm Michael is expected to strengthen into a hurricane and cause direct impacts to the northeast Gulf Coast by midweek. Heavy rain from Michael could once again impact the Carolinas late week.
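An XPath does not have to be an absolute path copied from the browser; it can also select the text node that follows a label. The following is only a sketch run against the snippet from post #1, not the real site, so the variable names and the relative `.//strong` expression are assumptions made for illustration:

```python
from lxml import html

# The fragment from post #1, used here instead of a live page
snippet = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

tree = html.fromstring(snippet)

# First text node that follows the <strong> holding the label
nodes = tree.xpath(
    './/strong[text()="Travel date:"]/following-sibling::text()[1]'
)

# xpath() returns an empty list when nothing matches, so check before indexing
travel_date = nodes[0].replace('\xa0', ' ').strip() if nodes else None
print(travel_date)
```

Because `xpath()` returns a list, a missing label shows up as an empty list rather than an exception.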
#3
Larz60+,

thank you for your quick answer! Unfortunately, I can't share the web URL publicly.

I did manage to make it work following your example; however, I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...
#4
It will be in the text of the p tag.
You have to do some cleanup.
from bs4 import BeautifulSoup

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

soup = BeautifulSoup(html, 'lxml')
>>> s = soup.find('p').text
>>> s
'\nTravel date:\xa02019.10.10\nTravel duration:\xa07 nights\n'
>>> s = s.strip().replace('\xa0', ' ').split('\n')
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
You can also quickly make a dictionary.
>>> s
['Travel date: 2019.10.10', 'Travel duration: 7 nights']
>>> d = dict([i.split(': ') for i in s])
>>> d
{'Travel date': '2019.10.10', 'Travel duration': '7 nights'}
>>> d['Travel date']
'2019.10.10' 
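As for the AttributeError in post #1: `find()` returns None when no matching `<strong>` exists, so `.next_sibling` is evaluated on None before the `is None` check ever runs. One way around it is to check the tag first and only then read its sibling. This is a rough sketch; the helper name `text_after_label` is made up for illustration, and `'html.parser'` is used only so the example does not depend on lxml:

```python
from bs4 import BeautifulSoup, NavigableString

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""


def text_after_label(soup, label):
    """Return the text right after <strong>label</strong>, or None."""
    tag = soup.find('strong', string=label)
    if tag is None:
        # No such label on this page: the caller can fall back to another lookup
        return None
    sibling = tag.next_sibling
    if not isinstance(sibling, NavigableString):
        # The label is followed by another tag (or nothing), not by bare text
        return None
    return sibling.replace('\xa0', ' ').strip()


soup = BeautifulSoup(html, 'html.parser')
travel_date = text_after_label(soup, 'Travel date:')
travel_duration = text_after_label(soup, 'Travel duration:')
missing = text_after_label(soup, 'Price:')
print(travel_date, travel_duration, missing)
```

When the helper returns None, that is the place to try the fallback lookup from post #1.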
Quote: I've never used lxml parser before and I would need to remake my whole code. It looks so much easier to use xpaths...
BS supports CSS selectors; lxml supports both XPath and CSS selectors.
I find CSS selectors fine to use in BS.
There are examples of using CSS selectors/XPath in BS and lxml in Web-Scraping part-1.
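To give an idea of what a CSS selector looks like in BS, here is a small sketch on the same snippet. The selector `p > strong` and the `details` dictionary are just choices for this example, and `'html.parser'` is used so it runs without lxml:

```python
from bs4 import BeautifulSoup

html = """\
<p>
<strong>Travel date:</strong>&nbsp;2019.10.10<br>
<strong>Travel duration:</strong>&nbsp;7 nights
</p>"""

soup = BeautifulSoup(html, 'html.parser')

details = {}
# select() returns a (possibly empty) list, so no None check is needed here
for strong in soup.select('p > strong'):
    label = strong.get_text(strip=True).rstrip(':')
    value = strong.next_sibling
    # Only keep the sibling when it is bare text, not another tag
    if isinstance(value, str):
        details[label] = value.replace('\xa0', ' ').strip()

print(details)
```

The same `select()` call accepts class selectors too, so something like `div.info span` would cover the fallback lookup from post #1.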
#5
Awesome! Thank you, snippsat