Getting a specific text inside an html with soup

mathieugrimbert · Jul-08-2019, 01:19 PM

Hi, I apologise for the question, but I am new to scraping in Python and I am struggling to access a text inside an HTML document. I passed the article/HTML through the soup but I haven't succeeded in getting the text (in bold). I tried children, comments and different types of navigable string, but the best I could get was "Google" when using the code below:

link = soup.find_all('p')[i]
article_body.append(link.string)

Thanks in advance for the help. Any suggestion would be very much appreciated.

The HTML code is below:

<div class="o-teaser o-teaser--article o-teaser--small o-teaser--has-image js-teaser" data-id="3bbb6fec-88c5-11e9-a028-86cea8523dc2">
<div class="o-teaser__content">
<div class="o-teaser__meta">
<div class="o-teaser__meta-tag">
<a class="o-teaser__tag" data-trackable="teaser-tag" href="/stream/254cd19f-4724-4c89-9230-926e8201a823">Huawei Technologies Co Ltd</a>
</div>
</div>
<div class="o-teaser__heading">
<a class="js-teaser-heading-link" data-trackable="heading-link" href="/content/3bbb6fec-88c5-11e9-a028-86cea8523dc2">
<span>
<mark class="search-item__highlight">Google</mark> warns of US national security risks from Huawei ban
</span>
</a>
</div>
<p class="o-teaser__standfirst">
<a class="js-teaser-standfirst-link" data-trackable="standfirst-link" href="/content/3bbb6fec-88c5-11e9-a028-86cea8523dc2" tabindex="-1">
<span>
...
<mark class="search-item__highlight">Google</mark> has warned the Trump administration it risks compromising US national security if it pushes ahead with sweeping export restrictions on Huawei, as the technology group seeks to continue doing...
</span>
</a></p><div class="o-teaser__timestamp">
<time class="o-teaser__timestamp-date" datetime="2019-06-07T03:36:51+0000">June 7, 2019</time>

mathieugrimbert · Jul-08-2019, 04:30 PM

To summarise, my issue is that I understand how to look for the "mark" tag, but I don't know how to look for the text after the closing </mark>. Thank you in advance for any tips.

***snippsat*** · Jul-08-2019, 05:06 PM

How it works.

from bs4 import BeautifulSoup

html = '''\
<span><mark class="search-item__highlight">Google</mark> has warned the Trump administration</span>'''

soup = BeautifulSoup(html, 'lxml')

Use:

>>> span_tag = soup.find('span')
>>> span_tag
<span><mark class="search-item__highlight">Google</mark> has warned the Trump administration</span>

>>> span_tag.text
'Google has warned the Trump administration'

The span tag's .text gives the output shown above; Google is only highlighted when the HTML is rendered.
To find the mark tag:

>>> mark_tag = span_tag.find('mark')
>>> mark_tag
<mark class="search-item__highlight">Google</mark>

>>> mark_tag.text
'Google'

# The CSS class name can be found with attrs
>>> mark_tag.attrs
{'class': ['search-item__highlight']}
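
As a small sketch of why .string and .text can behave differently on the same markup: .string returns None when a tag has more than one child, while the text directly after the mark tag can be reached with next_sibling. The variable names span_string, mark_string and after_mark are only illustrative, and html.parser is used here just to avoid the lxml dependency; 'lxml' should give the same result for this snippet.

```python
from bs4 import BeautifulSoup

html = '<span><mark class="search-item__highlight">Google</mark> has warned the Trump administration</span>'

# html.parser ships with Python; it is an assumption made to keep the example dependency-free
soup = BeautifulSoup(html, 'html.parser')

span_tag = soup.find('span')
mark_tag = span_tag.find('mark')

# The span has two children (<mark> and a text node), so .string is None
span_string = span_tag.string

# <mark> has a single text child, so .string returns that text
mark_string = mark_tag.string

# The text node that directly follows <mark> inside the span
after_mark = mark_tag.next_sibling.strip()

print(span_string)
print(mark_string)
print(after_mark)
```

If the whole sentence is wanted, span_tag.text as shown above is the simpler route; next_sibling is only useful when the highlighted word should be left out.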

mathieugrimbert · Jul-09-2019, 10:34 AM

Thank you snippsat. That is helpful. I still have two little issues

1) I am doing that inside a loop and trying to capture all the 'span' tags. When I try to put them in an array with the code below, I get something different from the text, although if I just print link.text I get the same text as you.

link = soup.find_all('span')[i]
article_body.append(link.text)

2) How can I use two loops (or two criteria) with soup.findAll? I need to be able to select the 'span' inside the 'a' tag.

Thank you very much in advance for your help!

***snippsat*** · Jul-09-2019, 11:01 AM

(Jul-09-2019, 10:34 AM)mathieugrimbert Wrote: I need to be able to select the 'span' inside the 'a' class?

CSS selectors work in BS by using select().
This will get all <span> tags that are inside an <a>.

span_tag = soup.select('a span')
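
If 'a span' matches more than is wanted, the selector can also include a class. The sketch below assumes a trimmed-down version of the posted HTML (the class names come from the thread, the span texts are placeholders) and uses html.parser only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Shortened sample markup; the class names are taken from the posted HTML,
# the span contents are placeholders
html = '''\
<div class="o-teaser__heading">
<a class="js-teaser-heading-link"><span>Heading text</span></a>
</div>
<p class="o-teaser__standfirst">
<a class="js-teaser-standfirst-link"><span>Standfirst text</span></a>
</p>'''

soup = BeautifulSoup(html, 'html.parser')

# Every <span> nested anywhere inside any <a>
all_spans = soup.select('a span')

# Only <span> tags inside <a> tags that carry the given class
standfirst_spans = soup.select('a.js-teaser-standfirst-link span')
standfirst_texts = [tag.get_text(strip=True) for tag in standfirst_spans]

print(len(all_spans))
print(standfirst_texts)
```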

mathieugrimbert · Jul-09-2019, 11:27 AM

Thanks snippsat. However, it seems to select the 'span' inside the first 'a' only. What I am trying to do is get a list of all the 'span' tags inside every 'a' (I normally have only one 'span' per 'a').

Thanks in advance for your help!

***snippsat*** · Jul-09-2019, 11:51 AM

Quote:Thanks snippsat. Although it seems to be selected the 'span' inside the first 'a' only.

It should select all of them. A quick test with the code you posted:

from bs4 import BeautifulSoup

html = '''\
<div class="o-teaser o-teaser--article o-teaser--small o-teaser--has-image js-teaser" data-id="3bbb6fec-88c5-11e9-a028-86cea8523dc2">
<div class="o-teaser__content">
<div class="o-teaser__meta">
<div class="o-teaser__meta-tag">
<a class="o-teaser__tag" data-trackable="teaser-tag" href="/stream/254cd19f-4724-4c89-9230-926e8201a823">Huawei Technologies Co Ltd</a>
</div>
</div>
<div class="o-teaser__heading">
<a class="js-teaser-heading-link" data-trackable="heading-link" href="/content/3bbb6fec-88c5-11e9-a028-86cea8523dc2">
<span>
<mark class="search-item__highlight">Google</mark> warns of US national security risks from Huawei ban
</span>
</a>
</div>
<p class="o-teaser__standfirst">
<a class="js-teaser-standfirst-link" data-trackable="standfirst-link" href="/content/3bbb6fec-88c5-11e9-a028-86cea8523dc2" tabindex="-1">
<span>
...
<mark class="search-item__highlight">Google</mark> has warned the Trump administration it risks compromising US national security if it pushes ahead with sweeping export restrictions on Huawei, as the technology group seeks to continue doing...
</span>
</a></p><div class="o-teaser__timestamp">
<time class="o-teaser__timestamp-date" datetime="2019-06-07T03:36:51+0000">June 7, 2019</time>'''

soup = BeautifulSoup(html, 'lxml')

Test:

>>> span_tag = soup.select('a span')
>>> len(span_tag)
2

>>> span_tag[0]
<span>
<mark class="search-item__highlight">Google</mark> warns of US national security risks from Huawei ban
</span>

>>> for tag in span_tag:
...     print(tag.text.strip())
...     
Google warns of US national security risks from Huawei ban
...
Google has warned the Trump administration it risks compromising US national security if it pushes ahead with sweeping export restrictions on Huawei, as the technology group seeks to continue doing...

mathieugrimbert · Jul-09-2019, 01:27 PM

Thank you snippsat. The last issue I have now is that I am not sure how to store those texts in an array or a dataframe.

article_body.append(tag.text.strip())

gives me <built-in method strip of unicode object at 0x0000000008E22BD0>

Thanks again for your help!

***snippsat*** · (This post was last modified: Jul-09-2019, 05:58 PM by snippsat.)

What is article_body: an empty list, or does it already contain something?
It works fine if I test with the code above, like this.

>>> span_tag = soup.select('a span')
>>> article_body = []
>>> for tag in span_tag:
...     article_body.append(tag.text.strip())
...     
>>> len(article_body)
2
>>> article_body[0]
'Google warns of US national security risks from Huawei ban'
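
The same loop can be written as a list comprehension, and " ".join(text.split()) can be used to collapse the newlines inside each span. This is only a sketch on shortened sample markup, with html.parser assumed in place of lxml. It also shows that a value such as <built-in method strip ...> is what Python displays for a method stored without being called, so missing parentheses are worth checking for; whether that was the cause here is not stated in the thread.

```python
from bs4 import BeautifulSoup

# Shortened sample markup based on the posted HTML
html = '''\
<a><span>
<mark>Google</mark> warns of US national security risks from Huawei ban
</span></a>
<a><span>
...
<mark>Google</mark> has warned the Trump administration
</span></a>'''

soup = BeautifulSoup(html, 'html.parser')
span_tag = soup.select('a span')

# split() breaks on any whitespace, join() puts single spaces back
article_body = [" ".join(tag.get_text().split()) for tag in span_tag]

# Without the parentheses the method object itself is stored, not the stripped text
not_called = span_tag[0].text.strip

print(article_body)
print(callable(not_called))
```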

mathieugrimbert · (This post was last modified: Jul-10-2019, 01:00 PM by mathieugrimbert.)

Thank you snippsat. It was due to another issue somewhere else. Now the only thing I have left to do is find a way to filter some of the data inside article_body (or tag.text.strip()), because with 'span' I am still picking up some data I don't need (I only sent you an extract of the HTML). Any advice would be appreciated!

Thanks again for your help

And finally, I would like to remove all the unicode characters, if you have a smart way of doing that.

Thanks again snippsat
