Python Forum
BS4 split text apart from strong tag
#1
Hi All,

I need a little help to finish my script. I am scraping a table from the internet. It has keys like "web", followed by the website link and other information.
Example of an entry:
   

The target line has this HTML code:
Output:
<div class="price-data">Sub-Industry<strong>Semiconductors</strong></div>
I reached the line with my code:
tableContent = stockData.find_all(class_='price-data')
# find_all returns a list of tags, so call get_text() on each one
for entry in tableContent:
    print(entry.get_text())
The result was like this:

Output:
Sub-IndustrySemiconductors
I got everything in one string, which doesn't help much.
I want "Sub-Industry" in one entry and "Semiconductors" in another.
I know I could split this with some string manipulation code, but the keys in the table are not known in advance.
Can BS4 do it?
Any help is appreciated.
Reply
#2
In order to get text from individual elements, you need to get down to the element level.
It's mighty hard to anticipate what your code looks like, but you need to find the terminal nodes.
Please supply more code, the URL, etc.
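For the single line quoted in the question, getting down to the terminal text nodes might look like the sketch below. It assumes the div contains nothing more than what was posted, and it uses the stripped_strings generator, which yields each text node under a tag separately.

```python
from bs4 import BeautifulSoup

# The one line of HTML quoted in the question
html = '<div class="price-data">Sub-Industry<strong>Semiconductors</strong></div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find(class_='price-data')
# stripped_strings yields every text node under the div, one at a time,
# with surrounding whitespace removed
parts = list(div.stripped_strings)
print(parts)  # ['Sub-Industry', 'Semiconductors']
```

If the real page nests more tags inside the div, the list will have more than two items, so the rest of the code would be needed to say how to pair them up.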
Reply
#3
You can easily get the content of the strong tag:
>>> from bs4 import BeautifulSoup as BS
>>> html = '''<div class="price-data">Sub-Industry<strong>Semiconductors</strong></div>'''
>>> soup = BS(html, 'html.parser')
>>> soup.strong.contents
['Semiconductors']
So my first thought would be to parse it out of the main string:
>>> soup.text.replace(soup.strong.contents[0], '')
'Sub-Industry'
However, that won't work if the bold text also appears elsewhere in the string.
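A small sketch of that failure, assuming a div where the word in bold also appears in the plain text before it. str.replace removes every occurrence, so the label loses a word it should have kept.

```python
from bs4 import BeautifulSoup

# Same word inside and outside the <strong> tag
html = '<div class="price-data">Sub-Industry Semiconductors<strong>Semiconductors</strong></div>'
soup = BeautifulSoup(html, 'html.parser')

bold = soup.strong.contents[0]
# replace() strips both occurrences, not just the one inside <strong>
label = soup.text.replace(bold, '')
print(repr(label))  # 'Sub-Industry '
```

The expected label here would be 'Sub-Industry Semiconductors', so the replace approach is only safe when the bold text is unique within the string.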

You could also try extracting it:
>>> soup
<div class="price-data">Sub-Industry<strong>Semiconductors</strong></div>
>>> soup.strong
<strong>Semiconductors</strong>
>>> strong = soup.strong.extract()
>>> strong
<strong>Semiconductors</strong>
>>> soup
<div class="price-data">Sub-Industry</div>
>>> soup.text
'Sub-Industry'
>>> strong
<strong>Semiconductors</strong>
>>> strong.text
'Semiconductors'
With this method it does not matter whether the same word also appears outside the strong tag:
>>> html = '''<div class="price-data">Sub-Industry Semiconductors<strong>Semiconductors</strong></div>''' 
>>> soup = BS(html, 'html.parser')
>>> soup
<div class="price-data">Sub-Industry Semiconductors<strong>Semiconductors</strong></div>
>>> strong = soup.strong.extract()
>>> strong
<strong>Semiconductors</strong>
>>> soup.text
'Sub-Industry Semiconductors'
>>> strong.text
'Semiconductors'
The downside of extract is that it actually modifies the tree, which is a problem if you need the original tree later.
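If the tree has to stay intact, one possible alternative is to read the div's direct text children instead of extracting anything. The sketch below rests on some assumptions: every price-data div holds a plain-text key followed by one strong tag, the second entry ("Sector") is made up for illustration, and split_entry is a hypothetical helper name. The string= argument needs Beautiful Soup 4.4.0 or newer (older versions call it text=).

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second entry is invented for illustration
html = '''
<div class="price-data">Sub-Industry<strong>Semiconductors</strong></div>
<div class="price-data">Sector<strong>Information Technology</strong></div>
'''
soup = BeautifulSoup(html, 'html.parser')

def split_entry(div):
    """Return (key, value) for one price-data div without modifying the tree."""
    value = div.strong.get_text(strip=True)
    # recursive=False keeps only the div's own text nodes,
    # so the text inside <strong> is not included in the key
    key = ''.join(div.find_all(string=True, recursive=False)).strip()
    return key, value

# The keys do not need to be known in advance
table = dict(split_entry(div) for div in soup.find_all(class_='price-data'))
print(table)
```

Because nothing is extracted, soup still contains every strong tag afterwards and can be searched again.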
Reply
#4
(Jul-24-2019, 03:01 AM)metulburr Wrote: You can easily get the content of the tag strong
That is very helpful.
Reply

