Python Forum
Weird characters scraped
#1
Hi, I am trying to scrape a web page, but the product names come back with weird characters. Here is what the page source shows and what Beautiful Soup scrapes for me.

This is how it looks in the web page through Inspect:
<strong>Clamps P&S Black 100</strong>
And this is what BeautifulSoup scrapes:
<strong>Clamps P&amp;S Black 100</strong>
How do I get it right?

Thank you
#2
Hi @samuelbachorik,

&amp; is not a "weird" character; it is an "HTML character entity reference". Damn, that's a long name...
https://en.wikipedia.org/wiki/List_of_XM...references

If you plan to use the extracted data outside HTML, you can decode those character entity references back into the plain characters they stand for (here, &amp; becomes &).
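
To see where the &amp; comes from, here is a minimal stdlib sketch of the round trip. The product name is the one from the question; the variable names are only illustrative.

```python
import html

name = "Clamps P&S Black 100"  # the text as a reader sees it

# Writing text into HTML escapes the ampersand as an entity reference.
escaped = html.escape(name)
print(escaped)

# Decoding the entity reference gives the original text back.
decoded = html.unescape(escaped)
print(decoded == name)
```

So the markup with &amp; and the text with & describe the same product name; only the representation differs.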

Cheers,
#3
When you use .text to get the content out of the tag, it will be OK. The &amp; only appears when the tag itself is printed as markup.
>>> from bs4 import BeautifulSoup
>>> html = '<strong>Clamps P&S Black 100</strong>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> tag = soup.select_one('strong')
>>> tag
<strong>Clamps P&amp;S Black 100</strong>
>>> tag.text
'Clamps P&S Black 100'
#4
With stdlib from Python:
import html


s = "<strong>Clamps P&amp;S Black 100</strong>"
text = html.unescape(s)

print(text)
Output:
<strong>Clamps P&S Black 100</strong>
But BeautifulSoup already does this when you read the tag's text, so you should just use that third-party library.
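
Note that html.unescape on a whole snippet leaves the tags in place. If you want only the text and prefer to stay in the stdlib, a rough sketch with html.parser.HTMLParser could look like this. TextExtractor and its text() method are names made up for this example; they are not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entity references
        # such as &amp; before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found outside of tags.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


parser = TextExtractor()
parser.feed("<strong>Clamps P&amp;S Black 100</strong>")
parser.close()
print(parser.text())
```

For real pages with nested tags, scripts and broken markup, BeautifulSoup is still the more comfortable choice.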