Python Forum
unicode symbol processing
#1
Hello,

While scraping a web page, I ran into a problem with unrecognized Unicode symbols.
Here is the original string:
Output:
978-1-4419-5905-8
Here is how it looks in the page as read:
Output:
----
And here is the output when I execute text[ind0:ind1]:
Output:
\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640
So I have a couple of questions:
  1. How can I detect that a particular fragment of text is not ASCII-encoded?
  2. How can I convert it to ASCII?
Thanks.
#2
What do you use to scrape this: Requests, BeautifulSoup?
Can we test this, e.g. can you give the URL?
(Dec-03-2019, 01:38 PM)Pavel_47 Wrote: And here is output when I execute text[ind0:ind1]:
The error has already happened, so there is no point in doing this.
>>> d = '\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'
>>> d
'\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'
Pavel_47 Wrote: How to convert it in ASCII ?
When web scraping, it is almost always Unicode that you work with.
BeautifulSoup Wrote: Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode.
Beautiful Soup uses a sub-library called Unicode Dammit to detect a document’s encoding and convert it to Unicode.
The document is converted to Unicode, and HTML entities are converted to Unicode characters.
Python 3 has full Unicode support (all text is Unicode), so text read into or written out of Python should always use UTF-8.
>>> s = "สวัสดีชาวโลก!"
>>> with open('uni.txt', 'w', encoding='utf-8') as f_out:
...     f_out.write(s)
...     

>>> with open('uni.txt', encoding='utf-8') as f:
...     print(f.read())
... 
Output:
สวัสดีชาวโลก!
#3
Sorry, I made a mistake when formulating the problem.
The problematic string (an ISBN code) comes from a PDF file.
I then use this string for scraping.
So the problem is to recognize that the ISBN is encoded in something other than ASCII, and then to convert it to ASCII.
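The detection half of this can be sketched with the standard library alone. This is a minimal sketch, not a definitive method: the helper name `describe_non_ascii` is hypothetical, `str.isascii()` requires Python 3.7 or later, and the sample string is the escaped output from post #1.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (char, code point, Unicode category) for every non-ASCII character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.category(ch))
            for ch in text if ord(ch) > 127]

# Escaped form of the garbled ISBN shown in post #1
garbled = ("\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-"
           "\uf63d\uf641\uf639\uf63d-\uf640")

print(garbled.isascii())              # False
print("978-1-4419-5905-8".isascii())  # True

# Category "Co" means "private use": the code point has no standard meaning,
# so no general-purpose conversion to ASCII exists for it.
print({cat for _, _, cat in describe_non_ascii(garbled)})  # {'Co'}
```

Because all the odd characters here fall in the private use category, their meaning depends on the font or tool that produced them, which is why a lookup table is needed rather than a standard decoding step.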
#4
You can map it, but some digits are still missing (2, 3 and 6 do not occur in this ISBN).
I guess there is a better solution. Maybe it's a special encoding or some fancy Unicode feature.


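# Garbled string as extracted (escaped form from post #1)
result = '\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'
# The same ISBN as it is meant to read
isbn = '978-1-4419-5905-8'

# Map each garbled code point to the code point of the matching ASCII character
translation = {ord(c): ord(m) for c, m in zip(result, isbn)}

print(result.translate(translation))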
Output:
978-1-4419-5905-8
If you collect more data you can compare and verify.
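One way to do that comparison is to build the table from several known (garbled, expected) pairs and fail loudly when two samples disagree. This is only a sketch: `build_mapping` is a hypothetical helper, and the single pair below is the one from this thread, so more pairs would have to be added by hand.

```python
def build_mapping(pairs):
    """Build a str.translate() table from (garbled, expected) string pairs.

    Raises ValueError if a pair differs in length or if one garbled
    character would have to map to two different characters.
    """
    mapping = {}
    for garbled, expected in pairs:
        if len(garbled) != len(expected):
            raise ValueError(f"length mismatch: {garbled!r} vs {expected!r}")
        for g, e in zip(garbled, expected):
            # setdefault stores e the first time and returns the stored value after
            if mapping.setdefault(ord(g), e) != e:
                raise ValueError(f"conflicting mapping for U+{ord(g):04X}")
    return mapping

pairs = [
    ("\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640",
     "978-1-4419-5905-8"),
]

table = build_mapping(pairs)
print(pairs[0][0].translate(table))  # 978-1-4419-5905-8

# Digits that no sample has covered yet
digits_missing = sorted(set("0123456789") - set(table.values()))
print(digits_missing)  # ['2', '3', '6']
```

A conflict between two samples would suggest that the mapping depends on the document, for example on the font embedded in each PDF.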
#5
Frankly, I did not understand your solution.
A priori I do not know the ISBN: I extract it from a PDF file.
In most cases it is ASCII-encoded, but there are cases where the encoding is not ASCII.
In the example above, I opened the PDF file in a reader and saw the ISBN.
I cannot do that for each file; otherwise, automating this with Python becomes useless.
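For unattended use, a table learned once could be applied to every extracted string, with anything it cannot convert reported instead of guessed. This is a sketch under strong assumptions: `KNOWN` holds only the seven digits observed in this thread, other PDFs may use different code points, and `to_ascii_isbn` is a hypothetical helper name.

```python
# Mapping observed in this thread only. Digits 2, 3 and 6 were never seen,
# and another PDF may use different code points (assumption).
KNOWN = {
    0xF639: "0", 0xF6DC: "1", 0xF63C: "4", 0xF63D: "5",
    0xF63F: "7", 0xF640: "8", 0xF641: "9",
}

def to_ascii_isbn(text, table=KNOWN):
    """Return (converted text, list of code points that could not be converted)."""
    if text.isascii():
        return text, []  # the common case: nothing to do
    converted = text.translate(table)
    unknown = sorted({f"U+{ord(ch):04X}" for ch in converted if not ch.isascii()})
    return converted, unknown

garbled = ("\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-"
           "\uf63d\uf641\uf639\uf63d-\uf640")

print(to_ascii_isbn("978-1-4419-5905-8"))  # ('978-1-4419-5905-8', [])
print(to_ascii_isbn(garbled))              # ('978-1-4419-5905-8', [])

# An unseen code point is reported so the file can be checked by hand
_, unknown = to_ascii_isbn("\uf641\uf63a")
print(unknown)                             # ['U+F63A']
```

Only the files that come back with a non-empty list would need to be opened in a reader, and each one checked that way can extend the table.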