unicode symbol processing - Printable Version

unicode symbol processing - Pavel_47 - Dec-03-2019

Hello,

While scraping a web page I ran into a problem with unrecognized Unicode symbols.
Here is the original string:

Output:
978-1-4419-5905-8

Here is how it looks in the page that was read:

Output:
----

And here is the output when I execute text[ind0:ind1]:

Output:
\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640

So I have a couple of questions:

How can I detect that a particular fragment of text is not ASCII-encoded?
How can I convert it to ASCII?

Thanks.

RE: unicode symbol processing - snippsat - Dec-03-2019

What do you use to scrape this: Requests, BeautifulSoup?
Can we test this, e.g. can you give the URL?

(Dec-03-2019, 01:38 PM)Pavel_47 Wrote: And here is the output when I execute text[ind0:ind1]:

The error has already happened by this point, so there is no point in doing this.

>>> d = '\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'
>>> d
'\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'

Pavel_47 Wrote: How can I convert it to ASCII?

When you do web scraping, it is almost always Unicode you work with.

BeautifulSoup Wrote:Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode.
Beautiful Soup uses a sub-library called Unicode Dammit to detect a document’s encoding and convert it to Unicode.
The document is converted to Unicode, and HTML entities are converted to Unicode characters.

Python 3 has full Unicode support (all text is Unicode), so reading and writing Unicode text in Python should always use utf-8.

>>> s = "สวัสดีชาวโลก!"
>>> with open('uni.txt', 'w', encoding='utf-8') as f_out:
...     f_out.write(s)
...

>>> with open('uni.txt', encoding='utf-8') as f:
...     print(f.read())
...

Output:
สวัสดีชาวโลก!

RE: unicode symbol processing - Pavel_47 - Dec-03-2019

Sorry, I made a mistake when formulating the problem.
The problematic string (an ISBN code) comes from a PDF file.
I then use this string for scraping.
So the problem is to recognize that the ISBN is encoded in something other than ASCII, and then to convert it to ASCII.
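
The detection half of this could look roughly like the sketch below. It assumes the text has already been extracted from the PDF into a Python str; the helper name find_non_ascii is made up for illustration, and str.isascii() needs Python 3.7 or later.

```python
import unicodedata


def find_non_ascii(text):
    """Return (index, char, Unicode category) for every non-ASCII character."""
    # find_non_ascii is a hypothetical helper, not part of any library.
    return [(i, ch, unicodedata.category(ch))
            for i, ch in enumerate(text) if ord(ch) > 127]


# The string from this thread, written with escapes so it survives copy/paste.
extracted = ('\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641'
             '-\uf63d\uf641\uf639\uf63d-\uf640')

print(extracted.isascii())            # False
print('978-1-4419-5905-8'.isascii())  # True

hits = find_non_ascii(extracted)
print(len(hits))    # 13
print(hits[0])      # (0, '\uf641', 'Co')
```

Category Co means "private use": these code points have no standard meaning, so no codec can convert them to digits. Whatever they stand for is defined by the font embedded in the PDF, which is why a lookup table is needed for the conversion step.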

RE: unicode symbol processing - DeaD_EyE - Dec-04-2019

You can map it, but some digits are still missing from the mapping.
I guess there is a better solution. Maybe it's a special encoding or some fancy Unicode feature.

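result = '\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'
isbn = '978-1-4419-5905-8'

# Pair each extracted character with the known ISBN character at the same position
translation = {ord(c): ord(m) for c, m in zip(result, isbn)}

print(result.translate(translation))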

Output:
978-1-4419-5905-8

If you collect more data you can compare and verify.
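
Once derived, the mapping can be kept as a fixed table and reused on strings whose ISBN is not known in advance. The sketch below is only that: the table holds just the seven pairs seen in this one sample (2, 3 and 6 never occurred), the names PUA_DIGITS and to_ascii are made up for illustration, and it assumes other PDFs use the same private-use code points, which may not hold for a different font.

```python
# Pairs observed in this thread's single sample; digits 2, 3 and 6 are unknown.
PUA_DIGITS = {
    0xF639: '0', 0xF6DC: '1', 0xF63C: '4', 0xF63D: '5',
    0xF63F: '7', 0xF640: '8', 0xF641: '9',
}


def to_ascii(text, table=PUA_DIGITS):
    """Translate known characters; raise if any non-ASCII character remains."""
    converted = text.translate(table)
    unknown = sorted({ch for ch in converted if ord(ch) > 127})
    if unknown:
        # Better to fail loudly than to return a half-converted ISBN.
        raise ValueError('unmapped characters: %r' % unknown)
    return converted


extracted = ('\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641'
             '-\uf63d\uf641\uf639\uf63d-\uf640')
print(to_ascii(extracted))            # 978-1-4419-5905-8
print(to_ascii('978-1-4419-5905-8'))  # plain ASCII passes through unchanged
```

Each ValueError names a character that still needs an entry, so the table can be extended as more samples are collected.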

RE: unicode symbol processing - Pavel_47 - Dec-04-2019

Frankly, I did not understand your solution.
I do not know the ISBN a priori: I extract it from a PDF file.
In most cases it is ASCII-encoded, but there are cases where the encoding is not ASCII.
In the example above, I opened the PDF file in a reader and saw the ISBN.
I cannot do that for each file; otherwise, automating this with Python becomes pointless.
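
One way to check a converted string without opening each file is the ISBN-13 check digit: the 13 digits, weighted alternately by 1 and 3, must sum to a multiple of 10. A minimal sketch, assuming the extracted codes are ISBN-13 (ISBN-10 uses a different scheme); the function name isbn13_is_valid is made up for illustration.

```python
def isbn13_is_valid(isbn):
    """Check the ISBN-13 check digit; hyphens and other separators are ignored."""
    digits = [int(c) for c in isbn if c in '0123456789']
    if len(digits) != 13:
        return False
    # Weights alternate 1, 3, 1, 3, ... starting with the first digit.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


print(isbn13_is_valid('978-1-4419-5905-8'))  # True
print(isbn13_is_valid('978-1-4419-5905-7'))  # False
# An unconverted private-use string contains no ASCII digits, so it fails too.
print(isbn13_is_valid('\uf641\uf63f\uf640-\uf6dc'))  # False
```

A string that fails this check after conversion is a candidate for manual inspection, so only the problem files need to be opened by hand.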