unicode symbol processing

Pavel_47 · Dec-03-2019, 01:38 PM

Hello,

While scraping a web page I ran into a problem with unrecognized Unicode symbols.
Here is original string:

Output:
978-1-4419-5905-8

Here is how it looks in the page as read:

Output:
----

And here is the output when I execute text[ind0:ind1]:

Output:
\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640

So I have a couple of questions:

How can I detect that a particular fragment of text is not ASCII-encoded?
How can I convert it to ASCII?

Thanks.

***snippsat*** · (This post was last modified: Dec-03-2019, 03:28 PM by snippsat.)

What do you use to scrape this: Requests, BeautifulSoup?
Can we test this, e.g. can you give the URL?

(Dec-03-2019, 01:38 PM)Pavel_47 Wrote: And here is output when I execute text[ind0:ind1]:

The error has already happened, so there is no point in doing this.

>>> d = '\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'
>>> d
'\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'
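A minimal sketch of how the "is this ASCII?" check could look, assuming Python 3.7+ (for `str.isascii()`). The helper name `non_ascii_report` is made up for illustration. It also asks `unicodedata` which category the odd characters belong to:

```python
import unicodedata

# Escaped form of the fragment shown earlier in the thread.
fragment = ('\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-'
            '\uf63d\uf641\uf639\uf63d-\uf640')


def non_ascii_report(text):
    """Return (code point, Unicode category) for each non-ASCII character.

    Hypothetical helper, not part of any library.
    """
    return [(f'U+{ord(ch):04X}', unicodedata.category(ch))
            for ch in text if ord(ch) > 127]


print(fragment.isascii())                # is every character ASCII?
report = non_ascii_report(fragment)
print(len(report))                       # how many characters are not ASCII
print({cat for _, cat in report})        # which Unicode categories they fall in
```

Here every non-ASCII character falls in category `Co` (private use). Private-use code points have no standard meaning, so there is no general-purpose conversion to ASCII; what they stand for depends on whatever produced the text.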

Pavel_47 Wrote:How to convert it in ASCII ?

When web scraping, it is almost always Unicode you work with.

BeautifulSoup Wrote:Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode.
Beautiful Soup uses a sub-library called Unicode Dammit to detect a document’s encoding and convert it to Unicode.
The document is converted to Unicode, and HTML entities are converted to Unicode characters.

Python 3 has full Unicode support (all text is Unicode), so reading and writing text in Python should always use UTF-8.

>>> s = "สวัสดีชาวโลก!"
>>> with open('uni.txt', 'w', encoding='utf-8') as f_out:
...     f_out.write(s)
...     

>>> with open('uni.txt', encoding='utf-8') as f:
...     print(f.read())
...

Output:
สวัสดีชาวโลก!

Pavel_47 · Dec-03-2019, 07:33 PM

Sorry, I made a mistake when describing the problem.
The problematic string (an ISBN code) comes from a PDF file.
I then use this string for scraping.
So the problem is to recognize that the ISBN is encoded in something other than ASCII, and then to convert it to ASCII.

DeaD_EyE · Dec-04-2019, 01:31 AM

You can map it, but some digits are still missing: this ISBN contains no 2, 3 or 6, so a table built from it cannot translate them.
I guess there is a better solution. Maybe it's a special encoding or some fancy Unicode feature.

result = '\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'
isbn = '978-1-4419-5905-8'

translation = {ord(c): ord(m) for c,m in zip(result, isbn)}


print(result.translate(translation))

Output:
978-1-4419-5905-8

If you collect more data you can compare and verify.
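As a rough sketch of "compare and verify": the table can be built from several known pairs and checked for conflicts, and a translated result can be tested with the ISBN-13 check digit without opening the PDF. The names `build_table`, `isbn13_is_valid` and `known_pairs` are hypothetical, and the sketch assumes that every PDF uses the same private-use mapping, which nothing in this thread confirms.

```python
def build_table(pairs):
    """Build a str.translate table from (garbled, expected) pairs.

    Raises ValueError if two pairs disagree about the same character.
    """
    table = {}
    for garbled, expected in pairs:
        if len(garbled) != len(expected):
            raise ValueError('pair lengths differ')
        for g, e in zip(garbled, expected):
            if table.setdefault(ord(g), e) != e:
                raise ValueError(f'conflicting mapping for U+{ord(g):04X}')
    return table


def isbn13_is_valid(isbn):
    """Check the ISBN-13 checksum (weights 1, 3, 1, 3, ...), ignoring hyphens."""
    digits = [int(c) for c in isbn if c in '0123456789']
    if len(digits) != 13:
        return False
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


garbled = ('\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-'
           '\uf63d\uf641\uf639\uf63d-\uf640')
known_pairs = [(garbled, '978-1-4419-5905-8')]   # add more pairs as you find them

table = build_table(known_pairs)
candidate = garbled.translate(table)
leftover = [ch for ch in candidate if ord(ch) > 127]  # characters with no mapping yet

print(candidate, isbn13_is_valid(candidate), leftover)
```

A string that still has leftover characters, or that fails the checksum, is a sign that the table is incomplete (for example the digits 2, 3 and 6 here) or that the file uses a different mapping.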

Pavel_47 · Dec-04-2019, 09:43 AM

Frankly, I did not understand your solution.
I do not know the ISBN a priori: I extract it from a PDF file.
In most cases it is ASCII-encoded, but there are cases where the encoding is not ASCII.
In the example above, I opened the PDF file in a reader and saw the ISBN.
I cannot do that for each file; otherwise, automating this with Python becomes pointless.
