unicode symbol processing - Python Forum (https://python-forum.io), Web Scraping & Web Development, thread /thread-22922.html
unicode symbol processing - Pavel_47 - Dec-03-2019

Hello,

While scraping a web page I ran into a problem with Unicode symbols that are not recognized. Here is the original string:

Here is how it looks on the page as displayed:

And here is the output when I execute text[ind0:ind1]:

So I have a couple of questions:
RE: unicode symbol processing - snippsat - Dec-03-2019

What do you use to scrape this: Requests, BeautifulSoup? Can we test it, e.g. can you give the URL?

(Dec-03-2019, 01:38 PM)Pavel_47 Wrote: And here is output when I execute text[ind0:ind1]:

The error has already happened by then, so there is no point in doing this.

>>> # The characters do not display, so they are written here as escapes
>>> d = '\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'
>>> d
'\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'

Pavel_47 Wrote: How to convert it to ASCII?

When you do web scraping, it is almost always Unicode you work with.

BeautifulSoup Wrote: Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.

Python 3 has full Unicode support (all text is Unicode), so reading and writing Unicode in and out of Python should always use utf-8.

>>> s = "สวัสดีชาวโลก!"
>>> with open('uni.txt', 'w', encoding='utf-8') as f_out:
...     f_out.write(s)
...
>>> with open('uni.txt', encoding='utf-8') as f:
...     print(f.read())
...
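To see what those escapes actually are, it can help to look at each character's code point and general category. The sketch below is only an illustration: `describe` is a made-up helper name, and the string is copied from the repr above. All seven distinct non-ASCII characters fall in the Private Use Area (U+E000 to U+F8FF, category `Co`), which has no standard meaning, so no codec can turn them into digits on its own.

```python
import unicodedata

# Escapes copied from the repr shown in the post above.
d = '\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'


def describe(text):
    """Return (code point, Unicode category) for each distinct non-ASCII character.

    Hypothetical helper, not part of any library.
    """
    rows = []
    for ch in sorted(set(text)):
        if not ch.isascii():  # str.isascii() needs Python 3.7+
            rows.append((f"U+{ord(ch):04X}", unicodedata.category(ch)))
    return rows


for code_point, category in describe(d):
    print(code_point, category)  # every category here is 'Co' (private use)
```

So re-encoding or decoding the string differently is unlikely to help; some lookup table specific to the source document is needed.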
RE: unicode symbol processing - Pavel_47 - Dec-03-2019

Sorry, I made a mistake when formulating the problem. The problematic string (an ISBN) comes from a PDF file. I then use this string for scraping. So the problem consists of recognizing that the ISBN is encoded in something other than ASCII, and then converting it to ASCII.

RE: unicode symbol processing - DeaD_EyE - Dec-04-2019

You can map it. But some digits are still missing. I guess there is a better solution. Maybe it's a special encoding or fancy Unicode stuff.

# The extracted string, written with escapes because the characters do not display
result = '\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640'
isbn = '978-1-4419-5905-8'
# Map each extracted code point to the code point of the known character
translation = {ord(c): ord(m) for c, m in zip(result, isbn)}
print(result.translate(translation))

If you collect more data you can compare and verify.
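The "collect more data, compare and verify" idea could be sketched roughly as follows. `build_table` is a hypothetical helper, and the only sample pair used is the one from this thread; further pairs would be appended to `pairs` as they are verified by hand.

```python
def build_table(pairs):
    """Merge (garbled, known) string pairs into one table for str.translate.

    Raises ValueError if a pair has mismatched lengths or if two samples
    disagree about what a character stands for.
    """
    table = {}
    for garbled, known in pairs:
        if len(garbled) != len(known):
            raise ValueError(f"length mismatch for {known!r}")
        for g, k in zip(garbled, known):
            if g == k:
                continue  # already plain, e.g. the hyphens
            # setdefault returns the existing entry if there is one
            if table.setdefault(ord(g), k) != k:
                raise ValueError(f"conflicting mapping for U+{ord(g):04X}")
    return table


# The single verified sample from this thread.
pairs = [
    ('\uf641\uf63f\uf640-\uf6dc-\uf63c\uf63c\uf6dc\uf641-\uf63d\uf641\uf639\uf63d-\uf640',
     '978-1-4419-5905-8'),
]

table = build_table(pairs)
missing = sorted(set('0123456789') - set(table.values()))
print(missing)  # ['2', '3', '6'] -> digits no sample has covered yet
```

A conflict raised here would suggest the files do not all share one encoding, which is worth knowing before trusting the table.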
RE: unicode symbol processing - Pavel_47 - Dec-04-2019

Frankly, I did not understand your solution. I do not know the ISBN a priori: I extract it from a PDF file. In most cases it is ASCII encoded, but there are cases where the encoding is not ASCII. In the example above, I opened the PDF file in the reader and saw the ISBN. I cannot do that for each file, otherwise automating this with Python becomes pointless.
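One way to avoid opening each PDF by hand might be to derive the table once from a sample that was checked manually, then apply it to every extracted string and flag anything that is still not ASCII. This is a minimal sketch, assuming the same substitute characters are used across files (the thread does not confirm this). `KNOWN` and `to_ascii` are hypothetical names, and the second input below is made up for illustration.

```python
# Table derived from the one verified sample in this thread.
# Assumption: other files use the same substitute characters.
# The digits 2, 3 and 6 are not covered yet.
KNOWN = {
    0xF639: '0',
    0xF6DC: '1',
    0xF63C: '4',
    0xF63D: '5',
    0xF63F: '7',
    0xF640: '8',
    0xF641: '9',
}


def to_ascii(text, table=KNOWN):
    """Translate known characters; return (converted_text, unresolved_chars)."""
    converted = text.translate(table)
    # Anything still outside ASCII needs a new table entry.
    unresolved = sorted({ch for ch in converted if not ch.isascii()})
    return converted, unresolved


# Plain ASCII input passes through unchanged.
print(to_ascii('978-1-4419-5905-8'))

# Made-up input containing one character the table does not know.
text, unresolved = to_ascii('\uf641\uf63f\uf640-\uf639-\uf63a')
if unresolved:
    print("needs manual check:", [f"U+{ord(ch):04X}" for ch in unresolved])
```

Only the strings reported as unresolved would then need a look in the PDF reader, and each one checked extends the table for later files.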