UTF-8 decoding fails every time I try

adnanahsan · Aug-23-2019, 02:07 AM

Hi guys,

Whenever I try to decode utf-8 in Python 3, I always get a UnicodeDecodeError along the lines of "utf-8 codec can't decode byte ...".

For example, here is some very simple code. It runs perfectly on my Windows PC and on Ubuntu in VMware on the same PC, so it works on both platforms locally, but on my live server I get an error. Example below:

      test = ' RelÃ³gio feminino dourado '
      label =  test.encode('latin-1').decode('utf-8')

As soon as it reaches the second line, it throws the error (only on the Linux Ubuntu servers).
I am so tired of fighting with this; kindly help me out. I am using Python 3. My default locale on Windows is ('en_US', 'cp1252'); on VMware Ubuntu it is utf-8, and on the Ubuntu servers it is also utf-8. It works fine on my VMware Ubuntu, but not on the servers. Please help, I will be waiting for your reply.
Thanks

justinram11 · Aug-23-2019, 05:41 AM

Hey Adnanahsan,

I'm not exactly sure what you are trying to do, but I think you may be mixing up the purpose of encoding and decoding.

When you "encode" a string, what you are really doing is changing it to a specific set of 1's and 0's.

So for example, a string '³' when encoded into utf-8 produces:

from bitstring import BitArray
test = '³'
encoded = test.encode('utf-8')

print(BitArray(encoded).bin)
1100 0010 1011 0011

While encoding it into latin-1 produces:

from bitstring import BitArray
test = '³'
encoded = test.encode('latin-1')

print(BitArray(encoded).bin)
1011 0011

But when you decode something, what you are doing is taking the 1's and 0's and turning them back into actual letters that Python can understand. As shown above, however, the 1's and 0's for the same character are not the same in utf-8 and latin-1.

So what you are doing is taking a string, producing 1's and 0's in the latin-1 format, and then asking Python to read those 1's and 0's as if they were in the utf-8 format. That only succeeds when the latin-1 bytes happen to form valid utf-8; when they don't, Python raises UnicodeDecodeError.
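
The byte comparison above can also be reproduced with the standard library alone. The sketch below additionally runs the round trip from the first post on a garbled string and on an already-correct one, which suggests the failure may depend on what the scraped string actually contains. The variable names are illustrative and not from the thread.

```python
# Bytes produced for the same character under the two encodings.
char = '³'
utf8_bytes = char.encode('utf-8')      # two bytes
latin1_bytes = char.encode('latin-1')  # one byte
print(utf8_bytes.hex(), latin1_bytes.hex())

# A string that was UTF-8 but got decoded as latin-1 can be reversed,
# because its latin-1 bytes are exactly the original UTF-8 bytes.
garbled = ' RelÃ³gio feminino dourado '
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired)

# A string that is already correct cannot: 'ó' becomes the single byte
# 0xF3 in latin-1, and that byte followed by 'g' is not valid UTF-8.
correct = ' Relógio feminino dourado '
try:
    correct.encode('latin-1').decode('utf-8')
    round_trip_failed = False
except UnicodeDecodeError as exc:
    round_trip_failed = True
    print('decode failed:', exc)
```

If the second case is what happens on the server, the text arriving there is already correct Unicode and needs no repair.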

adnanahsan · Aug-23-2019, 10:06 AM

I am getting the string from scraping a web page, and it is not returning valid utf-8 on Linux, although it works fine on Windows. So I am trying to encode that garbled string into latin-1 and then decode it as valid utf-8.

DeaD_EyE · (This post was last modified: Aug-23-2019, 12:09 PM by DeaD_EyE.)

The encoding you're using for your string is broken.
You can fix broken encodings with ftfy.

>>> import ftfy
>>> s = ' RelÃ³gio feminino dourado '
>>> s
' RelÃ³gio feminino dourado '
>>> ftfy.fix_encoding(s)
' Relógio feminino dourado '
>>>

Encoding fixed.
Fixing broken encodings without this module is not easy.
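
For the single case shown in this thread, a rough stand-in can be sketched with the standard library. `fix_mojibake` is a hypothetical helper, not part of ftfy, and it only handles UTF-8 text that was mis-decoded as latin-1; ftfy covers many more kinds of damage.

```python
def fix_mojibake(text):
    """Try to undo UTF-8 text that was wrongly decoded as latin-1.

    Hypothetical helper for illustration. Returns the input unchanged
    when the round trip is not possible.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin-1, or its
        # latin-1 bytes are not valid UTF-8: assume it was not garbled.
        return text


print(fix_mojibake(' RelÃ³gio feminino dourado '))
print(fix_mojibake(' Relógio feminino dourado '))
```

Falling back to the original string means the helper can be applied to text that may or may not be garbled without raising.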

***snippsat*** · (This post was last modified: Aug-23-2019, 12:18 PM by snippsat.)

(Aug-23-2019, 10:06 AM)adnanahsan Wrote: I am getting the string from scrapping webpage, its not returning valid utf-8

The first step is to get the data with the right encoding.
Using e.g. Requests in combination with BeautifulSoup will get the correct encoding in almost all cases.
Requests will return the site's encoding.

>>> import requests

>>> url = 'http://CNN.com'
>>> response = requests.get(url)
>>> response.encoding
'utf-8'

Parsing would look like this.

import requests
from bs4 import BeautifulSoup

url = 'http://CNN.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)

Output:
CNN International - Breaking News, US News, World News and Video

from bs4 import BeautifulSoup

html = '''\
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Page Title</title>
</head>
<body>
  <h1>This is a Heading</h1>
  <p>Relógio feminino dourado</p>
</body>'''

soup = BeautifulSoup(html, 'lxml')
label = soup.find('p')

>>> label
<p>Relógio feminino dourado</p>

>>> label.text
'Relógio feminino dourado'

Remember that once you have the correct value in Python 3, it is Unicode, as all strings in Python 3 are of this type.
Whether the data came from utf-8, latin-1, etc. does not matter if it looks correct in Python 3.
When moving text in and out of Python or the server, always use utf-8.
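
As a minimal sketch of "always use utf-8 in and out", the example below writes and reads a file with an explicit encoding instead of relying on the platform default. The temporary file path is an assumption made only for illustration.

```python
import os
import tempfile

label = 'Relógio feminino dourado'

# Throwaway location for the demo file (illustrative only).
path = os.path.join(tempfile.mkdtemp(), 'label.txt')

# Pass the encoding explicitly so the result does not depend on the
# machine's locale settings.
with open(path, 'w', encoding='utf-8') as f:
    f.write(label)

with open(path, encoding='utf-8') as f:
    read_back = f.read()

print(read_back == label)
```

Without the `encoding` argument, `open()` falls back to the locale's preferred encoding, which is one way the same script can behave differently on two machines.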

adnanahsan · (This post was last modified: Aug-23-2019, 02:29 PM by adnanahsan.)

@DeaD_EyE: I tried, but it didn't work: http://prntscr.com/owkxi3
Is it possible that Python may fail to correct the encoding?

@snippsat: You are right, that is also what I wanted to figure out. I am using Selenium, grabbing the page via driver.get(url), and that is returning this badly encoded content. The problem occurs only on my Ubuntu server; it works fine on my Windows PC and on Ubuntu in VMware on the same PC, but not on the live server.

Here is my source code.
It works on my Windows PC.
It works on Ubuntu installed in VMware on my Windows PC.
It doesn't work on my servers.

# -*- coding: utf-8 -*-

from seleniumwire import webdriver
import ftfy

# Chrome options for running headless on a server
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-certificate-errors-spki-list')
options.add_argument('--ignore-ssl-errors')
options.add_argument("--headless")
options.add_argument("--window-size=1920x1080")
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
chrome_path = '/var/www/chromedriver'
driver = webdriver.Chrome(chrome_path,  options=options)

driver.get("http://www.correios.com.br/solucoes-empresariais/correios-facil")
driver.implicitly_wait(10)

# Print the text of every <p> element after passing it through ftfy
a = driver.find_elements_by_css_selector("p")
for i in a:
    s = i.text
    s = ftfy.fix_encoding(s)
    print(s)

driver.quit()

PS: It shows the correct encoding in the browser; the problem is on the console/terminal. I need to match encoded strings, so I need it to work in a file and on the terminal.

***snippsat*** · Aug-23-2019, 04:27 PM

Is there any reason why you use Selenium here?
When I test, I can get the text without it. You should only use Selenium if needed.

import requests
from bs4 import BeautifulSoup

url = "http://www.correios.com.br/solucoes-empresariais/correios-facil"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
texto = soup.find('div', class_="interna-01")
first_p = texto.find('p')

Test:

>>> first_p
<p>Com as soluções de um grande operador logístico, a sua empresa pode se destacar e crescer ainda mais. Fortaleça seu negócio, tornando-se um parceiro dos Correios.</p>


>>> print(first_p.text)
Com as soluções de um grande operador logístico, a sua empresa pode se destacar e crescer ainda mais. Fortaleça seu negócio, tornando-se um parceiro dos Correios.

adnanahsan Wrote: It doesn't work on my servers

What server are you running? Are you using a Python web framework, e.g. Flask, Django, or something else?

adnanahsan · Aug-23-2019, 04:47 PM

I am using Flask, and the server is Ubuntu 18 Server.
Regarding Selenium, I have to automate the browser, because the target site shows data via AJAX on user actions.

The problem seems to be with the terminal/console encoding. In the browser it looks correct, but I need it to work on the console/terminal too.
Any idea why it looks garbled on the terminal?

***snippsat*** · (This post was last modified: Aug-23-2019, 09:28 PM by snippsat.)

(Aug-23-2019, 04:47 PM)adnanahsan Wrote: Problem seems to be with terminal console encoding.

Check your terminal encoding. Here are some tests you can do; I use Linux Mint 19 here.

tom@tom:~$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=sv_SE.UTF-8
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=sv_SE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=sv_SE.UTF-8
LC_NAME=sv_SE.UTF-8
LC_ADDRESS=sv_SE.UTF-8
LC_TELEPHONE=sv_SE.UTF-8
LC_MEASUREMENT=sv_SE.UTF-8
LC_IDENTIFICATION=sv_SE.UTF-8
LC_ALL=

tom@tom:~$ python
Python 3.7.3 (default, Apr 17 2019, 11:23:54) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> 
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'
>>> exit()

tom@tom:~$ python -c "import sys; print(sys.stdout.encoding)"
UTF-8
tom@tom:~$ python -c "print('Spicy jalapeño ☂')"
Spicy jalapeño ☂
tom@tom:~$ python -c "print('Relógio feminino dourado')"
Relógio feminino dourado
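
The same checks can be collected in one script and run from inside the process that shows the problem (for example the Flask app), since that process may not start with the same environment as an interactive shell. This is a sketch; `encoding_report` is a hypothetical helper name.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings that influence how print() encodes text."""
    return {
        'stdout': sys.stdout.encoding,
        'preferred': locale.getpreferredencoding(False),
        'filesystem': sys.getfilesystemencoding(),
        'LANG': os.environ.get('LANG'),
        'LC_ALL': os.environ.get('LC_ALL'),
        'PYTHONIOENCODING': os.environ.get('PYTHONIOENCODING'),
    }


report = encoding_report()
for name, value in report.items():
    print(f'{name}: {value}')
```

Comparing this report between the working machine and the server should show whether the two processes really use the same encodings.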

adnanahsan · Aug-23-2019, 11:54 PM

I am getting exactly the same results as you gave. I think something is wrong with the terminal. Is that possible? Can you try my code on your Ubuntu?
In the browser it looks fine, but on the terminal something is wrong. It is a strange issue I am facing :(

For example, I see " BebÃªs " in the terminal instead of " Bebês ".
I see Bebês correctly in the browser, but not on my server terminals.
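
The garbled form reported here is what UTF-8 bytes look like when they are read as latin-1, which the sketch below reproduces. It also prints each string with `ascii()`, one way to check whether the string itself is garbled or only its display, because the escaped output does not depend on the terminal's encoding. The variable names are illustrative.

```python
good = 'Bebês'

# UTF-8 bytes interpreted as latin-1 give the form seen in the terminal.
garbled = good.encode('utf-8').decode('latin-1')
print(garbled)

# ascii() escapes every non-ASCII character, so the result can be read
# reliably on any terminal.
print(ascii(good))
print(ascii(garbled))
```

If `ascii()` on the scraped text shows `\xea`, the string in Python is correct and only the terminal output is wrong; if it shows `\xc3\xaa`, the string was already garbled before printing.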
