UTF-8 decoding fails every time I try

adnanahsan · (This post was last modified: Aug-23-2019, 02:29 PM by adnanahsan.)

I tried that, but it didn't work: http://prntscr.com/owkxi3
Is it possible that Python can fail to correct the encoding?

(Aug-23-2019, 12:09 PM)DeaD_EyE Wrote: The encoding you're using for your string is broken.
You can fix broken encodings with ftfy.
>>> import ftfy
>>> s = ' RelÃ³gio feminino dourado '
>>> s
' RelÃ³gio feminino dourado '
>>> ftfy.fix_encoding(s)
' Relógio feminino dourado '
>>> 
Encoding fixed...
Fixing broken encodings without this module is not easy.
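For the simple case shown in the quote, a manual repair can be sketched as follows. This assumes the text is UTF-8 that was mis-decoded exactly once as Latin-1; the helper name fix_mojibake is hypothetical and not part of ftfy, which handles many more cases (mixed encodings, Windows-1252, repeated damage).

```python
def fix_mojibake(text, wrong_encoding="latin-1"):
    """Undo one round of UTF-8 bytes that were mis-decoded as `wrong_encoding`.

    Hypothetical helper for illustration only; it is not an ftfy function.
    """
    try:
        # Recover the original bytes, then decode them the way they were meant.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not mojibake of this kind, so leave it unchanged.
        return text


broken = " Rel\u00c3\u00b3gio feminino dourado "  # ' RelÃ³gio feminino dourado '
fixed = fix_mojibake(broken)
print(fixed)
```

If the round trip fails, the string is returned unchanged, so text that is already correct is not damaged by this sketch.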

You are right, that's also what I wanted to figure out. I am using Selenium and grabbing the page via driver.get(url), and that returns this badly encoded content. The problem occurs only on my Ubuntu server: it works fine on my Windows machine and on Ubuntu running in VMware on that Windows PC, but not on the live server.

(Aug-23-2019, 12:18 PM)snippsat Wrote:
(Aug-23-2019, 10:06 AM)adnanahsan Wrote: I am getting the string from scraping a webpage; it's not returning valid UTF-8.
As a first step, you should try to get the data with the right encoding.
Using e.g. Requests in combination with BeautifulSoup will get the correct encoding in almost all cases.
Requests will return the site's encoding.
>>> import requests

>>> url = 'http://CNN.com'
>>> response = requests.get(url)
>>> response.encoding
'utf-8' 
Parsing would look like this.
import requests
from bs4 import BeautifulSoup

url = 'http://CNN.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
Output:
CNN International - Breaking News, US News, World News and Video
from bs4 import BeautifulSoup

html = '''\
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Page Title</title>
</head>
<body>
  <h1>This is a Heading</h1>
  <p>Relógio feminino dourado</p>
</body>'''

soup = BeautifulSoup(html, 'lxml')
label = soup.find('p')
>>> label
<p>Relógio feminino dourado</p>

>>> label.text
'Relógio feminino dourado'
Remember that once you have the correct value in Python 3, it's Unicode, since all strings in Python 3 are of this type.
It does not matter whether the data came from utf-8, latin-1, etc., as long as it looks correct in Python 3.
After that, going in and out of Python or the server, always use utf-8.
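As a minimal sketch of "always use utf-8 going in and out", the example below names the encoding explicitly at every boundary instead of relying on the platform default. The file name label.txt and the temporary directory are placeholders chosen for illustration.

```python
import os
import tempfile

text = "Rel\u00f3gio feminino dourado"  # 'Relógio feminino dourado'

# Encode explicitly when data leaves Python as bytes.
raw = text.encode("utf-8")

# Decode explicitly when bytes come back in.
decoded = raw.decode("utf-8")

# Same for files: pass encoding= rather than trusting the OS default,
# which can differ between Windows, a desktop Ubuntu and a server.
path = os.path.join(tempfile.mkdtemp(), "label.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()
```

With the encoding stated on both sides, the same script should read and write identical text regardless of which machine it runs on.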

Here is my source code.
It works on my Windows machine.
It works on Ubuntu installed in VMware on my Windows machine.
It doesn't work on my servers.

# -*- coding: utf-8 -*-

# Only the imports this script actually uses
from seleniumwire import webdriver
import ftfy

# Headless Chrome configuration for the server
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-certificate-errors-spki-list')
options.add_argument('--ignore-ssl-errors')
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")  # Chrome expects width,height
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
chrome_path = '/var/www/chromedriver'
driver = webdriver.Chrome(chrome_path,  options=options)

driver.get("http://www.correios.com.br/solucoes-empresariais/correios-facil")
driver.implicitly_wait(10)
a = driver.find_elements_by_css_selector("p")
for i in a:
    s = i.text
    s = ftfy.fix_encoding(s)
    print(s)

driver.quit()

PS: It shows the correct encoding in the browser; the problem is on the console / terminal. I need to match the encoded strings, so I need it to work in files and on the terminal.
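Since the text looks right in the browser but wrong on the terminal, one thing worth ruling out is the output encoding Python picks up from the server's locale. The sketch below only reports the relevant settings; whether the locale is the actual cause here is an assumption, not something confirmed in this thread. The function name describe_io_encoding is hypothetical.

```python
import locale
import os
import sys


def describe_io_encoding():
    """Collect the settings that decide how print() encodes text."""
    return {
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


info = describe_io_encoding()
for name, value in info.items():
    print(f"{name}: {value}")

# Python 3.7+ only: force UTF-8 output regardless of the locale.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")
```

Running this on the working machine and on the server and comparing the output would show whether the two environments differ. If the server reports something other than utf-8 for stdout, setting PYTHONIOENCODING=utf-8 or a UTF-8 locale before starting the script is a common remedy.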
