Python Forum
Encoding issue using requests
#1
Hi all,
I'm trying to scrape some information from a site with the code below. I'm getting this:
['Gesellschafter/in', u'Vorsitzende/r der Gesch\xe4ftsf\xfchrung']
and expect it to be:
['Gesellschafter/in', u'Vorsitzende/r der Geschäftsführung']

import requests
from lxml import html

CHKURL	= "http://www.monetas.ch/htm/653/de/Aktuelles-Management.htm?subj=2519858"
XPATH	= ".//*[@id='content']/table/tbody/tr/td[2]//text()"

def urlparse(url):
	url = url.strip()
	response = requests.get(url)
	parsed = html.fromstring(response.text)
	return parsed
	
xp = urlparse(CHKURL).xpath(XPATH)
print xp
Where am I going wrong?

Thanks in advance.
#2
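Nothing is wrong; this is just how Python 2 displays Unicode strings inside a list. Printing a list shows the repr() of each element, so non-ASCII characters appear as escape sequences.
Print the element itself and it displays correctly.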
>>> lst = ['Gesellschafter/in', u'Vorsitzende/r der Gesch\xe4ftsf\xfchrung']
>>> print(lst[1])
Vorsitzende/r der Geschäftsführung
Python 3 made big changes to Unicode handling, and you should use Python 3 rather than 2.
In Python 3 the output looks like this.
Output:
['Gesellschafter/in', 'Vorsitzende/r der Geschäftsführung']
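To make the point above concrete, here is a minimal sketch showing that the escaped and the readable spelling are the same string, and that the escapes are only a display form. The variable names are made up for illustration, and the snippet is written to run under both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative names only; nothing here comes from the scraped site.
escaped = u'Vorsitzende/r der Gesch\xe4ftsf\xfchrung'
literal = u'Vorsitzende/r der Geschäftsführung'

# Both spellings produce the same text object.
same = (escaped == literal)
print(same)  # True

# The backslash form is just an ASCII-safe view of the same characters,
# similar to what Python 2 shows when it prints a list.
escaped_view = escaped.encode('unicode_escape')

# The bytes that end up in a file depend on the encoding you choose.
utf8_bytes = escaped.encode('utf-8')
latin1_bytes = escaped.encode('latin-1')
```

So the data returned by the XPath query is already correct; only the way the list is displayed differs.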
#3
When I print it, it works, but when I write the result to a CSV file it's a mess again.

PS: Thanks for the quick reply! I'll switch to Python 3 in the near future, but for the current project I need to use Python 2.
#4
(Jan-21-2018, 04:39 PM)dmbest Wrote: When I print it works, but when I write result to CSV file it's a mess again.
Always try to use UTF-8 for both input and output when working with files.
Example for Python 2.7:
# -*- coding: utf-8 -*-
import io

lst = ['Gesellschafter/in', u'Vorsitzende/r der Gesch\xe4ftsf\xfchrung']
with io.open('out.csv', 'w', encoding='utf-8') as f:
    f.write(', '.join(lst))
Output:
Gesellschafter/in, Vorsitzende/r der Geschäftsführung
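If you want real CSV quoting rather than a plain join, something along these lines might work. It is only a sketch: write_rows and the file name are hypothetical, and it relies on the fact that the Python 2 csv module expects byte strings, so Unicode cells are encoded to UTF-8 first. The Python 3 branch is there so the same snippet keeps working after a later switch.

```python
# -*- coding: utf-8 -*-
import csv
import io
import sys


def write_rows(path, rows):
    """Write rows to a UTF-8 encoded CSV file (hypothetical helper)."""
    if sys.version_info[0] == 2:
        # Python 2: csv works on bytes, so open in binary mode and
        # encode every unicode cell to UTF-8 before writing.
        with open(path, 'wb') as f:
            writer = csv.writer(f)
            for row in rows:
                writer.writerow([
                    cell.encode('utf-8') if isinstance(cell, unicode) else cell
                    for cell in row
                ])
    else:
        # Python 3: csv works on text; newline='' lets csv control line ends.
        with io.open(path, 'w', encoding='utf-8', newline='') as f:
            csv.writer(f).writerows(rows)


lst = ['Gesellschafter/in', u'Vorsitzende/r der Gesch\xe4ftsf\xfchrung']
write_rows('out.csv', [lst])
```

If the file is meant to be opened in Excel, it may still look garbled there, because Excel often assumes a legacy code page unless the file starts with a byte order mark; that is a viewer issue rather than a problem with the file contents.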
#5
Don't use Python 2.7
Use Python 3.6+

Also, source code is decoded as UTF-8 by default in Python 3:
lst = ['Gesellschafter/in', 'Vorsitzende/r der Geschäftsführung']

Side effect: you can even use German variable names:
verhör = 42
# or Chinese?
谢谢 = 'Danke'
If you open files in text mode, the default encoding depends on the platform and locale: it is usually UTF-8 on Linux and macOS, but often a legacy code page such as cp1252 on Windows. As described before, you can (and should) specify the encoding explicitly.
Sometimes other encodings are used, such as latin1, cp850, etc.
You will very often find CSV files with encodings other than UTF-8.
If you don't know the encoding and hate guessing, take a look at this module: https://ftfy.readthedocs.io/en/latest/
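If you would rather stay with the standard library, one rough approach is to try a short list of candidate encodings in order. This is only a sketch and not what ftfy does: decode_with_fallback and the candidate list are assumptions for illustration, and a wrong codec can decode without raising an error and still give wrong characters, so the result remains a guess.

```python
# -*- coding: utf-8 -*-
def decode_with_fallback(data, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Return (text, encoding_name) for the first codec that decodes data.

    Hypothetical helper: strict UTF-8 is tried first because it rejects
    most non-UTF-8 input; latin-1 accepts any byte and acts as a last resort.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit: %r' % (encodings,))


# Bytes as they might come from a legacy CSV export.
raw_legacy = u'Gesch\xe4ftsf\xfchrung'.encode('latin-1')
text_legacy, used_legacy = decode_with_fallback(raw_legacy)

# Bytes from a UTF-8 file are picked up by the first candidate.
raw_utf8 = u'Gesch\xe4ftsf\xfchrung'.encode('utf-8')
text_utf8, used_utf8 = decode_with_fallback(raw_utf8)
```

Files written with cp850 would need that codec added to the candidate list, and for single-byte encodings there is no reliable way to tell them apart automatically.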
#6
(Jan-22-2018, 12:04 AM)DeaD_EyE Wrote: Don't use Python 2.7
Read what he posted.
(Jan-21-2018, 04:39 PM)dmbest Wrote: But for current project need to use P2.