Problem with character sets

Pedroski55 · Mar-03-2019, 12:31 AM

Yesterday I asked about getting data from a webpage, got some good advice and had a little success. However, there is a problem with the character sets.

If I look at the source code of the starting webpage, it has:

Quote:<meta http-equiv=Content-Type content=text/html;charset=gb2312>

From the source code view, I clicked my way through the links until I found what I wanted. I presume all subordinate webpages are then also GB2312.

I got the first set of data, about 466 lines, with:

Quote:line = soup.find('table').text

When I saved the data I retrieved as a text file, what should be Chinese characters, which I can see in Firefox, end up looking like hieroglyphics in the text file I save (small sample here):

Quote:×¨Òµ
ÆÚÊý
ÐÕÃû
ÐÔ±ð
ÊÖ»úºÅÂë
Éí·ÝÖ¤ºÅ
µÇÂ½ÃÜÂë
Ñ§ºÅ
²é¿´
ÐÞ¸Ä
É¾³ý

Numbers display correctly.

How can I:
A. convert line directly to UTF-8 or
B. tell Python to write this data to a text file encoded GB2312?

I tried the Linux command-line tool iconv on the text file, but just got errors, the same as with utf8trans.

Thanks for any tips!

Pedroski55 · Mar-03-2019, 11:54 PM

Found an answer for Python:

>>> data = '»Æ¹ûÊ÷'
>>> data.encode('latin1').decode('gb2312')
'黄果树'
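For anyone wondering why this round trip works, here is a minimal sketch. It assumes the page's GB2312 bytes were decoded as Latin-1 somewhere along the way (the thread does not confirm where that happened). Latin-1 maps every byte to exactly one character, so nothing is lost and the mistake can be reversed. The variable names are only for illustration.

```python
# Assumption: the GB2312 bytes were mis-decoded as Latin-1 at some point.
original = '黄果树'

raw = original.encode('gb2312')   # the bytes a GB2312 page would contain
garbled = raw.decode('latin1')    # wrong codec: one character per byte
print(garbled)

# Latin-1 maps bytes 0-255 to code points 0-255, so encoding back
# recovers the exact original bytes, which then decode correctly.
repaired = garbled.encode('latin1').decode('gb2312')
print(repaired)
```

If the text had been mis-decoded with some other codec, the first argument would have to change to match, so treat 'latin1' here as a guess that happens to fit the sample.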

***snippsat*** · (This post was last modified: Mar-04-2019, 12:25 AM by snippsat.)

Pedroski55 Wrote:When I saved the data I retrieved as a text file, what should be Chinese characters,

You have to be careful to keep the text as Unicode and to use utf-8 when taking text out of Python 3.
For example, Requests and BeautifulSoup will keep the correct encoding from a website.

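from bs4 import BeautifulSoup
import requests

url = 'http://www.sohu.com'
url_get = requests.get(url)
# Pass the raw bytes (.content) so BeautifulSoup detects the encoding itself
soup = BeautifulSoup(url_get.content, 'lxml')
# Links inside the first <p> of div.news
text = soup.select('div.news > p:nth-child(1) > a')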

Test:

>>> text
[<a data-param="&amp;_f=index_cpc_0" href="http://www.sohu.com/a/298818150_428290?g=0?code=36b1c5f548e7c32034c382e96f3e401" target="_blank" title="全国政协十三届二次会议在京开幕">全国政协十三届二次会议在京开幕</a>]

>>> text[0].attrs['title']
'全国政协十三届二次会议在京开幕'
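As a rough illustration of the idea of starting from the raw bytes, the sketch below finds the charset declared in a meta tag and decodes the bytes with it. This is not how BeautifulSoup is implemented. The helper sniff_meta_charset and the sample page are made up for the example, and only the standard library is used.

```python
import re

def sniff_meta_charset(raw: bytes, default: str = 'utf-8') -> str:
    """Return the charset named in a charset=... declaration, else default."""
    match = re.search(rb'charset\s*=\s*["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    return match.group(1).decode('ascii').lower() if match else default

# Hypothetical page bytes, standing in for url_get.content
raw = ('<meta http-equiv=Content-Type content=text/html;charset=gb2312>'
       '<table><tr><td>姓名</td></tr></table>').encode('gb2312')

charset = sniff_meta_charset(raw)
html = raw.decode(charset, errors='replace')  # now a normal Unicode str
print(charset)
print(html)
```

Once the bytes are decoded with the right codec, the result is an ordinary str and the source encoding no longer matters.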

When saving to disk I do not need to use gb2312; I always use utf-8 once the Unicode text shows correctly in Python 3.
Improved Unicode handling was one of the biggest changes in moving from Python 2 to 3.

# Write to disk
ch = '全国政协十三届二次会议在京开幕'
with open('ch.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(ch)

# Read from disk
with open('ch.txt', encoding='utf-8') as f:
    print(f.read())

In and out still correct:

Output:
全国政协十三届二次会议在京开幕
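For option B from the original question (writing the file as GB2312 instead), the same pattern should work with a different encoding argument. The sketch below assumes every character in the text exists in GB2312; the scratch path is hypothetical.

```python
import os
import tempfile

text = '全国政协十三届二次会议在京开幕'

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'ch_gb2312.txt')

# Write and read back with the same explicit encoding.
with open(path, 'w', encoding='gb2312') as f_out:
    f_out.write(text)

with open(path, encoding='gb2312') as f_in:
    read_back = f_in.read()

# GB2312 stores each Chinese character in 2 bytes, UTF-8 in 3.
gb_size = os.path.getsize(path)
utf8_size = len(text.encode('utf-8'))
print(read_back, gb_size, utf8_size)
```

A character outside GB2312 would raise UnicodeEncodeError on write, which is one reason utf-8 is usually the safer choice for saved files.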

Pedroski55 · Mar-04-2019, 02:09 AM

Thanks again!

I did not set any encoding for my text file. I just did:

Quote:line = soup.find('table').text

then:

Quote:file = open(path + 'page1', 'w')
file.write(line)
file.close()

I think that line was encoded GB2312, but I am not really sure. I also think, Python saves as UTF-8 by default. So I had a text encoded GB2312, saved as UTF-8. A mess!

This works, I can see the Chinese.

Quote:data = '»Æ¹ûÊ÷'
data.encode('latin1').decode('gb2312')
'黄果树'

How should I actually save line??

***snippsat*** · Mar-04-2019, 02:35 AM

(Mar-04-2019, 02:09 AM)Pedroski55 Wrote: file = open(path + 'page1', 'w')

You have to set the encoding like I did in my post, and use with open(); then there is no need to call close().

file = open(path + 'page1', 'w', encoding='utf-8')

Quote:I also think, Python saves as UTF-8 by default

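No, this can fail in many ways depending on the environment (editor, terminal, etc.) or the OS, because open() falls back to the locale's preferred encoding when none is given.
So always set the encoding, especially if you get a wrong result.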
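One way to see what a bare open() would pick on a given machine is to ask Python directly. This is only a sketch: the result depends on the OS, the locale settings and the Python version, so no particular output is implied. The scratch path is hypothetical.

```python
import codecs
import locale
import os
import tempfile

# What the locale reports as its preferred text encoding.
default_encoding = locale.getpreferredencoding(False)

# What open() actually chose when no encoding argument was passed.
path = os.path.join(tempfile.mkdtemp(), 'default.txt')
with open(path, 'w') as f:
    file_encoding = f.encoding

# Normalise the codec name so different spellings compare equal.
normalized = codecs.lookup(file_encoding).name
print(default_encoding, file_encoding, normalized)
```

If the printed name is not a UTF-8 codec, that is the case where passing encoding='utf-8' explicitly makes a difference.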

Quote:How should I actually save line??

That should probably not be needed if you set the encoding as shown above.
Otherwise it is basic stuff: store the result in a variable, then save and read it as I showed in my post.

data = '»Æ¹ûÊ÷'
text_ch = data.encode('latin1').decode('gb2312')
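If a file has already been saved in its garbled form, the same round trip can be applied to the whole file and the result written out as utf-8. The sketch below assumes the garbled file was saved as utf-8 and that the original mistake was a Latin-1 decode of GB2312 bytes. The function repair_file and the file names are made up for the example.

```python
import os
import tempfile

def repair_file(src, dst, wrong='latin1', right='gb2312'):
    """Undo a wrong decode and save the corrected text as UTF-8."""
    with open(src, encoding='utf-8') as f_in:
        garbled = f_in.read()
    # Back to the original bytes, then decode with the right codec.
    fixed = garbled.encode(wrong).decode(right)
    with open(dst, 'w', encoding='utf-8') as f_out:
        f_out.write(fixed)
    return fixed

# Demo in a scratch directory, using the sample string from this thread.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'page1')
dst = os.path.join(workdir, 'page1_fixed.txt')
with open(src, 'w', encoding='utf-8') as f:
    f.write('»Æ¹ûÊ÷\n12345\n')

result = repair_file(src, dst)
print(result)
```

The encode step raises UnicodeEncodeError if the file contains characters outside Latin-1, and the decode step raises UnicodeDecodeError on bytes that are not valid GB2312, so a failure here suggests the assumptions do not match the file.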
