 Problem with character sets
Yesterday I asked about getting data from a webpage, got some good advice and had a little success. However, there is a problem with the character sets.

If I look at the source code of the starting webpage, it has:

Quote:<meta http-equiv=Content-Type content=text/html;charset=gb2312>

From the source code view, I clicked my way through the links until I found what I wanted. I presume all subordinate webpages are then also GB2312.

I got the first set of data, about 466 lines, with:

Quote:line = soup.find('table').text

When I saved the data I retrieved as a text file, what should be Chinese characters, which I can see in Firefox, end up looking like hieroglyphics in the text file I save (small sample here):


Numbers display correctly.

How can I:
A. convert line directly to UTF-8 or
B. tell Python to write this data to a text file encoded GB2312?

I tried Linux command line iconv on the text file, but just get errors, same as with utf8trans.

Thanks for any tips!
Found an answer for Python:

data = '»Æ¹ûÊ÷'
Pedroski55 Wrote:When I saved the data I retrieved as a text file, what should be Chinese characters,
You have to be careful to keep text as Unicode and use utf-8 when taking text out of Python 3.
For example, Requests and BeautifulSoup will keep the correct encoding from a website.
from bs4 import BeautifulSoup
import requests

url = ''
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
text = soup.select(' > p:nth-child(1) > a')  # start of the CSS selector is missing in the source
>>> text
[<a data-param="&amp;_f=index_cpc_0" href="" target="_blank" title="全国政协十三届二次会议在京开幕">全国政协十三届二次会议在京开幕</a>]

>>> text[0].attrs['title']
'全国政协十三届二次会议在京开幕'
When saving to disk I do not need to use gb2312; always use utf-8 when the Unicode text shows correctly in Python 3.
Improved Unicode handling was one of the biggest changes in moving from Python 2 to 3.
# Write to disk
ch = '全国政协十三届二次会议在京开幕'
with open('ch.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(ch)

# Read from disk
with open('ch.txt', encoding='utf-8') as f:
    print(f.read())
In and out, still correct.
Thanks again!

I did not set any encoding for my text file. I just did:

Quote:line = soup.find('table').text


Quote:file = open(path + 'page1', 'w')

I think that line was encoded as GB2312, but I am not really sure. I also think Python saves as UTF-8 by default. So I had text encoded as GB2312, saved as UTF-8. A mess!
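
A minimal sketch of how this kind of garbling can arise, assuming the page bytes were GB2312 but were decoded as Latin-1 (ISO-8859-1) somewhere along the way. The sample string is the one from this thread; the variable names are only illustrative.

```python
# Assumption: the server sent GB2312 bytes, but they were decoded as Latin-1.
original = '黄果树'

# The bytes a GB2312 page would contain for this text (two bytes per character).
raw_bytes = original.encode('gb2312')

# Decoding as Latin-1 never raises an error: every byte maps to some
# character, just not the intended one.
garbled = raw_bytes.decode('latin1')
print(garbled)

# Decoding with the matching codec gives the Chinese text back.
print(raw_bytes.decode('gb2312'))
```

If that is what happened, the text was already wrong before it was written, so the file encoding alone would not be the cause.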

This works, I can see the Chinese.

Quote:data = '»Æ¹ûÊ÷'

How should I actually save line??
(Mar-04-2019, 02:09 AM)Pedroski55 Wrote: file = open(path + 'page1', 'w')
You have to set the encoding like I did in my post, and use with open, then there is no need to call close().
file = open(path + 'page1', 'w', encoding='utf-8')
Quote:I also think, Python saves as UTF-8 by default
No, this can fail in many ways depending on the environment (editor, terminal, etc.) or OS,
so always set the encoding, especially if you get a wrong result.
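
To see which default your own environment would typically pick, a small sketch like this may help. It uses only the standard library; the file name `demo.txt` and the temporary directory are made up for the example.

```python
import locale
import os
import tempfile

# The encoding open() typically falls back to when none is given;
# the value depends on the OS and locale settings.
print(locale.getpreferredencoding(False))

text = '全国政协十三届二次会议在京开幕'
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # hypothetical file name

# Passing encoding explicitly makes the result independent of the environment.
with open(path, 'w', encoding='utf-8') as f_out:
    f_out.write(text)

with open(path, encoding='utf-8') as f_in:
    read_back = f_in.read()

print(read_back == text)
```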

Quote:How should I actually save line??
That should not be needed if you set the encoding as shown above.
Basic stuff: you use a variable, then e.g. save/read it as I showed in my post.
data = '»Æ¹ûÊ÷'
text_ch = data.encode('latin1').decode('gb2312')
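
Putting both parts together, a hedged sketch: repair the garbled string with the encode/decode step, then write it out with an explicit encoding. This assumes the text really was GB2312 misread as Latin-1; `page1.txt`, `page1_gb.txt` and the temporary directory are placeholder names, not from the original posts.

```python
import os
import tempfile

data = '»Æ¹ûÊ÷'

# Undo the wrong decode: go back to the raw bytes, then decode as GB2312.
text_ch = data.encode('latin1').decode('gb2312')
print(text_ch)

out_dir = tempfile.mkdtemp()  # placeholder location for the example

# Option A: save as UTF-8.
utf8_path = os.path.join(out_dir, 'page1.txt')
with open(utf8_path, 'w', encoding='utf-8') as f_out:
    f_out.write(text_ch)

# Option B: save as GB2312.
gb_path = os.path.join(out_dir, 'page1_gb.txt')
with open(gb_path, 'w', encoding='gb2312') as f_out:
    f_out.write(text_ch)

# Read each file back with the same encoding it was written with.
with open(utf8_path, encoding='utf-8') as f:
    from_utf8 = f.read()
with open(gb_path, encoding='gb2312') as f:
    from_gb = f.read()
```

Either file is fine as long as it is read back with the encoding it was written in.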