Mar-03-2019, 12:31 AM
Yesterday I asked about getting data from a webpage, got some good advice and had a little success. However, there is a problem with the character sets.
If I look at the source code of the starting webpage, it has:
Quote:<meta http-equiv=Content-Type content=text/html;charset=gb2312>
From the source code view, I clicked my way through the links until I found what I wanted. I presume all the pages linked from the starting page are also GB2312.
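Rather than presuming, the declared charset of each page could be checked on the raw bytes before decoding them. This is only a rough sketch using the standard library: `declared_charset` is a made-up helper name, and `raw_page` is a stand-in built by hand here, not the real page.

```python
import re

def declared_charset(raw, default="utf-8"):
    """Return the charset named in the first 2 KB of raw HTML bytes, if any."""
    # Matches charset=gb2312 with or without quotes around the value.
    match = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', raw[:2048], re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default

# Stand-in for the bytes a download would return (assumption: the server
# sends GB2312 bytes, as the meta tag claims).
raw_page = (
    "<html><head><meta http-equiv=Content-Type content=text/html;charset=gb2312>"
    "</head><body><table><tr><td>姓名</td></tr></table></body></html>"
).encode("gb2312")

encoding = declared_charset(raw_page)
html = raw_page.decode(encoding, errors="replace")  # now a normal Python str
print(encoding)
```

If the bytes are decoded with the declared charset before they reach the parser, the text pulled out of the table should already be ordinary Unicode.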
I got the first set of data, about 466 lines, with:
Quote:line = soup.find('table').text
When I save the retrieved data as a text file, the Chinese characters, which display correctly in Firefox, come out looking like hieroglyphics (small sample here):
Quote:רҵ
ÆÚÊý
ÐÕÃû
ÐÔ±ð
ÊÖ»úºÅÂë
Éí·ÝÖ¤ºÅ
µÇ½ÃÜÂë
ѧºÅ
²é¿´
ÐÞ¸Ä
ɾ³ý
Numbers display correctly.
How can I:
A. convert the string in line directly to UTF-8, or
B. tell Python to write this data to a text file encoded as GB2312?
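For reference, here is a minimal sketch of what both options might look like. It rests on an assumption I cannot confirm from the above: that the GB2312 bytes were decoded as Latin-1 somewhere along the way, which would explain characters like Ð and Õ. `fix_mojibake` and the file names are made up for illustration. If the wrong decoding was something else (cp1252, for example), the encode step may raise UnicodeEncodeError.

```python
import os
import tempfile

def fix_mojibake(text, wrong="latin-1", right="gb2312"):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return text.encode(wrong).decode(right)

sample = "ÐÕÃû"              # one of the garbled column headers from the sample
fixed = fix_mojibake(sample)  # a proper Python str holding Chinese characters

out_dir = tempfile.mkdtemp()  # scratch directory so the sketch is self-contained
utf8_path = os.path.join(out_dir, "table_utf8.txt")
gb_path = os.path.join(out_dir, "table_gb2312.txt")

# Option A: write the repaired text as UTF-8.
with open(utf8_path, "w", encoding="utf-8") as f:
    f.write(fixed)

# Option B: write the same text encoded as GB2312.
with open(gb_path, "w", encoding="gb2312") as f:
    f.write(fixed)

print(fixed)
```

Either way, the key step is passing encoding= to open(); without it Python falls back to the platform default, which may not be what the text viewer expects.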
I tried the Linux command-line tool iconv on the text file, but I just get errors; the same happens with utf8trans.
Thanks for any tips!