Pedroski55 Wrote:
When I saved the data I retrieved as a text file, what should be Chinese characters,

You have to be careful to keep Unicode intact: use utf-8 when you take text out of Python 3.
For example, Requests and BeautifulSoup will keep the correct encoding from a website.
from bs4 import BeautifulSoup
import requests

url = 'http://www.sohu.com'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
text = soup.select('div.news > p:nth-child(1) > a')

Test:
>>> text
[<a data-param="&_f=index_cpc_0" href="http://www.sohu.com/a/298818150_428290?g=0?code=36b1c5f548e7c32034c382e96f3e401" target="_blank" title="全国政协十三届二次会议在京开幕">全国政协十三届二次会议在京开幕</a>]
>>> text[0].attrs['title']
'全国政协十三届二次会议在京开幕'

When saving to disk I do not need to use gb2312. When the Unicode text shows correctly in Python 3, always use utf-8. The Unicode improvement was one of the biggest changes in moving from Python 2 to 3.
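As a rough illustration of why the codec only matters at the boundary: inside Python 3 the text is a plain str, and utf-8 or gb2312 only comes into play when that str is turned into bytes. The sketch below is my own addition, not part of the scraping code above; the variable names utf8_bytes and gb_bytes are made up for the example.

```python
# Illustrative sketch: the same Python 3 str encoded with two different codecs.
ch = '全国政协十三届二次会议在京开幕'

utf8_bytes = ch.encode('utf-8')   # these CJK characters take 3 bytes each in utf-8
gb_bytes = ch.encode('gb2312')    # and 2 bytes each in gb2312

print(len(ch), len(utf8_bytes), len(gb_bytes))

# Decoding with the matching codec returns the original str in both cases.
print(utf8_bytes.decode('utf-8') == ch)
print(gb_bytes.decode('gb2312') == ch)
```

Both byte strings are valid, they are just different. Problems only start when bytes written with one codec are read back with another, which is why sticking to utf-8 everywhere is the simplest rule.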
# Write to disk
ch = '全国政协十三届二次会议在京开幕'
with open('ch.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(ch)

# Read from disk
with open('ch.txt', encoding='utf-8') as f:
    print(f.read())

In and out still correct:
Output:全国政协十三届二次会议在京开幕
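
For contrast, here is a small sketch of one common way Chinese text ends up garbled: the file is written with one codec and read back with another. This is only an assumption about what can go wrong, not a diagnosis of the original script. It uses a temporary directory so it does not touch ch.txt, and the names same and wrong are made up for the example.

```python
import tempfile
from pathlib import Path

ch = '全国政协十三届二次会议在京开幕'

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'ch.txt'
    path.write_text(ch, encoding='utf-8')

    # Matching codec: the round trip is lossless.
    same = path.read_text(encoding='utf-8')

    # Mismatched codec: the result is either garbled text or a UnicodeDecodeError.
    try:
        wrong = path.read_text(encoding='gb2312')
    except UnicodeDecodeError:
        wrong = None

print(same == ch)    # True
print(wrong == ch)   # False
```

So if a saved file looks wrong, it can be worth checking that the same encoding= argument is passed to open() on both the write and the read side, and that the editor used to view the file is also set to utf-8.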