how to read chinese character?

how to read chinese character? - kucingkembar - Aug-25-2022

Hi, I am trying to translate a Chinese page into English,
but the text I get back is gibberish.
How do I "convert" it to Chinese?

# importing the modules
import requests
from bs4 import BeautifulSoup
 
# target url
url = "https://www.boshisw.com/boshi/14_14309/"
 
# making requests instance
reqs = requests.get(url)
 
# using the BeautifulSoup module
soup = BeautifulSoup(reqs.text, 'html.parser')
 
# displaying the title
print("Title of the website is : ")
for title in soup.find_all('title'):
    print(title.get_text())

Error:
Title of the website is : 
ÎÒÔÚÔÊ¼Éç»áµ±´å³¤×îÐÂÕÂ½ÚÁÐ±í_ÎÒÔÚÔÊ¼Éç»áµ±´å³¤×îÐÂÕÂ½ÚÄ¿Â¼_²©ÊËÊéÎÝ

Thank you for reading, have a nice day.

RE: how to read chinese character? - snippsat - Aug-25-2022

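Change reqs.text to reqs.content.
This means Bs4 is given raw bytes and works out the encoding itself. With reqs.text, Requests decodes the page first, and if it picks the wrong encoding the text is already garbled before Bs4 sees it.
See Encodings in the Beautiful Soup documentation: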

Quote:Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode:
Unicode, Dammit guesses correctly most of the time.
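
To see why the original output looked the way it did, here is a small offline sketch that needs no network access. It assumes the page is served as GBK and was decoded as ISO-8859-1; that is an inference from the garbled title above, not something confirmed for this site. The variable names are only for illustration.

```python
# The start of the page title as it should appear
title = "我在原始社会当村长"

# Assumption: the server sends the page as GBK-encoded bytes
raw = title.encode("gbk")

# Decoding those bytes with the wrong codec (ISO-8859-1) gives gibberish
garbled = raw.decode("iso-8859-1")
print(garbled)

# ISO-8859-1 maps every byte to one character, so the mistake is reversible:
# re-encode to get the original bytes back, then decode them as GBK
repaired = garbled.encode("iso-8859-1").decode("gbk")
print(repaired)
```

The garbled string starts with the same "ÎÒÔÚ" seen in the question, which is what suggests a GBK page read as ISO-8859-1. Passing reqs.content avoids the wrong decode in the first place.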

# importing the modules
import requests
from bs4 import BeautifulSoup

# target url
url = "https://www.boshisw.com/boshi/14_14309/"

# making requests instance
reqs = requests.get(url)

# using the BeautifulSoup module
soup = BeautifulSoup(reqs.content, 'html.parser')
print(type(soup))

# displaying the title
print("Title of the website is : ")
for title in soup.find_all('title'):
    print(title.get_text())

Output:
我在原始社会当村长最新章节列表_我在原始社会当村长最新章节目录_博仕书屋
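
As a rough illustration of what "working out the encoding from the bytes" can involve, the sketch below looks for a charset declared in a meta tag and decodes with it. sniff_charset is a hypothetical helper written for this post, not part of Requests or Beautiful Soup, and the sample page is made up; Beautiful Soup's real detection is more thorough, so in practice just pass reqs.content as shown above.

```python
import re


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    # Only scan the first 2048 bytes, where <meta> declarations normally sit
    match = re.search(
        rb'<meta[^>]+charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE
    )
    return match.group(1).decode("ascii") if match else default


# Made-up sample page, encoded as GBK to mimic the bytes a server might send
page = '<html><head><meta charset="gbk"><title>博仕书屋</title></head></html>'.encode("gbk")

charset = sniff_charset(page)   # charset declared inside the document
text = page.decode(charset)     # decode the raw bytes with that charset
print(charset, text)
```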

RE: how to read chinese character? - kucingkembar - Aug-25-2022

Thank you, I had been looking for this for hours.
I gave you a reputation point.