xml decoding failure(bs4)

roughstroke · May-09-2020, 04:10 PM

I've been trying to build a web crawler, though I'm new to this.

And I've just run into something annoying:

import urllib.request as req
from bs4 import BeautifulSoup
import lxml  # not used directly; the 'xml' feature of BeautifulSoup relies on it

'''
file=open('datacom.txt','r')
xml=file.read()
'''
website = req.Request(url='http://comment.bilibili.com/182148299.xml')
dataset = req.urlopen(website).read()  # raw response body as bytes
print(dataset)
soup = BeautifulSoup(dataset, features='xml')
print('-------------------------', '\n', soup)

And the output looks like this:

Output:b'\x8c\x92\xbdn\xdc8\x10\xc7\xfb}\nB\x80;Z\xcb\xe1\xd7p\x00I\xee\xee\t|\xf5A+qm\xc1\xbb\xd2aE\x1b{8ly\x87\x03\x0eW\\\x00\x17\xa9R\x06\xa9R\x04)l\xf8qb;\xa9\xf2\n\x81\xa4\xfd\xb0\xd7v\xe0\x05\x16\x18\xfe\x87\xfc\xf1G\x91\xc9\xd1r>c\x17~\xd1VM\x9dF\x10\x8b\x88\xf9\xbah\xca\xaa>I\xa3_\x8f\x7f9t\xd1Q\x96TYR\x9c\xe6\xa1\xf5\x8b\x0b\xbf\xc8\xba2\x9eT\xb3\xaa\xfb\xc7E3O\xc6\x0f\xba\xfd\xcc\xaa\xcc\xc0)C\xa8\xb5\x1b\xbaU\x99%\xf3\xaa\xed\xf6\xc9D2\xde\x94\xc9<_\xce\xaay\x152\x10]\xbc\x19%m\xc8\x83\xeff\x0eE\xb2\xf0\xf9\xec\xb7:\x9f\xf7\xd9n\x90\xb4\xcd\xf9\xa2\xf0\xd9\xd9\xe1E2^\xd7I\xc9~O#\x1b\x83\x04!8pi8XD\x94`8\x18\x87\x04\xa0Qs\xc1\xa7\xce\xa2\x86\xb2\xe0\n\x9c\x15\x16\xb4\xb4\x8a\xc8j\tQv\xf7\xcf\xff\xb7\x7f}\xba\xbd\xb9\xba\xbd\xba\xfcr\xfd\xf7\xf7\x9bw\xc9\xb8\\\xa3ul\xd0=\x8f6\x96\x9c\xdaC;\xadQKk\x14\n+1\xca\xa4Rj\x07\xb3\xb1B|\x01\xe6\x10\xdc>\xcc(\xa7PjK\xc6\x18\x13e\x83\xe1\x0e\xa7b\x0b\xcf\x1e\xdb\x91\xb3h\x81\x0bn\xa6RNJSp%\xb5\xd4\n\xb5\x92\xd2\x90B\xc2(\x0b\xbe\r;\x98\x8b\xa50;\x98\xd5\x96@oa(\x9f\xc2\x9c\x10\xce\x81s\xa4\xd4>\x0cb\x85\x0f`\x8f\xcd\x88\x0c\x17\x1c\x1c\x82r4\x1d`\x9aP[GJ\x82\x80\xa70B\xf5\x02\x0c\x8d\xee\xae\xc0O%\xa06b\x80\xa1\xd3ZY\x92BuW\xf0\x18\xa6cr\xf4\x12\xcc\x02r\xc1\x8b\xb2(s \x1a`N*\x89\xd6\x12Y\xf3\xe4\x98"\x16\xdd\x8f\xd3>\x0cI\x18\xad\x88\xcbG\xf7i:EcH\n\xd2HQV\xfa)\x9b\x9c\x87\xd0\xd4,4g\xe1\xcf\x11c\x8c\x05\xbf\x0cit\xff\xe1\xfa\xdb\xdb\xff\xcc\xfd\xfb7Q\x9f.S#\x0e\xfa\xea\x8fm\x15\xf2\xc5\x89\x0f,e\xad\xf7g\xc3\xe2>\xae\xe6\x9e\xa5\xcc\xb4}\xb2\x1a\xad^\xe7k\xbbO\xbc\xefK\x02\x08\xa5\x94Vt\xef\xaf\xf3\xed\xf4X>\xecV4u\xf0ug\x10\x1d7l\xe2\xfb\xa0\xaa\xcf}\xb9\x91\x06\xb1\xb5Fs0Z\x8dZ\xbf]\xbdd)\xdbtY\xca\xfa>\x83v\x14N}\xcd\x86y+\xa6\xda\xd7\xd9\x93\x14z\xdf\x9e\xfaGJh\x1c9\xf8\xa9\xfd\xdd\xe7\x7f\xbf~\xbc\\K\xb3\x94i\xb3\xf3\xd2\x9d\xf6\xda{\xc5\xcc\xe03\xae\xb2\x1f\x01\x00\x00\xff\xff'
-------------------------
<?xml version="1.0" encoding="utf-8"?>

But it should actually look like this:

Quote:<?xml version="1.0" encoding="UTF-8"?><i><chatserver>chat.bilibili.com</chatserver><chatid>182148299</chatid><mission>0</mission><maxlimit>1500</maxlimit><state>0</state><real_name>0</real_name><source>e-r</source><d p="4.47900,1,25,16777215,1588986351,0,5f22bd5c,32424159570558983">test</d><d p="10.16400,1,25,16646914,1588986375,0,5f22bd5c,32424172229492743">test</d></i>

I've also tried to decode it as 'utf-8':

    def getmark(self):
        # Build the comment URL from this object's cid
        loc = ''.join(['http://comment.bilibili.com/', self.cid, '.xml'])
        print(loc)
        website = req.Request(url=loc)
        dataset = req.urlopen(website).read()  # raw response body as bytes
        info = dataset.decode('UTF-8')  # this is the line that raises the error
        print(info)
        soup = BeautifulSoup(dataset, features="xml")
        print(soup)

and then I get this error:

Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 1: invalid start byte

When I retry it with

info=dataset.decode('UTF-8',errors='ignore')

almost everything gets dropped:

Output:7nq1~ee("쳬M'  
<?xml version="1.0" encoding="utf-8"?>

I can tell the website is encoded with 'utf-8', not 'gbk', because I've already checked it with the browser command 'document.charset'.
I may have phrased a lot of this badly because I know basically nothing about the terminology, sorry.
If you have any idea, please leave a comment. I really need your help :p

***snippsat*** · (This post was last modified: May-09-2020, 04:37 PM by snippsat.)

Use Requests; then you get the correct website encoding and don't have to guess.

>>> import requests
>>> 
>>> url='http://comment.bilibili.com/182148299.xml'
>>> website = requests.get(url)
>>> website.encoding
'ISO-8859-1'

Together with BeautifulSoup it looks like this.

import requests
from bs4 import BeautifulSoup
import lxml

url = 'http://comment.bilibili.com/182148299.xml'
website = requests.get(url)
dataset = website.content
soup = BeautifulSoup(dataset, features='xml')
print(soup)

Output:<?xml version="1.0" encoding="utf-8"?>
<i><chatserver>chat.bilibili.com</chatserver><chatid>182148299</chatid><mission>0</mission><maxlimit>1500</maxlimit><state>0</state><real_name>0</real_name><source>k-v</source><d p="10.16400,1,25,16646914,1588986375,0,5f22bd5c,32424172229492743">test</d><d p="4.47900,1,25,16777215,1588986351,0,5f22bd5c,32424159570558983">test</d></i>
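
A side note on why plain urllib returned unreadable bytes: the dump in the question ends in `\x00\x00\xff\xff`, which is what a raw deflate stream looks like after a sync flush, so the body is most likely compressed rather than mis-encoded. Requests undoes `Content-Encoding` compression automatically, while `urllib.request` hands back the bytes as sent. If you want to stay with urllib, a minimal sketch could look like this. It assumes the server labels the body with a `Content-Encoding` header of `deflate` or `gzip` (not verified for this URL), and `decode_body` and `fetch_xml` are hypothetical helper names, not part of any library.

```python
import gzip
import urllib.request
import zlib


def _inflate(raw, wbits):
    """Decompress with a decompressobj, which tolerates an unfinished stream."""
    decompressor = zlib.decompressobj(wbits)
    return decompressor.decompress(raw) + decompressor.flush()


def decode_body(raw, content_encoding):
    """Undo HTTP compression on a response body (hypothetical helper)."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(raw)
    if encoding == 'deflate':
        try:
            # Officially "deflate" means a zlib-wrapped stream ...
            return _inflate(raw, zlib.MAX_WBITS)
        except zlib.error:
            # ... but some servers send raw deflate without the zlib header.
            # A negative wbits value tells zlib not to expect a header.
            return _inflate(raw, -zlib.MAX_WBITS)
    # No (or unknown) Content-Encoding: assume the body is not compressed.
    return raw


def fetch_xml(url):
    """Download url with urllib and return the decompressed bytes."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        return decode_body(raw, response.headers.get('Content-Encoding'))


# Usage (needs network access, so it is left commented out):
# from bs4 import BeautifulSoup
# dataset = fetch_xml('http://comment.bilibili.com/182148299.xml')
# soup = BeautifulSoup(dataset, features='xml')
# print(soup)
```

Passing the decompressed bytes (rather than a decoded string) to BeautifulSoup leaves it to the parser to work out the encoding, for example from the XML declaration.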
