Python Forum
xml decoding failure(bs4)
#1
I've been trying to build a web crawler, though I'm new to this.

And I've just run into something annoying:

import urllib.request as req
from bs4 import BeautifulSoup
import lxml

'''
file=open('datacom.txt','r')
xml=file.read()
'''
website=req.Request(url='http://comment.bilibili.com/182148299.xml')
dataset=req.urlopen(website).read()
print(dataset)
soup=BeautifulSoup(dataset,features='xml')
print('-------------------------', '\n', soup)
And the output looks like this:

Output:
b'\x8c\x92\xbdn\xdc8\x10\xc7\xfb}\nB\x80;Z\xcb\xe1\xd7p\x00I\xee\xee\t|\xf5A+qm\xc1\xbb\xd2aE\x1b{8ly\x87\x03\x0eW\\\x00\x17\xa9R\x06\xa9R\x04)l\xf8qb;\xa9\xf2\n\x81\xa4\xfd\xb0\xd7v\xe0\x05\x16\x18\xfe\x87\xfc\xf1G\x91\xc9\xd1r>c\x17~\xd1VM\x9dF\x10\x8b\x88\xf9\xbah\xca\xaa>I\xa3_\x8f\x7f9t\xd1Q\x96TYR\x9c\xe6\xa1\xf5\x8b\x0b\xbf\xc8\xba2\x9eT\xb3\xaa\xfb\xc7E3O\xc6\x0f\xba\xfd\xcc\xaa\xcc\xc0)C\xa8\xb5\x1b\xbaU\x99%\xf3\xaa\xed\xf6\xc9D2\xde\x94\xc9<_\xce\xaay\x152\x10]\xbc\x19%m\xc8\x83\xeff\x0eE\xb2\xf0\xf9\xec\xb7:\x9f\xf7\xd9n\x90\xb4\xcd\xf9\xa2\xf0\xd9\xd9\xe1E2^\xd7I\xc9~O#\x1b\x83\x04!8pi8XD\x94`8\x18\x87\x04\xa0Qs\xc1\xa7\xce\xa2\x86\xb2\xe0\n\x9c\x15\x16\xb4\xb4\x8a\xc8j\tQv\xf7\xcf\xff\xb7\x7f}\xba\xbd\xb9\xba\xbd\xba\xfcr\xfd\xf7\xf7\x9bw\xc9\xb8\\\xa3ul\xd0=\x8f6\x96\x9c\xdaC;\xadQKk\x14\n+1\xca\xa4Rj\x07\xb3\xb1B|\x01\xe6\x10\xdc>\xcc(\xa7PjK\xc6\x18\x13e\x83\xe1\x0e\xa7b\x0b\xcf\x1e\xdb\x91\xb3h\x81\x0bn\xa6RNJSp%\xb5\xd4\n\xb5\x92\xd2\x90B\xc2(\x0b\xbe\r;\x98\x8b\xa50;\x98\xd5\x96@oa(\x9f\xc2\x9c\x10\xce\x81s\xa4\xd4>\x0cb\x85\x0f`\x8f\xcd\x88\x0c\x17\x1c\x1c\x82r4\x1d`\x9aP[GJ\x82\x80\xa70B\xf5\x02\x0c\x8d\xee\xae\xc0O%\xa06b\x80\xa1\xd3ZY\x92BuW\xf0\x18\xa6cr\xf4\x12\xcc\x02r\xc1\x8b\xb2(s \x1a`N*\x89\xd6\x12Y\xf3\xe4\x98"\x16\xdd\x8f\xd3>\x0cI\x18\xad\x88\xcbG\xf7i:EcH\n\xd2HQV\xfa)\x9b\x9c\x87\xd0\xd4,4g\xe1\xcf\x11c\x8c\x05\xbf\x0cit\xff\xe1\xfa\xdb\xdb\xff\xcc\xfd\xfb7Q\x9f.S#\x0e\xfa\xea\x8fm\x15\xf2\xc5\x89\x0f,e\xad\xf7g\xc3\xe2>\xae\xe6\x9e\xa5\xcc\xb4}\xb2\x1a\xad^\xe7k\xbbO\xbc\xefK\x02\x08\xa5\x94Vt\xef\xaf\xf3\xed\xf4X>\xecV4u\xf0ug\x10\x1d7l\xe2\xfb\xa0\xaa\xcf}\xb9\x91\x06\xb1\xb5Fs0Z\x8dZ\xbf]\xbdd)\xdbtY\xca\xfa>\x83v\x14N}\xcd\x86y+\xa6\xda\xd7\xd9\x93\x14z\xdf\x9e\xfaGJh\x1c9\xf8\xa9\xfd\xdd\xe7\x7f\xbf~\xbc\\K\xb3\x94i\xb3\xf3\xd2\x9d\xf6\xda{\xc5\xcc\xe03\xae\xb2\x1f\x01\x00\x00\xff\xff' ------------------------- <?xml version="1.0" encoding="utf-8"?>
But it should actually look like this:

Quote:<?xml version="1.0" encoding="UTF-8"?><i><chatserver>chat.bilibili.com</chatserver><chatid>182148299</chatid><mission>0</mission><maxlimit>1500</maxlimit><state>0</state><real_name>0</real_name><source>e-r</source><d p="4.47900,1,25,16777215,1588986351,0,5f22bd5c,32424159570558983">test</d><d p="10.16400,1,25,16646914,1588986375,0,5f22bd5c,32424172229492743">test</d></i>

I've also tried to decode it as 'utf-8':

    def getmark(self):
        loc_list=['http://comment.bilibili.com/',self.cid,'.xml']
        loc=''.join(loc_list)
        print(loc)
        website=req.Request(url=loc)
        dataset=req.urlopen(website).read()
        info=dataset.decode('UTF-8')
        print(info)
        soup=BeautifulSoup(dataset,features="xml")
        print(soup)
and then I get this error:

Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 1: invalid start byte
When I retry it with

info=dataset.decode('UTF-8',errors='ignore')
almost everything gets thrown away:

Output:
7nq1~ee("쳬M' <?xml version="1.0" encoding="utf-8"?>
I can tell the website is encoded as 'utf-8', not 'gbk', because I've already checked it with the browser command 'document.charset'.
I may have made lots of mistakes in my wording because I know basically none of the terminology, sorry.
If you have any idea, please leave a comment. I really need your help :p
#2
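Use Requests. It undoes the HTTP compression for you (the bytes urllib printed look compressed, not wrongly encoded), so you don't have to guess. Note that the encoding it reports comes from the response headers, not from the XML declaration: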
>>> import requests
>>> 
>>> url='http://comment.bilibili.com/182148299.xml'
>>> website = requests.get(url)
>>> website.encoding
'ISO-8859-1'
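To see why neither 'utf-8' nor errors='ignore' helped in post #1, here is a small sketch that uses only the first six bytes of the urllib output quoted there. The variable names are made up for illustration. It suggests the bytes are not text in any charset: strict UTF-8 rejects them, 'ignore' silently drops most of them, and ISO-8859-1 "succeeds" only because it maps every byte to some character.

```python
# First six bytes of the urllib output quoted in post #1.
head = b'\x8c\x92\xbdn\xdc8'

try:
    head.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    # 0x8c cannot start a UTF-8 sequence, so strict decoding fails.
    strict_ok = False

# errors='ignore' drops every byte that is not valid UTF-8,
# which is why only scraps of text survive.
ignored = head.decode('utf-8', errors='ignore')

# ISO-8859-1 never fails: each byte becomes one character,
# so success here says nothing about the real encoding.
latin1 = head.decode('iso-8859-1')

print(strict_ok, repr(ignored), len(latin1))
```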
Together with BeautifulSoup it looks like this:
import requests
from bs4 import BeautifulSoup
import lxml

url = 'http://comment.bilibili.com/182148299.xml'
website = requests.get(url)
dataset = website.content
soup = BeautifulSoup(dataset, features='xml')
print(soup)
Output:
<?xml version="1.0" encoding="utf-8"?> <i><chatserver>chat.bilibili.com</chatserver><chatid>182148299</chatid><mission>0</mission><maxlimit>1500</maxlimit><state>0</state><real_name>0</real_name><source>k-v</source><d p="10.16400,1,25,16646914,1588986375,0,5f22bd5c,32424172229492743">test</d><d p="4.47900,1,25,16777215,1588986351,0,5f22bd5c,32424159570558983">test</d></i>
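If you would rather stay with urllib, a sketch along these lines might work. It assumes the server marks the body with a Content-Encoding header of deflate or gzip, which the thread does not actually show, so treat that as a guess. decompress_body and fetch_xml are hypothetical helper names, not part of urllib or bs4.

```python
import gzip
import urllib.request as req
import zlib
from typing import Optional


def decompress_body(raw: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the HTTP Content-Encoding that urllib leaves in place."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        return gzip.decompress(raw)
    if encoding == 'deflate':
        try:
            # Standard form: a deflate stream wrapped in a zlib header.
            return zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream with no zlib header.
            # decompressobj also tolerates a stream with no final block.
            return zlib.decompressobj(-zlib.MAX_WBITS).decompress(raw)
    # No (or unknown) Content-Encoding: hand the bytes back untouched.
    return raw


def fetch_xml(url: str) -> bytes:
    """Download url and return the decompressed bytes (assumed header)."""
    with req.urlopen(url) as response:
        raw = response.read()
        return decompress_body(raw, response.headers.get('Content-Encoding'))
```

The bytes returned by fetch_xml(url) can then be passed to BeautifulSoup(..., features='xml') in the same way as website.content above.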