May-09-2020, 04:10 PM
I've been trying to build a web crawler, though I'm new to this.
And I've just run into something annoying.
import urllib.request as req
from bs4 import BeautifulSoup
import lxml  # parser backend needed for features='xml'

'''
file=open('datacom.txt','r')
xml=file.read()
'''
website = req.Request(url='http://comment.bilibili.com/182148299.xml')
dataset = req.urlopen(website).read()
print(dataset)
soup = BeautifulSoup(dataset, features='xml')
# Print the separator and the parsed document on separate lines.
print('-------------------------', soup, sep='\n')

And it turns out to be like this:
Output:b'\x8c\x92\xbdn\xdc8\x10\xc7\xfb}\nB\x80;Z\xcb\xe1\xd7p\x00I\xee\xee\t|\xf5A+qm\xc1\xbb\xd2aE\x1b{8ly\x87\x03\x0eW\\\x00\x17\xa9R\x06\xa9R\x04)l\xf8qb;\xa9\xf2\n\x81\xa4\xfd\xb0\xd7v\xe0\x05\x16\x18\xfe\x87\xfc\xf1G\x91\xc9\xd1r>c\x17~\xd1VM\x9dF\x10\x8b\x88\xf9\xbah\xca\xaa>I\xa3_\x8f\x7f9t\xd1Q\x96TYR\x9c\xe6\xa1\xf5\x8b\x0b\xbf\xc8\xba2\x9eT\xb3\xaa\xfb\xc7E3O\xc6\x0f\xba\xfd\xcc\xaa\xcc\xc0)C\xa8\xb5\x1b\xbaU\x99%\xf3\xaa\xed\xf6\xc9D2\xde\x94\xc9<_\xce\xaay\x152\x10]\xbc\x19%m\xc8\x83\xeff\x0eE\xb2\xf0\xf9\xec\xb7:\x9f\xf7\xd9n\x90\xb4\xcd\xf9\xa2\xf0\xd9\xd9\xe1E2^\xd7I\xc9~O#\x1b\x83\x04!8pi8XD\x94`8\x18\x87\x04\xa0Qs\xc1\xa7\xce\xa2\x86\xb2\xe0\n\x9c\x15\x16\xb4\xb4\x8a\xc8j\tQv\xf7\xcf\xff\xb7\x7f}\xba\xbd\xb9\xba\xbd\xba\xfcr\xfd\xf7\xf7\x9bw\xc9\xb8\\\xa3ul\xd0=\x8f6\x96\x9c\xdaC;\xadQKk\x14\n+1\xca\xa4Rj\x07\xb3\xb1B|\x01\xe6\x10\xdc>\xcc(\xa7PjK\xc6\x18\x13e\x83\xe1\x0e\xa7b\x0b\xcf\x1e\xdb\x91\xb3h\x81\x0bn\xa6RNJSp%\xb5\xd4\n\xb5\x92\xd2\x90B\xc2(\x0b\xbe\r;\x98\x8b\xa50;\x98\xd5\x96@oa(\x9f\xc2\x9c\x10\xce\x81s\xa4\xd4>\x0cb\x85\x0f`\x8f\xcd\x88\x0c\x17\x1c\x1c\x82r4\x1d`\x9aP[GJ\x82\x80\xa70B\xf5\x02\x0c\x8d\xee\xae\xc0O%\xa06b\x80\xa1\xd3ZY\x92BuW\xf0\x18\xa6cr\xf4\x12\xcc\x02r\xc1\x8b\xb2(s \x1a`N*\x89\xd6\x12Y\xf3\xe4\x98"\x16\xdd\x8f\xd3>\x0cI\x18\xad\x88\xcbG\xf7i:EcH\n\xd2HQV\xfa)\x9b\x9c\x87\xd0\xd4,4g\xe1\xcf\x11c\x8c\x05\xbf\x0cit\xff\xe1\xfa\xdb\xdb\xff\xcc\xfd\xfb7Q\x9f.S#\x0e\xfa\xea\x8fm\x15\xf2\xc5\x89\x0f,e\xad\xf7g\xc3\xe2>\xae\xe6\x9e\xa5\xcc\xb4}\xb2\x1a\xad^\xe7k\xbbO\xbc\xefK\x02\x08\xa5\x94Vt\xef\xaf\xf3\xed\xf4X>\xecV4u\xf0ug\x10\x1d7l\xe2\xfb\xa0\xaa\xcf}\xb9\x91\x06\xb1\xb5Fs0Z\x8dZ\xbf]\xbdd)\xdbtY\xca\xfa>\x83v\x14N}\xcd\x86y+\xa6\xda\xd7\xd9\x93\x14z\xdf\x9e\xfaGJh\x1c9\xf8\xa9\xfd\xdd\xe7\x7f\xbf~\xbc\\K\xb3\x94i\xb3\xf3\xd2\x9d\xf6\xda{\xc5\xcc\xe03\xae\xb2\x1f\x01\x00\x00\xff\xff'
-------------------------
<?xml version="1.0" encoding="utf-8"?>
But it should actually look like this:
Quote:
<?xml version="1.0" encoding="UTF-8"?><i><chatserver>chat.bilibili.com</chatserver><chatid>182148299</chatid><mission>0</mission><maxlimit>1500</maxlimit><state>0</state><real_name>0</real_name><source>e-r</source><d p="4.47900,1,25,16777215,1588986351,0,5f22bd5c,32424159570558983">test</d><d p="10.16400,1,25,16646914,1588986375,0,5f22bd5c,32424172229492743">test</d></i>

I've also tried to decode it with 'utf-8':
def getmark(self):
    # Method of my crawler class; self.cid holds the comment id as a string.
    loc_list = ['http://comment.bilibili.com/', self.cid, '.xml']
    loc = ''.join(loc_list)
    print(loc)
    website = req.Request(url=loc)
    dataset = req.urlopen(website).read()
    info = dataset.decode('UTF-8')
    print(info)
    soup = BeautifulSoup(dataset, features="xml")
    print(soup)

And then comes the error:
Error:UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 1: invalid start byte
When I redo it with info=dataset.decode('UTF-8',errors='ignore'), it just throws almost everything away:
Output:7nq1~ee("쳬M'
<?xml version="1.0" encoding="utf-8"?>
I can tell the website is encoded with 'utf-8', not 'gbk', because I've already checked it in the browser with 'document.charset'.
I may have made loads of mistakes in how I described this, because I basically know nothing about the terminology, sorry.
If you have any idea, leave a comment. I really need your help :p
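
A possible lead, not a confirmed diagnosis: the printed bytes end in \x00\x00\xff\xff, which is the marker a deflate stream carries after a sync flush, so the body may be compressed rather than wrongly encoded. A browser would decompress it silently, which would explain why document.charset still reports UTF-8. The sketch below assumes the server signals this with a Content-Encoding header. The helper name decode_body is made up for illustration, and the demo uses locally compressed sample data instead of a real request.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress a response body if needed, then decode it to text.

    Hypothetical helper, not part of urllib or bs4.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            # zlib-wrapped deflate, the form the HTTP spec describes.
            raw = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Raw deflate without the zlib header. decompressobj also
            # tolerates a stream that was flushed but never finished.
            raw = zlib.decompressobj(-zlib.MAX_WBITS).decompress(raw)
    return raw.decode(charset)


# Local demo (no network): build a raw, sync-flushed deflate stream.
sample = '<?xml version="1.0" encoding="UTF-8"?><i><d p="4.47900,1,25">test</d></i>'
compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
body = compressor.compress(sample.encode("utf-8"))
body += compressor.flush(zlib.Z_SYNC_FLUSH)

print(body[-4:])                     # trailing sync-flush marker
print(decode_body(body, "deflate"))  # the original XML text
```

With the code from the post, this would be called as resp = req.urlopen(website) followed by decode_body(resp.read(), resp.headers.get('Content-Encoding')), and the returned string can then be passed to BeautifulSoup. If the header turns out to be missing, passing "deflate" explicitly is a quick way to test the guess.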