Sep-02-2018, 05:09 AM
Hi, I'm experimenting with crawling the Bing web search page using Python 3.
The raw content I receive looks like a bytes object, though odder than usual, and my attempt to decompress it failed.
So I have no idea what format this content is in or what I should do with it.
Does anyone know what kind of data this is, and how I can extract readable text from it? Thanks!
The code pasted below prints the raw content and then tries to gunzip it, so you can see both the raw content and the decompression error.
Because the raw content is too long, I have pasted only the first few lines.
import urllib.request as Request
import gzip

req = Request.Request('https://www.bing.com')  # urlopen needs a full URL, including the scheme
req.add_header('upgrade-insecure-requests', '1')
ResPage = Request.urlopen(req).read()
print("RAW Content: %s" % ResPage)  # show raw content of web
print("Try decompression:")
print(gzip.decompress(ResPage))  # try decompression
Output:
RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03... ...
Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')
Process finished with exit code 1
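
For context on the error message: a gzip stream always begins with the two bytes 0x1f 0x8b, and the content above begins with b'+p', which is why gzip.decompress rejects it. A minimal sketch of that check follows. The helper names looks_like_gzip and gunzip_if_possible are hypothetical, made up for illustration, and the sample is only the first few bytes of the raw content shown above. It does not identify what the data actually is.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(data: bytes) -> bool:
    """Return True if data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def gunzip_if_possible(data: bytes):
    """Decompress data if it carries the gzip magic number, else return None.

    Hypothetical helper: it only avoids the OSError, it does not
    work out which other format the data might be in.
    """
    if not looks_like_gzip(data):
        return None
    return gzip.decompress(data)


# Leading bytes copied from the raw content in the post
sample = b"+p\xe70\x0bi{)"
print(looks_like_gzip(sample))                       # False
print(gunzip_if_possible(gzip.compress(b"hello")))   # b'hello'
```

If the server sent a Content-Encoding response header, its value names the compression scheme that was applied, so printing the response headers next to the body is a reasonable next step.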