Python Forum

Hi, I'm experimenting with crawling the Bing web search page using Python 3.
The raw content I receive is a bytes object, but it looks stranger than usual, and my attempt to decompress it failed.
So now I have no idea what data format this content is in or what I should do with it.
Does anyone have a clue what kind of data this is and how I can extract readable text from it? Thanks!

The code pasted below prints the raw content and then tries to gunzip it, so you can see both the raw content and the error from the decompression.
Because the raw content is too long, I have pasted only the first few lines.

import urllib.request as Request
import gzip

req = Request.Request('http://www.bing.com')  # urllib needs the scheme in the URL
req.add_header('upgrade-insecure-requests', '1')
res = Request.urlopen(req).read()
# NOTE: 'ResPage' is never defined; it should be 'res' (discussed in the replies below)
print("RAW Content: %s" % ResPage)  # show raw content of the page
print("Try decompression:")
print(gzip.decompress(ResPage))     # try decompression

Output:
RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03... ...

Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')


Process finished with exit code 1

You could save the file and hand it to file type detection tools. There are online services for this such as checkfiletype.com (which I haven't tried).

On Linux, the 'file' command may find something.
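
Tools like 'file' mostly work by comparing the first bytes of the data against known signatures ("magic numbers"). The same idea can be sketched in a few lines of Python. This is only an illustration: sniff_format and MAGIC_SIGNATURES are hypothetical names, and the table is a small hand-picked subset, nowhere near what 'file' knows about.

```python
# Illustrative subset of well-known file signatures (not exhaustive).
MAGIC_SIGNATURES = [
    (b"\x1f\x8b", "gzip"),
    (b"PK\x03\x04", "zip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]


def sniff_format(data: bytes) -> str:
    """Guess a format name from the leading bytes; hypothetical helper."""
    for magic, name in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return name
    # Plain HTML has no binary signature, so look at the leading text instead.
    head = data[:64].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

A format that carries no header at all will simply come back as "unknown", so a negative result does not say much about what the data is.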

(Sep-02-2018, 07:05 AM)Gribouillis Wrote: [ -> ]You could save the file and hand it to file type detection tools. There are online services for this such as checkfiletype.com (which I haven't tried).

On Linux, the 'file' command may find something.

Thanks for the hint. I tried it, and checkfiletype.com tells me the file type is 'application/octet-stream'.
I have no idea why a search engine would respond with such a binary stream.
Do you have any idea why this is happening?

Regards
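
One possible explanation, offered as a guess rather than something confirmed in this thread: the body may be compressed with a scheme other than gzip. The OSError only says that the data does not start with the gzip magic bytes 0x1f 0x8b. A server names the scheme it used in the Content-Encoding response header, so checking that header before decompressing is a reasonable first step. Below is a minimal sketch using only the standard library. decode_body is a hypothetical helper name; it handles gzip and deflate only and raises for anything else (for example br, which would need a third-party Brotli package).

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Undo the reported Content-Encoding, if it is one handled here."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("", "identity"):
        return raw  # nothing to undo
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    raise ValueError("unsupported Content-Encoding: %r" % encoding)
```

With the code from the first post, this could be called as decode_body(response.read(), response.headers.get('Content-Encoding', '')), where response is the object returned by urlopen.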

I don't even get that with your code. I get HTML if I change your ResPage variable to res. Where do you define ResPage? You only define res.

Anyway, the more common approach now is to use the requests module.

import requests
 
r = requests.get('http://www.bing.com')
print(r.text)
print(r.headers['content-type'])
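
For comparison, here is a rough sketch of the same fetch with the standard library only. requests decompresses gzip and deflate bodies and decodes the text for you; with urllib those steps are manual. The sketch asks explicitly for an uncompressed body via Accept-Encoding: identity and decodes with the charset from the response headers, falling back to UTF-8 (that fallback is an assumption). build_request and fetch_text are hypothetical helper names.

```python
import urllib.request


def build_request(url: str) -> urllib.request.Request:
    """Build a request that asks the server not to compress the body."""
    req = urllib.request.Request(url)
    req.add_header("Accept-Encoding", "identity")
    return req


def fetch_text(url: str) -> str:
    """Fetch a page and decode it to str (performs a network request)."""
    with urllib.request.urlopen(build_request(url)) as res:
        # Charset declared in the Content-Type header, if any.
        charset = res.headers.get_content_charset() or "utf-8"
        return res.read().decode(charset, errors="replace")
```

Calling fetch_text('http://www.bing.com') would then play the role of r.text above.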

(Sep-02-2018, 02:24 PM)metulburr Wrote: [ -> ]I don't even get that with your code. I get HTML if I change your ResPage variable to res. Where do you define ResPage? You only define res.

Anyway, the more common approach now is to use the requests module.
import requests
 
r = requests.get('http://www.bing.com')
print(r.text)
print(r.headers['content-type'])

Thanks a lot!
Sorry for the confusion about the variable names; I meant to use 'res' throughout.
It seems I need to abandon the old-school way of doing HTTP things...
I have just one more question: should I regard Requests as a total replacement for the old urllib modules? Would you mind sharing your thoughts on this?

Regards

dalaludidu

Gribouillis

dalaludidu

metulburr

dalaludidu