Python Forum
How to convert Python crawled Bing web page content to human readable?



How to convert Python crawled Bing web page content to human readable? - dalaludidu - Sep-02-2018

Hi, I'm experimenting with crawling the Bing web search page using Python 3.
The raw content I receive is a bytes object, but it looks stranger than usual, and my attempt to decompress it failed.
So I don't know what data format this content is in or what I should do with it.
Does anyone know what kind of data this is and how I can extract readable text from it? Thanks!

The code below prints the raw content and then tries to gunzip it, so you can see both the raw content and the error from the decompression.
Because the raw content is too long, I have pasted only the first few lines.

import urllib.request as Request
import gzip

req = Request.Request('http://www.bing.com')  # urllib needs the scheme in the URL
req.add_header('upgrade-insecure-requests', 1)
res = Request.urlopen(req).read()
print("RAW Content: %s" %ResPage) # show raw content of web
print("Try decompression:")
print(gzip.decompress(ResPage))   # try decompression
Output:
RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03...
...
Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')

Process finished with exit code 1



RE: How to convert Python crawled Bing web page content to human readable? - Gribouillis - Sep-02-2018

You could save the file and hand it to file type detection tools. There are online services for this such as checkfiletype.com (which I haven't tried).

On Linux, the 'file' command may find something.
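
For a quick check without leaving Python, the same idea can be sketched by comparing the first bytes against a few well-known signatures. This is only a minimal sketch: `guess_compression` is a hypothetical helper name, it covers just four formats, and formats with no fixed signature (Brotli is one) will simply come back as 'unknown'.

```python
import gzip
import zlib


def guess_compression(data: bytes) -> str:
    """Guess a compression format from leading signature bytes (hypothetical helper)."""
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    if data[:3] == b"BZh":
        return "bzip2"
    if data[:6] == b"\xfd7zXZ\x00":
        return "xz"
    # A zlib stream has no magic string, but its 2-byte header has
    # compression method 8 in the low nibble and is a multiple of 31.
    if len(data) >= 2 and data[0] & 0x0F == 8 and ((data[0] << 8) | data[1]) % 31 == 0:
        return "zlib"
    return "unknown"


print(guess_compression(gzip.compress(b"hello")))   # gzip
print(guess_compression(zlib.compress(b"hello")))   # zlib
print(guess_compression(b"+p\xe70\x0bi{)"))         # unknown (first bytes from the post above)
```

An 'unknown' result does not prove the data is uncompressed; it only means none of these signatures matched.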


RE: How to convert Python crawled Bing web page content to human readable? - dalaludidu - Sep-02-2018

(Sep-02-2018, 07:05 AM) Gribouillis wrote: You could save the file and hand it to file type detection tools. [...] In linux, the 'file' command may find something.

Thanks for the hint. I tried it, and checkfiletype.com reports the file type as 'application/octet-stream'.
I have no idea why a search engine would respond with such binary content.
Do you have any idea why this is happening?

Regards


RE: How to convert Python crawled Bing web page content to human readable? - metulburr - Sep-02-2018

I don't even get that with your code. I get HTML if I change your ResPage variable to res. Where do you define ResPage? You only define res.

Anyway, the more common approach now is to use the requests module.

import requests

r = requests.get('http://www.bing.com')
print(r.text)                     # response body decoded to str
print(r.headers['content-type'])  # media type (and usually charset) sent by the server
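
One reason a body can look like binary is that HTTP servers may compress it and name the scheme in the Content-Encoding response header. requests undoes gzip and deflate for you, while urllib.request returns the bytes as sent. Below is a rough sketch of doing that step by hand with the standard library only. `decode_body` is a hypothetical helper, and whether the bytes shown earlier in the thread were compressed at all is not established here.

```python
import gzip
import zlib
from typing import Optional


def decode_body(body: bytes, content_encoding: Optional[str] = None) -> bytes:
    """Undo the Content-Encoding of an HTTP body (hypothetical helper, stdlib only)."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "identity":
        return body
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)                   # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw deflate stream
    # Anything else (for example "br") needs a decoder the stdlib does not ship.
    raise ValueError(f"unsupported Content-Encoding: {encoding!r}")


print(decode_body(gzip.compress(b"<html></html>"), "gzip"))
```

With urllib the header value would come from something like `response.headers.get('Content-Encoding')` on the object returned by `urlopen`, read before calling `read()`.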



RE: How to convert Python crawled Bing web page content to human readable? - dalaludidu - Sep-02-2018

(Sep-02-2018, 02:24 PM) metulburr wrote: I dont even get that with your code. [...] Anyways the more common approach now is to use the requests module. [...]

Thanks a lot!
Sorry for the confusion about the variable names; I meant to use 'res' throughout.
It seems I need to abandon the old-school way of doing HTTP...
I have one more question: should I regard Requests as a complete replacement for the old urllib modules? Would you mind sharing your thoughts on this?

Regards
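
One concrete difference behind that question is that `r.text` in requests is already a str, while `urlopen(...).read()` returns bytes that you decode yourself. A minimal sketch of that decoding step, using only the standard library, might look like this. The helper names `charset_from_content_type` and `body_to_text` are made up for illustration, and falling back to UTF-8 is an assumption, not something the thread prescribes.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(content_type: Optional[str], default: str = "utf-8") -> str:
    """Pull the charset parameter out of a Content-Type header value."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def body_to_text(body: bytes, content_type: Optional[str]) -> str:
    """Decode response bytes to str using the declared charset (UTF-8 fallback assumed)."""
    return body.decode(charset_from_content_type(content_type), errors="replace")


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(body_to_text(b"<html>ok</html>", "text/html; charset=utf-8"))
```

When working with urllib directly, `response.headers.get_content_charset()` gives the same information, since the headers object is built on the same `email.message.Message` class.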