Python Forum
How to convert Bing web page content crawled with Python into human-readable text?
#1
Hi, I'm playing with crawling the Bing web search page using Python 3.
The raw content I receive looks like bytes, though stranger than usual, and my attempt to decompress it failed.
So now I have no idea what data format this content is in or what I should do with it.
Does anyone have a clue what kind of data this is, and how I can extract readable text from it? Thanks!

My code, pasted below, prints the raw content and then tries to gunzip it, so you can see both the raw content and the error from the decompression.
Because the raw content is too long, I have pasted only the first few lines.

import urllib.request as Request
import gzip

req = Request.Request('http://www.bing.com')  # urlopen needs a URL with a scheme
req.add_header('upgrade-insecure-requests', '1')
res = Request.urlopen(req).read()
print("RAW Content: %s" %ResPage) # show raw content of web
print("Try decompression:")
print(gzip.decompress(ResPage))   # try decompression
Output:
RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03...
...
Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')

Process finished with exit code 1
#2
You could save the file and hand it to file type detection tools. There are online services for this such as checkfiletype.com (which I haven't tried).

On Linux, the 'file' command may be able to identify it.
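
For a quick check without leaving Python, the first few bytes of the payload can be compared against well-known magic numbers. This is only a minimal sketch: `sniff_compression` and `MAGIC_SIGNATURES` are made-up names, the signature table is deliberately tiny, and formats without a fixed signature (Brotli, for example) will simply come back as unknown.

```python
import gzip

# Hypothetical lookup table: leading "magic" bytes -> format name.
MAGIC_SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"PK\x03\x04": "zip",
    b"\x28\xb5\x2f\xfd": "zstd",
}


def sniff_compression(data: bytes) -> str:
    """Return a best-guess format name, or 'unknown' if no signature matches."""
    for magic, name in MAGIC_SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


# Small offline demo: a freshly gzipped payload versus the first bytes from the post.
print(sniff_compression(gzip.compress(b"hello")))  # gzip
print(sniff_compression(b"+p\xe70\x0bi{)"))         # unknown
```

An "unknown" result does not prove the data is uncompressed; it only means none of the listed signatures matched.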
#3
(Sep-02-2018, 07:05 AM)Gribouillis Wrote: You could save the file and hand it to file type detection tools. There are online services for this such as checkfiletype.com (which I haven't tried).

On Linux, the 'file' command may be able to identify it.

Thanks for the hint. I tried it, and checkfiletype.com tells me the file type is 'application/octet-stream'.
I have no idea why a search engine would respond with a binary stream like this.
Do you have any idea why this is happening?

Regards
#4
I don't even get that with your code. I get HTML if I change your ResPage variable to res. Where do you define ResPage? You only define res.

Anyway, the more common approach now is to use the requests module.

import requests
 
r = requests.get('http://www.bing.com')
print(r.text)
print(r.headers['content-type'])
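
If you stay with urllib, keep in mind that it does not undo any Content-Encoding for you, whereas requests transparently decodes gzip and deflate bodies before filling r.text. Below is a rough sketch of doing that step by hand, assuming the server labels the body in its Content-Encoding header. `decode_body` is a made-up helper name, and any other encoding (for example 'br', Brotli) is only reported as unsupported, since decoding it needs a third-party package. Whether that is what happened with the bytes in the first post is a guess; the thread does not show the response headers.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str = "", charset: str = "utf-8") -> str:
    """Undo gzip/deflate Content-Encoding, then decode the bytes to text."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        body = gzip.decompress(body)
    elif encoding == "deflate":
        try:
            # Most servers send a zlib-wrapped stream for "deflate".
            body = zlib.decompress(body)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            body = zlib.decompress(body, -zlib.MAX_WBITS)
    elif encoding not in ("", "identity"):
        raise ValueError(f"unsupported Content-Encoding: {encoding!r}")
    # errors="replace" keeps the sketch from crashing on a wrong charset guess.
    return body.decode(charset, errors="replace")
```

With urllib this could be used as decode_body(res.read(), res.headers.get("Content-Encoding", "")), where res is the object returned by urlopen.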
#5
(Sep-02-2018, 02:24 PM)metulburr Wrote: I don't even get that with your code. I get HTML if I change your ResPage variable to res. Where do you define ResPage? You only define res.

Anyway, the more common approach now is to use the requests module.

import requests
 
r = requests.get('http://www.bing.com')
print(r.text)
print(r.headers['content-type'])

Thanks a lot!
Sorry for the confusion about the variable names; I meant to use 'res' throughout.
It seems I need to abandon the old-school way of doing HTTP...
I have just one more question: should I regard Requests as a complete replacement for the old urllib modules? Would you mind sharing your thoughts on this?

Regards