Python Forum

Full Version: urllib2 issues
Is urllib2 a good choice for getting website contents? It is used in the code I posted regarding catching IOError failing. It gets my website OK, as well as Google. But when I append the search stuff it fails on both sites (while lynx and curl work). I cross-checked what urllib2 and lynx sent to a dummy local website and they both look right.
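A minimal sketch of one way to surface the actual error urllib2 is hitting (Google is known to answer the default Python user agent on /search with a 403, which urlopen raises as HTTPError; the URL is the one from the tests below):

import urllib2

url = 'http://www.google.com/search?hl=en&as_q=what+is+my+ip+address'
try:
    data = urllib2.urlopen(url, timeout=10).read()
    print('read %d bytes' % len(data))
except urllib2.HTTPError as e:
    # the server answered, but with an error status
    print('HTTP error: %d %s' % (e.code, e.msg))
except urllib2.URLError as e:
    # no HTTP conversation at all (DNS failure, refused connection, ...)
    print('URL error: %s' % e.reason)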

More testing:

Note that my website does not have a /search script, so I just used the apex page for these tests.

Output:
lt1/forums /home/forums 1> python filetopystr.py 0 'http://www.google.com/' ip 80|wc -c
connecting to 'http://www.google.com/'
connected  to 'http://www.google.com/'
reading  from 'http://www.google.com/'
read 10872 bytes
11478
lt1/forums /home/forums 2> python filetopystr.py 0 'http://linuxhomepage.com/' ip 80|wc -c
connecting to 'http://linuxhomepage.com/'
connected  to 'http://linuxhomepage.com/'
reading  from 'http://linuxhomepage.com/'
read 30994 bytes
32900
lt1/forums /home/forums 3> python filetopystr.py 0 'http://ipv6.linuxhomepage.com/' ip 80|wc -c
connecting to 'http://ipv6.linuxhomepage.com/'
connected  to 'http://ipv6.linuxhomepage.com/'
reading  from 'http://ipv6.linuxhomepage.com/'
read 30988 bytes
32894
lt1/forums /home/forums 4> python filetopystr.py 0 'http://www.google.com/search?hl=en&as_q=what+is+my+ip+address' ip 80|wc -c
connecting to 'http://www.google.com/search?hl=en&as_q=what+is+my+ip+address'
Error connecting to 'http://www.google.com/search?hl=en&as_q=what+is+my+ip+address'
0
lt1/forums /home/forums 5> python filetopystr.py 0 'http://linuxhomepage.com/?hl=en&as_q=what+is+my+ip+address' ip 80|wc -c
connecting to 'http://linuxhomepage.com/?hl=en&as_q=what+is+my+ip+address'
connected  to 'http://linuxhomepage.com/?hl=en&as_q=what+is+my+ip+address'
reading  from 'http://linuxhomepage.com/?hl=en&as_q=what+is+my+ip+address'
read 30994 bytes
32900
lt1/forums /home/forums 6> python filetopystr.py 0 'http://ipv6.linuxhomepage.com/?hl=en&as_q=what+is+my+ip+address' ip 80|wc -c
connecting to 'http://ipv6.linuxhomepage.com/?hl=en&as_q=what+is+my+ip+address'
connected  to 'http://ipv6.linuxhomepage.com/?hl=en&as_q=what+is+my+ip+address'
reading  from 'http://ipv6.linuxhomepage.com/?hl=en&as_q=what+is+my+ip+address'
read 30988 bytes
32894
lt1/forums /home/forums 7> lynx -mime_header 'http://www.google.com/search?hl=en&as_q=what+is+my+ip+address'|wc -c
19816
lt1/forums /home/forums 8> lynx -mime_header 'http://linuxhomepage.com/?hl=en&as_q=what+is+my+ip+address'|wc -c
29505
lt1/forums /home/forums 9> lynx -mime_header 'http://ipv6.linuxhomepage.com/?hl=en&as_q=what+is+my+ip+address'|wc -c
29499
lt1/forums /home/forums 10>
I use urllib.request (with requests as an alternative), and it has worked well for me for binary, text, and HTML files.
Here's the class I use:
# GetUrl - Fetch files or web pages from internet
#
# Author: Larz60+
import urllib.request as ur
import os
from time import sleep
import requests

class GetUrl:
    def __init__(self, returndata=False):
        self.returndata = returndata

    def get_url(self, url, tofile=None, binary=False):
        try:
            if tofile:
                if os.path.exists(tofile):
                    os.remove(tofile)
                if binary:
                    with open(tofile, 'wb') as f:
                        rdata = requests.get(url).content   # raw response bytes
                        # rdata = ur.urlopen(url).read()
                        f.write(rdata)
                else:
                    with open(tofile, 'w') as f:
                        rdata = requests.get(url).text      # response decoded to str
                        # rdata = ur.urlopen(url).read().decode('utf8')
                        f.write(rdata)
                sleep(.5)
            else:
                rdata = requests.get(url).text
                # rdata = ur.urlopen(url).read().decode('utf8')
                return rdata
        except Exception as e:
            print(str(e))

if __name__ == '__main__':
    # Note: requests only speaks HTTP(S); for an ftp:// URL like this one,
    # use the commented-out urllib.request lines above instead.
    url = 'ftp://ftp.nasdaqtrader.com/symboldirectory/phlxListedStrikesWithOptionIds.zip'
    tofile = r'G:\python\stock_market\symbols\data\DailyFiles\USA\phlxListedStrikesWithOptionIds.zip'
    p = GetUrl()
    p.get_url(url, tofile, binary=True)
Requests is a lot better than urllib2 (Python 2.x) and urllib.request (Python 3.x).
You get the correct encoding from the source, and don't have to guess as you do with urllib.request in Python 3.x.
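For example (a minimal sketch): Requests takes the charset from the Content-Type response header, so .text is already decoded.

import requests

r = requests.get('http://www.google.com/')
print(r.encoding)    # e.g. 'ISO-8859-1', taken from the response headers
print(r.text[:100])  # already decoded to str using that encoding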

To get JavaScript executed and rendered into the DOM, both urllib and Requests fall short.
For that I use Selenium/PhantomJS and e.g. send browser.page_source to BeautifulSoup.
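A minimal sketch, assuming Selenium with the PhantomJS driver and BeautifulSoup are installed:

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('http://www.google.com/')
# page_source holds the DOM after JavaScript has run
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup.title.text)
browser.quit()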
I'm just trying to get the result of a Google search for 'what is my ip address'. Google provides the requester's IP address in the results, and I intend to extract that. I'll be doing similar things at other websites too, so all I want is the results, given a URL, for a bunch of URLs I have.
So just standard web scraping, but why do you want to do that from a Google search?
The source is large, and parsing can break when Google changes its markup.
Can't you just parse it out from one of the many websites that do this, e.g. whatismyipaddress?
Remember that some have an API that gives back the IP in JSON too.
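For example, api.ipify.org is one such service (a minimal sketch):

import requests

r = requests.get('https://api.ipify.org?format=json', timeout=10)
r.raise_for_status()
print(r.json()['ip'])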
I prefer the requests module too. It can also handle cookies if needed.
But why are you asking Google for your IP? There are services that do exactly this. They even return different data formats, JSON for instance. Look at my first script ever.

http://python-forum.org/viewtopic.php?f=11&t=20075
(Dec-29-2016, 08:03 AM)snippsat Wrote: So just standard web scraping, but why do you want to do that from a Google search?
The source is large, and parsing can break when Google changes its markup.
Can't you just parse it out from one of the many websites that do this, e.g. whatismyipaddress?
Remember that some have an API that gives back the IP in JSON too.

I have seen random errors from more than one such site. I have also seen them (or their providers) down for various lengths of time (mostly at night, and one of them was down for two days in a row). I regularly use them now to routinely confirm that my VPN (which hides my activity from my ISP and gives me my own IPv6) is up. What I am trying to make is a tool that scrapes several of them in parallel, with timeouts, and gives me the address the majority find.

(Dec-29-2016, 08:22 AM)wavic Wrote: I prefer the requests module too. It can also handle cookies if needed.
But why are you asking Google for your IP? There are services that do exactly this. They even return different data formats, JSON for instance. Look at my first script ever.

http://python-forum.org/viewtopic.php?f=11&t=20075

Google is just one of the sites I will scrape. I also expect it to be the most reliable. I will be doing many of them.
Try changing the user agent to Mozilla or something else. By default urllib/urllib2 identify themselves as Python-urllib (requests does something similar), and some sites don't tolerate access from scripts.
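For example (a minimal sketch that works on both Python 2 and 3):

try:
    from urllib.request import Request, urlopen   # Python 3
except ImportError:
    from urllib2 import Request, urlopen          # Python 2

url = 'http://www.google.com/search?hl=en&as_q=what+is+my+ip+address'
# send a browser-style User-Agent instead of the default Python one
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
print(len(urlopen(req, timeout=10).read()))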
Oops!!! This project is one of a few that are stuck on Python's included modules. The script has to run as downloaded, without doing any installs anywhere. The user running it might not even have pip, and they might have only one of Python 2 or Python 3, so it must run correctly on everything (at least from 2.6 to 3.5). Getting the IP is just a part of this one project; it might also be needed in a couple more.
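A rough sketch of what I'm after, using only included modules and runnable on both Python 2 and 3 (the service URLs are just examples):

import json
import threading

try:
    from urllib.request import urlopen    # Python 3
except ImportError:
    from urllib2 import urlopen           # Python 2

SERVICES = [
    'https://api.ipify.org?format=json',
    'https://httpbin.org/ip',
]

def fetch_ip(url, results):
    try:
        raw = urlopen(url, timeout=5).read().decode('utf-8')
        data = json.loads(raw)
        # different services use different key names ('ip', 'origin', ...)
        ip = data.get('ip') or data.get('origin')
        if ip:
            results.append(ip)
    except Exception:
        pass  # a dead or slow service simply casts no vote

def majority_ip():
    results = []
    threads = [threading.Thread(target=fetch_ip, args=(u, results))
               for u in SERVICES]
    for t in threads:
        t.start()
    for t in threads:
        t.join(10)
    # count the votes and return the address most services agree on
    votes = {}
    for ip in results:
        votes[ip] = votes.get(ip, 0) + 1
    if not votes:
        return None
    return max(votes, key=votes.get)

if __name__ == '__main__':
    print(majority_ip())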
Changing the user agent doesn't affect how your code runs or require anything extra; it's just a header passed with the HTTP GET request.

See this

Here is a list of 'some' user agents