I want to Download all .zip Files From A Website (Project AI)
(Aug-26-2018, 11:18 AM)eddywinch82 Wrote: How do I do that, snippsat? Thanks, guys, for all your input.
Are you a member with a working username and password for that site?
You can see in @DeaD_EyE's post #3 that he tries to log in.
This can be hard to figure out for some sites.

I would use Selenium to do the login if there is too much struggle with Requests,
then hand the page source to BeautifulSoup for parsing.
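For illustration, a minimal sketch of that approach, assuming a login form is present and uses the vBulletin field names vb_login_username and vb_login_password (assumptions; inspect the page source to confirm):

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get('https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62')
# The field names below are assumed vBulletin defaults, not confirmed for this site.
browser.find_element(By.NAME, 'vb_login_username').send_keys('your_username')
password_box = browser.find_element(By.NAME, 'vb_login_password')
password_box.send_keys('your_password')
password_box.submit()
# Hand the rendered page source to BeautifulSoup for parsing.
soup = BeautifulSoup(browser.page_source, 'lxml')
browser.quit()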
Hi guys. snippsat, I tried logging in with Selenium instead of Requests (i.e. I used import selenium), and I can't log in with that module either; I get the same error message I got when running the code with Requests. Also, I have tried running both your code and Larz60's code for getting the file path data etc., and both give a syntax error when I run them in Python. I am assuming the code worked for you both, in both cases?

I have checked the code, and I have copied both versions correctly.

Also, snippsat, you said "Or write a code that goes through all pages (simple page system 2, 3, 4, etc...) and download." How do I do that?
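For reference, a page loop of that kind is usually just a formatted URL in a for loop; a minimal sketch, assuming (hypothetically) that the search results accept a page parameter such as &page=2, &page=3, and so on:

import requests
from bs4 import BeautifulSoup

# The page parameter here is an assumption; check the site's real pagination links.
base_url = 'https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62&page={}'
for page in range(1, 8):  # pages 1..7, as an example range
    url_get = requests.get(base_url.format(page))
    soup = BeautifulSoup(url_get.content, 'lxml')
    # ...extract and download the .zip links from this page here...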

Hi snippsat, I have attempted to adapt the code you did for me a while back for the Project AI website .zip files I wanted to download, but it hasn't worked. Where am I going wrong? Here is the adapted code :-

from bs4 import BeautifulSoup
import requests
from tqdm import tqdm
from itertools import islice

def all_planes():
    '''Generate url links for all planes'''
    url = 'https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plane_link = [link.find('a').get('href') for link in td]
    for ref in tqdm(plane_link):
        # NOTE: this searchid is hard-coded; vBulletin search ids are usually
        # session-specific and expire, so the value may need refreshing.
        url_file_id = 'https://www.flightsim.com/vbfs/fslib.php?searchid=65857709{}'.format(ref)
        yield url_file_id

def download(all_planes_pages):
    '''Download the zip for one plane; feed with more urls to download all planes'''
    # A_300 = next(all_planes())  # Test with first link
    first_253 = islice(all_planes_pages(), 0, 253)
    for plane_page_url in first_253:
        url_get = requests.get(plane_page_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'https://www.flightsim.com/vbfs/fslib.php?do=copyright&fid={}'
        for item in tqdm(td):
            zip_name = item.text  # assumed to be a usable file name as-is
            zip_number = item.find('a').get('href').split('=')[-1]
            with open(zip_name, 'wb') as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)

if __name__ == '__main__':
    download(all_planes)  # pass the generator function itself, not the result of a call
Eddie
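As an aside, when a scraper misbehaves it helps to test a single page first; a minimal check against the functions above might be:

# Fetch and print only the first generated plane-page URL as a smoke test.
first_page = next(all_planes())
print(first_page)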
The code worked for me, but today I see that the same URL is no longer accessible without a password, so someone has tightened the security.
Scraping is always touchy; what works today often will not work tomorrow.
If you haven't done so already, you should (need to?) run through snippsat's tutorials here:
part1
part2
Hi guys, I combined code I found from someone on the Internet for web-scraping .zip files
with your code, DeaD_EyE. Here is the combined code :-

import sys
import getpass
import hashlib
import requests
 
 
BASE_URL = 'https://www.flightsim.com/'
 
 
def do_login(credentials):
    session = requests.Session()
    session.get(BASE_URL)
    # NOTE: LOGIN_PAGE is never defined in this script; it must be set to the
    # forum's login path before do_login() can work.
    req = session.post(BASE_URL + LOGIN_PAGE, params={'do': 'login'}, data=credentials)
    if req.status_code != 200:
        print('Login not successful')
        sys.exit(1)
    # session is now logged in
    return session
 
 
def get_credentials():
    username = input('Username: ')
    password = getpass.getpass()
    password_md5 = hashlib.md5(password.encode()).hexdigest()
    return {
        'cookieuser': 1,
        'do': 'login',
        's': '',
        'securitytoken': 'guest',
        'vb_login_md5_password': password_md5,
        'vb_login_md5_password_utf': password_md5,
        'vb_login_password': '',
        'vb_login_password_hint': 'Password',
        'vb_login_username': username,
        }
 
 
credentials = get_credentials()
session = do_login(credentials)

import urllib2
from urllib2 import URLError
import os
from bs4 import BeautifulSoup


#Create a new directory to put the files into
#Get the current working directory and create a new directory in it named test
cwd = os.getcwd()
newdir = cwd +"\\test"
print "The current Working directory is " + cwd
os.mkdir(newdir, 0777)
print "Created new directory " + newdir
newfile = open('zipfiles.txt','w')
print newfile


print "Running script.. "
#Set variable for page to be open and url to be concatenated 
url = "http://www.flightsim.com"
page = urllib2.urlopen('https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62').read()

#File extension to be looked for. 
extension = ".zip"

#Parse the page with BeautifulSoup
soup = BeautifulSoup(page, 'lxml')

#Find all the links on the page that end in .zip
for anchor in soup.find_all('a', href=True):
    links = url + anchor['href']
    if links.endswith(extension):
        newfile.write(links + '\n')
newfile.close()

#Read what is saved in zipfiles.txt and output it to the user
#This is done to create presistent data 
newfile = open('zipfiles.txt', 'r')
for line in newfile:
    print line + '\n'
newfile.close()

#Read through the lines in the text file and download the zip files.
#Handle exceptions and print exceptions to the console
with open('zipfiles.txt', 'r') as url:
    for line in url:
        if line:
            try:
                ziplink = line.strip()
                #Remove the first 48 characters of the url to get the file name
                zipfile = ziplink[48:]
                #Remove the last 4 characters to drop the .zip extension
                zipfile2 = zipfile[:-4]
                print "Trying to reach " + ziplink
                response = urllib2.urlopen(ziplink)
            except URLError as e:
                if hasattr(e, 'reason'):
                    print 'We failed to reach a server.'
                    print 'Reason: ', e.reason
                    continue
                elif hasattr(e, 'code'):
                    print 'The server couldn\'t fulfill the request.'
                    print 'Error code: ', e.code
                    continue
            else:
                zipcontent = response.read()
                completeName = os.path.join(newdir, zipfile2 + ".zip")
                with open(completeName, 'wb') as f:
                    print "downloading.. " + zipfile
                    f.write(zipcontent)
print "Script completed"


But I get the following traceback error. The code runs OK initially, allowing me to type my username, but I get this error message after I hit Enter :-

Error:
Traceback (most recent call last):
  File "C:\Users\Edward\Desktop\Python 2.79\Web Scraping Code For .ZIP Files 3.py", line 38, in <module>
    credentials = get_credentials()
  File "C:\Users\Edward\Desktop\Python 2.79\Web Scraping Code For .ZIP Files 3.py", line 22, in get_credentials
    username = input('Username: ')
  File "<string>", line 1, in <module>
NameError: name '......' is not defined
Any ideas where I am going wrong?

Eddie
Line 22 is where you input your user name (I removed the actual username):
username = input('Username: ')
This is where the traceback is showing the error:
Error:
  File "C:\Users\Edward\Desktop\Python 2.79\Web Scraping Code For .ZIP Files 3.py", line 22, in get_credentials
    username = input('Username: ')
  File "<string>", line 1, in <module>
NameError: name '......' is not defined
The last line number is usually where the error is encountered; however, I don't see an issue here.

One last note: if you must use Python 2, would you at least put print statements in parentheses?
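For what it's worth, a single-argument print written with parentheses behaves the same under Python 2 and Python 3:

# In Python 2 the parentheses just group one string expression; in Python 3
# this is a normal function call, so the line runs under both versions.
print("Running script.. ")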
My username is eddywinch82. Where do I type that on line 22?

Should I type :-

eddywinch82 = input('Username: ')
You enter it in real time, while running the script.
That's what I was doing. Do you have an idea what the issue is here?
You're using antique Python; it's raw_input.
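To illustrate the difference, a minimal sketch, assuming a Python 2 interpreter:

# Python 2: raw_input() returns whatever is typed, as a string.
username = raw_input('Username: ')

# Python 2: input() is equivalent to eval(raw_input()), so typing the bare
# name eddywinch82 makes Python evaluate it as an expression and raise
# NameError: name 'eddywinch82' is not defined.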
I was using Python 3.4.3 before, and the same problem was occurring then.