Python Forum

Full Version: I want to Download all .zip Files From A Website (Project AI)
Pages: 1 2 3 4 5 6 7
Hi there,

I downloaded .zip files a while back using Python code that snippsat and others on here very kindly helped me with. I would now like to download all the available Project AI .zip files from the www.flightsim.com website.

I tried to adapt the original code so that it would download all the .zip files from the www.flightsim.com website. Unsurprisingly, my adapted code won't download the files, but it gives no errors either; when run, it does nothing. The plane .zip files are not in plane categories this time: there are 253 pages, with .zip files on every page, about 2500 .zip files altogether.

The search ID is not the same each time you do a search; the number changes. You simply choose the category in the File Library (i.e. Project AI Files) and leave the search box blank if you want to search for all the .zip files. Here is my adapted code:

from bs4 import BeautifulSoup
import requests, zipfile, io, concurrent.futures

def download(number_id):
    a_zip = 'http://www.flightsim.com/vbfs/fslib.php?do=copyright&fid={}'.format(number_id)
    with open('{}.zip'.format(number_id), 'wb') as f:
        f.write(requests.get(a_zip).content)

if __name__ == '__main__':
    file_id = list(range(1,50))
    with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
        for number_id in file_id:
            executor.submit(download, number_id)

def get_zips(zips_page):
    # print(zips_page)
    zips_source = requests.get(zips_page).text
    zip_soup = BeautifulSoup(zips_source, "html.parser")
    for zip_file in zip_soup.select("a[href*=fslib.php?searchid=65822324&page=]"):
        zip_url = link_root + zip_file['href']
        print('downloading', zip_file.text, '...',)
        r = requests.get(zip_url)
        with open(zip_file.text, 'wb') as zipFile:
            zipFile.write(r.content)


def download_links(root, page):        
    url = ''.join([root, page])      
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")

    for zips_suffix in soup.select("a[href*=fslib.php?do=copyright&fid=]"):
        # get_zips(root, zips_suffix['href'])
        next_page = ''.join([root, zips_suffix['href']])
        get_zips(next_page)


link_root = 'http://www.flightsim.com/vbfs/fslib.php?'

page = 'do=copyright&fid='
download_links(link_root, page)
Can someone help me make corrections to my code, or point me in the right direction?

Any help would be much appreciated.

Eddie
Also, this is later Python code. Can it be adapted so that, instead of the last number of planes etc., it uses the last number of pages out of the 253 total? Here is the code that was used for the Project AI website .zip files:

from bs4 import BeautifulSoup
import requests
from tqdm import tqdm, trange
from itertools import islice
 
def all_planes():
    '''Generate url links for all planes'''
    url = 'http://web.archive.org/web/20031124231537/http://www.projectai.com:80/libraries/acfiles.php?cat=6'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plain_link = [link.find('a').get('href') for link in td]
    for ref in tqdm(plain_link):
         url_file_id = 'http://web.archive.org/web/20031124231537/http://www.projectai.com:80/libraries/{}'.format(ref)
         yield url_file_id
 
def download(all_planes):
    '''Download the zips for one plane; feed it more URLs to download all planes'''
    # A_300 = next(all_planes())  # Test with first link
    last_47 = islice(all_planes(), 25, 72)
    for plane_url in last_47:
        url_get = requests.get(plane_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'http://web.archive.org/web/20031124231537/http://www.projectai.com:80/libraries/download.php?fileid={}'
        for item in tqdm(td):
            zip_name = item.text
            zip_number = item.find('a').get('href').split('=')[-1]
            with open(zip_name, 'wb')  as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)
 
if __name__ == '__main__':
    download(all_planes)
Eddie
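One possible way to adapt the islice idea to pages rather than planes, as a rough sketch: generate the page URLs lazily and slice out the range you want. The page_urls helper and the searchid value are assumptions for illustration only; the searchid=...&page= pattern is taken from the selector in the first script, and the real searchid changes with every search.

```python
from itertools import islice

BASE = 'https://www.flightsim.com/vbfs/fslib.php'


def page_urls(search_id, total_pages=253):
    '''Yield the URL of each result page, 1..total_pages (hypothetical helper)'''
    for page in range(1, total_pages + 1):
        yield '{}?searchid={}&page={}'.format(BASE, search_id, page)


# islice(iterable, start, stop) counts from 0, so this takes pages 244..253
last_10 = list(islice(page_urls(65822324), 243, 253))
print(last_10[0])
print(last_10[-1])
```

Changing the two numbers passed to islice selects any other range of pages, in the same way last_47 selects a range of planes above.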
You need to create a new session with requests.Session().

import sys
import getpass
import hashlib
import requests


BASE_URL = 'https://www.flightsim.com/'


def do_login(credentials):
    session = requests.Session()
    session.get(BASE_URL)
    req = session.post(BASE_URL + LOGIN_PAGE, params={'do': 'login'}, data=credentials)
    if req.status_code != 200:
        print('Login not successful')
        sys.exit(1)
    # session is now logged in
    return session


def get_credentials():
    username = input('Username: ')
    password = getpass.getpass()
    password_md5 = hashlib.md5(password.encode()).hexdigest()
    return {
        'cookieuser': 1,
        'do': 'login',
        's': '',
        'securitytoken': 'guest',
        'vb_login_md5_password': password_md5,
        'vb_login_md5_password_utf': password_md5,
        'vb_login_password': '',
        'vb_login_password_hint': 'Password',
        'vb_login_username': username,
        }


credentials = get_credentials()
session = do_login()

Searching for files works without a session. Downloading files needs a valid user login.
I made some example code to try it out, but so far without success.
EDIT: It seems that the user is still not logged in. Maybe I'm sending the wrong parameters to the form.
The way to solve something like this is to get down to the basics.
Your code seems to run OK up to step 30.
The URL for the first request is:
Output:
http://web.archive.org/web/20031124231537/http://www.projectai.com:80/libraries/download.php?fileid={3810}
So try that by itself with requests:
import requests

url = 'http://web.archive.org/web/20031124231537/http://www.projectai.com:80/libraries/download.php?fileid={3810}'

response = requests.get(url)
print('status code: {}'.format(response.status_code))
if response.status_code == 200:
    print('saving page')
    with open('results.html', 'wb') as fp:
        fp.write(response.content)
It returns a 404 error, which is:
Quote: 404 Not Found
The requested resource could not be found but may be available in the future. Subsequent requests by the client are permissible.

If you try that URL by itself (in a browser), it brings you to a Wayback Machine error page:
Output:
Hrm. The Wayback Machine has not archived that URL. This page is not available on the web because page does not exist
Try it!
If you can find the actual URL, then you can go from there (use dead-eye's code).
NOTE: a session is a good idea, but not strictly needed to download zip files; I do it all the time.
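On the 404 above, one thing worth checking before concluding the file is missing: str.format replaces the {} placeholder entirely, so the URL built by the earlier download script should contain fileid=3810 with no braces around the number. The sketch below only builds the string and makes no request; 3810 is the id quoted above.

```python
zip_url = ('http://web.archive.org/web/20031124231537/'
           'http://www.projectai.com:80/libraries/download.php?fileid={}')

# format() substitutes the whole placeholder, braces included
url = zip_url.format(3810)
print(url)
```

Whether the Wayback Machine has archived the brace-free URL is not something this thread confirms, so it still needs to be tried.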
You need a session.

Just try it with your browser: https://www.flightsim.com/vbfs/fslib.php...fid=202702

If you see this, then you're logged in: [Image: www.flightsim.com]

If not, you see this: [Image: www.flightsim.com]
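Since the two screenshots are not visible here, one way to tell the cases apart in code is to check whether the response body is really a zip archive before saving it. This is a sketch using only the standard library; looks_like_zip is a hypothetical helper name, and it assumes that a logged-out request gets an HTML page back instead of the archive.

```python
import io
import zipfile


def looks_like_zip(payload):
    '''Return True if the bytes in payload form a readable zip archive'''
    return zipfile.is_zipfile(io.BytesIO(payload))


# Build a tiny zip in memory to stand in for a real download
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('readme.txt', 'hello')

print(looks_like_zip(buffer.getvalue()))              # True
print(looks_like_zip(b'<html>Please log in</html>'))  # False
```

With requests, the bytes to pass in would be response.content.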
Hi guys,

Hi deadeye, I get the following error when I run your code:

Error:
Warning (from warnings module):
  File "C:\Python34\lib\getpass.py", line 101
    return fallback_getpass(prompt, stream)
GetPassWarning: Can not control echo on the terminal.
Warning: Password input may be echoed.
Password: duxforded1
Traceback (most recent call last):
  File "C:\Users\Edward\Desktop\Python 3.4.3\number 10.py", line 42, in <module>
    session = do_login()
TypeError: do_login() missing 1 required positional argument: 'credentials'
Any ideas what is going wrong there ?
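The TypeError in that traceback points at the last line of the posted script: do_login is defined with a credentials parameter but is called with no argument. A sketch of a corrected version follows. Note that LOGIN_PAGE is never defined in the posted code, so here the login URL is passed in as a parameter and its real value is left open. Passing the session in as an argument is also a change made here for illustration, and whether the form fields are right for the site is untested.

```python
import sys


def do_login(session, login_url, credentials):
    '''Post the credentials with an existing session and return that session'''
    response = session.post(login_url, params={'do': 'login'}, data=credentials)
    if response.status_code != 200:
        print('Login not successful')
        sys.exit(1)
    return session


# Intended use (login_url is a placeholder; the real path is not given above):
#     import requests
#     credentials = get_credentials()
#     session = do_login(requests.Session(), login_url, credentials)
```

A 200 status only shows that the POST was accepted, not that the login worked, which fits the EDIT in the earlier post.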

The download links have the following path: https://www.flightsim.com/vbfs/fslib.php...yright&fid=

with a unique number after the = sign.

The page links have the following path: https://www.flightsim.com/vbfs/fslib.php...37849&page= with the page number after the = sign; there are 253 pages in total. The searchid= number changes each time you do a search.

Larz60+, I am not using the Project AI website paths this time; I am using the Flightsim.com website paths. I appreciate both of you helping me.
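To illustrate the "unique number after the = sign", here is a small sketch that pulls the fid value out of a download link with the standard library. The example link follows the fslib.php?do=copyright&fid= pattern used in the first script, with 64358 as the id; file_id is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs


def file_id(download_url):
    '''Return the fid query parameter of a download link as an int'''
    query = parse_qs(urlparse(download_url).query)
    return int(query['fid'][0])


link = 'https://www.flightsim.com/vbfs/fslib.php?do=copyright&fid=64358'
print(file_id(link))  # 64358
```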
I have found out through View Page Source (right mouse click) that another path to the Project AI file section is:

https://www.flightsim.com/vbfs/fslib.php...ch&fsec=62
The following will extract the page links from the web page in your last post and print the URLs that reference pages
(the indexes for the remaining pages).

It will also print the download links and zip file names.
The actual download links appear to be like: https://www.flightsim.com/vbfs/fslib.php...&fid=64358
import requests
from bs4 import BeautifulSoup


class MyAttempt:
    def __init__(self):
        self.build_catalog()

    def build_catalog(self):
        page1_url = 'https://www.flightsim.com/vbfs/fslib.php?searchid=65842563'
        page = self.get_page(page1_url)
        if page is None:
            # get_page already reported the failure; nothing to parse
            return
        soup = BeautifulSoup(page, 'lxml')
        # find_all is the current name for the older findAll alias
        for link in soup.find_all('a', href=True):
            url = link['href']
            text = link.text
            if 'page=' in url:
                print(f'page in url: {url}\ntext: {text}\n')
            if 'copyright' in url:
                print(f'actual download link: {url}\ntext: {text}\n')

    def get_page(self, url):
        ok_status = 200
        page = None
        response = requests.get(url, allow_redirects=False)
        if response.status_code == ok_status:
            page = response.content
        else:
            print(f'Could not load url: {url}')
        return page


if __name__ == '__main__':
    MyAttempt()
Please note copyright!
That output looks a little messy to me, @Larz60+.

@eddywinch82, look at the code I did before here;
it had some fancy stuff like a progress bar and itertools.islice to pick any .zip file range wanted.

A quick test with the link in your last post:
from bs4 import BeautifulSoup
import requests

url = 'https://www.flightsim.com/vbfs/fslib.php?searchid=65852160'
base_url = 'https://www.flightsim.com/vbfs'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
zip_links = soup.find_all('div', class_="fsc_details")
for link in zip_links:
    print(link.find('a').text)
    print('-------------')
    print(f"{base_url}/{link.find('a').get('href')}")
Output:
paidf042.zip
https://www.flightsim.com/vbfs/fslib.php?do=copyright&fid=64358
--------------------------
paidf041.zip
https://www.flightsim.com/vbfs/fslib.php?do=copyright&fid=64357
--------------------------
paidf040.zip
https://www.flightsim.com/vbfs/fslib.php?do=copyright&fid=64356
--------------------------
paidf039.zip
https://www.flightsim.com/vbfs/fslib.php?do=copyright&fid=64355
--------------------------
........
So that gives each .zip name together with its download link; if you were logged in, you could download all the .zip files for that page.
Or write code that goes through all the pages (a simple page system: 2, 3, 4, etc.) and downloads them.
How do I do that, snippsat? Thanks guys, for all your input.
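One way this could be approached, as a rough sketch rather than a tested solution: parse each result page for the fsc_details blocks used in the quick test above, then loop over the page numbers with a logged-in session. The function names are hypothetical, the session is assumed to be an already logged-in requests.Session, and whether the do=copyright link returns the archive directly or an intermediate page is not confirmed in this thread.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://www.flightsim.com/vbfs/'


def parse_zip_links(html):
    '''Return (zip_name, download_url) pairs found in one result page'''
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for details in soup.find_all('div', class_='fsc_details'):
        anchor = details.find('a', href=True)
        if anchor is not None:
            pairs.append((anchor.text.strip(), urljoin(BASE_URL, anchor['href'])))
    return pairs


def download_all(session, search_url, last_page):
    '''Visit pages 1..last_page and save every zip (session must be logged in)'''
    for page in range(1, last_page + 1):
        # requests appends page=N to the query string already in search_url
        listing = session.get(search_url, params={'page': page})
        for name, url in parse_zip_links(listing.text):
            response = session.get(url)
            # Skip HTML error or login pages; zip archives start with b'PK'
            if response.content[:2] != b'PK':
                print('skipped', name)
                continue
            with open(name, 'wb') as f_out:
                f_out.write(response.content)
```

A call would look like download_all(session, 'https://www.flightsim.com/vbfs/fslib.php?searchid=65852160', 253), with a fresh searchid taken from a new search, since the number changes each time.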