Python Forum
I want to Download all .zip Files From A Website (Project AI)
#41
Here's the deal: this site is very difficult to scrape.
The reason is that the download URL keeps changing (I would guess to prevent bots).
Try it; the URL you gave me no longer works, but it did when posted.
This is taking too much of my time, and proving much more difficult because of the moving target.
Reluctantly, I can't spend any more time on it, at least not today (I have surgery in the morning, so I have to prepare for that).

I would suggest getting the auto-password part DeaD_EyE gave you working first; then you can go to the first page and run the following to get all the links:

name this one: Fspaths.py
from pathlib import Path
import os


class Fspaths:
    """Build the 'data' directory tree and hold the paths and URLs the scraper uses."""

    def __init__(self):
        # Anchor all relative paths to the directory this script lives in
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        homepath = Path('.')

        self.datapath = homepath / 'data'
        self.datapath.mkdir(exist_ok=True)

        self.htmlpath = self.datapath / 'html'
        self.htmlpath.mkdir(exist_ok=True)

        self.flightsimpath = self.datapath / 'FlightSimFiles'
        self.flightsimpath.mkdir(exist_ok=True)

        self.page1_html = self.htmlpath / 'pagespan.html'
        self.links = self.flightsimpath / 'links.txt'

        # The searchid in this URL is session-specific; see the notes after the code
        self.base_catalog_url = 'https://www.flightsim.com/vbfs/fslib.php?searchid=65893537&page='


if __name__ == '__main__':
    Fspaths()
and this one: ScrapeUrlList.py
import Fspaths
from bs4 import BeautifulSoup
import requests


class ScrapeUrlList:
    def __init__(self):
        self.fpath = Fspaths.Fspaths()
        self.ziplinks = []

    def get_url(self, url):
        """Fetch a URL, returning the page content or None on failure."""
        page = None
        response = requests.get(url)
        if response.status_code == 200:
            page = response.content
        else:
            print(f'Cannot load URL: {url}')
        return page

    def get_catalog(self):
        """Visit every catalog page and write one 'title, url' line per zip to links.txt."""
        base_url = 'https://www.flightsim.com/vbfs'
        with self.fpath.links.open('w') as fp:
            for pageno in range(1, 254):
                url = f'{self.fpath.base_catalog_url}{pageno}'
                print(f'url: {url}')
                page = self.get_url(url)
                if page:
                    soup = BeautifulSoup(page, 'lxml')
                    zip_links = soup.find_all('div', class_="fsc_details")
                    for link in zip_links:
                        fp.write(f"{link.find('a').text}, {base_url}/{link.find('a').get('href')}\n")
                else:
                    print(f'No page: {url}')


def main():
    sul = ScrapeUrlList()
    sul.get_catalog()


if __name__ == '__main__':
    main()
The searchid is what changes, and you need to get a new seed before creating the download list (you can change the code to take it as an attribute; a sketch follows).
Then (not written yet) you need to use the created list to download the zip files.

This code will build a directory tree named 'data' wherever you put the scripts.
The links file is created in a subdirectory named FlightSimFiles.
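If you make the seed an attribute, a minimal sketch might look like this (untested; everything not shown stays exactly as in the listings above, and the old searchid is kept only as a default):

# In Fspaths.py -- only the signature and the URL line change
class Fspaths:
    def __init__(self, searchid='65893537'):
        # ... directory setup exactly as above ...
        self.base_catalog_url = (
            'https://www.flightsim.com/vbfs/fslib.php'
            f'?searchid={searchid}&page='
        )


# In ScrapeUrlList.py -- hand the session's fresh seed through
class ScrapeUrlList:
    def __init__(self, searchid='65893537'):
        self.fpath = Fspaths.Fspaths(searchid)
        self.ziplinks = []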
#42
Many thanks for this, Larz60+. Which version of Python will I need to run this code?
#43
Yes, it must be 3.6 or newer; I suggest installing 3.7.
You can use an older version, but you will have to remove all f-strings. These look like:
url = f'https://www.flightsim.com/vbfs/fslib.php?searchid=65893537&page={pageno}'
and can be replaced with:
url = 'https://www.flightsim.com/vbfs/fslib.php?searchid=65893537&page={}'.format(pageno)
but f-strings are so useful that I'd upgrade (to at least 3.6) for that alone.

Please recognize that this program will create a file of URLs from which you can get the .zips.
You can call get_url with each of these to download the zip, and then add a write routine to save 'page' with mode 'wb'.
It would be cleaner to add a 'savefile=None' parameter to get_url: if populated, save the page to that file. You'd probably also want a flag parameter,
default mode='w', to indicate the write mode.
example (untested):
    def get_url(self, url, savefile=None, mode='w'):
        page = None
        response = requests.get(url)
        if response.status_code == 200:
            page = response.content
            if savefile:
                # response.content is bytes, so pass mode='wb' when saving zips
                with savefile.open(mode) as fout:
                    fout.write(page)
        else:
            print(f'Cannot load URL: {url}')
        return page
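As an illustration (my example, not from the original post; the download URL here is made up, so substitute a real one from links.txt), a single fetch with that version might look like:

from pathlib import Path

sul = ScrapeUrlList()
# hypothetical catalog entry; use a real URL from links.txt
sul.get_url('https://www.flightsim.com/vbfs/fslib.php?do=download&fid=12345',
            savefile=Path('data/FlightSimFiles/example.zip'),
            mode='wb')  # 'wb' because response.content is bytes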
#44
Is it possible to have Python read the file where all the .zip file links are, and then download all the .zip files from all the links in the file? Or is that what you mean?
#45
That's code you need to write.
Just pass the URLs that are in the links.txt file, one at a time, to the new get_url (a sketch follows).
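A minimal sketch of that loop (untested): it assumes each line of links.txt is 'title, url' as get_catalog writes it, that the title is the zip's filename, and that get_url has the savefile/mode parameters from post #43.

def download_all(sul):
    """Feed every URL in links.txt to get_url, saving each zip."""
    with sul.fpath.links.open() as fp:
        for line in fp:
            # rsplit from the right: the URL contains no ', ', but a title might
            title, url = line.strip().rsplit(', ', 1)
            savefile = sul.fpath.flightsimpath / title  # assumes title is the zip's filename
            sul.get_url(url, savefile=savefile, mode='wb')


sul = ScrapeUrlList()
download_all(sul)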
#46
Hi Larz60+, I am not sure what to write in code, based on what you have told me
I need to do. Also, the links.txt file was created and the links were apparently obtained, but no links were written to that file. What could cause that to happen? Is it due to a missing Python module, and if so, which one? Or something missing from the ScrapeUrlList.py code? No traceback error is shown; the code just finishes obtaining the links, but no link data is written to the file in question.
#47
Quote:
Also the links.txt File was created and the links were obtained, but no links where written to that File
Did you change the searchid seed in the catalog URL (built in Fspaths.py and used by ScrapeUrlList.py)?
You need to log into FlightSim and get the base download page.
As I stated before, the seed changes for each session, so it must be passed to the program.
Output:
url = f'https://www.flightsim.com/vbfs/fslib.php?searchid=65893537&page={pageno}'
(the searchid value, 65893537 here, changes for each session)
For a quick fix, either replace that number each time or change the code to accept the seed as a command-line argument (see the sketch at the end of this post).
What really needs to be done: you must first and foremost get the password code that DeaD_EyE provided to work, and add it to the program.
Once you have done that, you can automatically link to the download page and fetch the seed.

Also, you don't have to write the URLs to a file; you can just call get_url and download each zip immediately.
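A minimal sketch of that command-line quick fix (untested; it builds on the searchid parameter sketched after post #41's code):

import sys


def main():
    # usage: python ScrapeUrlList.py <searchid>, e.g. 65893537
    if len(sys.argv) != 2:
        sys.exit('usage: ScrapeUrlList.py <searchid>')
    sul = ScrapeUrlList(searchid=sys.argv[1])
    sul.get_catalog()


if __name__ == '__main__':
    main()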
#48
Hi Larz60+,

I ran DeaD_EyE's code with the ScrapeUrlList.py code, and got the following traceback error:

Error:
Username: eddywinch82
Warning (from warnings module):
  File "C:\Python37\lib\getpass.py", line 100
    return fallback_getpass(prompt, stream)
GetPassWarning: Can not control echo on the terminal.
Warning: Password input may be echoed.
Password: duxforded1
Traceback (most recent call last):
  File "C:/Users/Edward/Desktop/Python 3.7/Combined Code.py", line 39, in <module>
    session = do_login(credentials)
  File "C:/Users/Edward/Desktop/Python 3.7/Combined Code.py", line 13, in do_login
    req = session.post(BASE_URL + LOGIN_PAGE, params={'do': 'login'}, data=credentials)
NameError: name 'LOGIN_PAGE' is not defined
#49
You already know that DeaD_EyE's code needs to be modified before it will work.
Don't try to bite off more than you can chew. Concentrate on getting his code to log in.
Once you've done that, move forward. You need to learn that to be successful at coding, you write code that does one thing, get that to work, then add another piece until you are there.
There is an alternative, and a valid one: if you don't want to do it yourself, post it in the Jobs section. You may find someone who wants to do it all for a small fee.

I may take another look at this in a few days. I just had surgery this morning, and am not able, or willing at this point, to work on this any more.

Get the password code to work, and read the error tracebacks:
Error:
  File "C:/Users/Edward/Desktop/Python 3.7/Combined Code.py", line 13, in do_login
    req = session.post(BASE_URL + LOGIN_PAGE, params={'do': 'login'}, data=credentials)
NameError: name 'LOGIN_PAGE' is not defined
This is telling you it couldn't find 'LOGIN_PAGE'. Go to the login page, and in your browser:

if Firefox, click on Tools --> Web Developer --> Page Source.
You can save the file and then examine it in your favorite editor.
Find out what the site is expecting for login, and modify the code to do what's required (a rough sketch follows).

If you're unfamiliar with HTML, take a basic tutorial (W3Schools is good) and learn what you have to know, but just searching will probably get you what you want.
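For orientation only, a login attempt with requests usually looks something like the sketch below. The URL is the one a later post settles on; the form field names (vb_login_username, vb_login_password) are my assumption about what a vBulletin form expects, so confirm them against the page source before relying on this:

import requests

BASE_URL = 'https://www.flightsim.com'
LOGIN_PAGE = '/vbfs/login.php?do=login'


def do_login(credentials):
    """Post the login form; return the session so its cookies persist."""
    session = requests.Session()
    # Field names below are assumed vBulletin defaults -- verify in the page source
    data = {
        'vb_login_username': credentials['username'],
        'vb_login_password': credentials['password'],
    }
    req = session.post(BASE_URL + LOGIN_PAGE, data=data)
    req.raise_for_status()
    return session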
#50
I realised what was missing from the code.

I now have the following line of code:

LOGIN_PAGE = 'https://www.flightsim.com/vbfs/login.php?do=login'

However, when I type my password and press Enter, after maybe 10 seconds

the message "Login Unsuccessful" appears. Any ideas, DeaD_EyE, what the issue is here?

I am considering putting this code in the Jobs section.
Is there anyone willing to sort it out for me, for a small fee? And if so, how much?

