Here's the deal: this site is very difficult to scrape.
The reason is that the download URL keeps changing (I would guess to prevent bots).
Try it yourself: the URL you gave me no longer works, though it did when posted.
This is taking too much of my time, and it's proving much more difficult because of the moving target.
Reluctantly, I can't spend any more time on it, at least not today (I have surgery in the AM, so I have to prepare for that).
I would suggest getting the auto password part Dead-eye gave you working first; then you can go to the first page and run the following to get all the links.
name this one: Fspaths.py
from pathlib import Path
import os


class Fspaths:
    def __init__(self):
        # Anchor all paths to the directory the script lives in
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        homepath = Path('.')
        self.datapath = homepath / 'data'
        self.datapath.mkdir(exist_ok=True)
        self.htmlpath = self.datapath / 'html'
        self.htmlpath.mkdir(exist_ok=True)
        self.flightsimpath = self.datapath / 'FlightSimFiles'
        self.flightsimpath.mkdir(exist_ok=True)
        self.page1_html = self.htmlpath / 'pagespan.html'
        self.links = self.flightsimpath / 'links.txt'
        # ScrapeUrlList expects this attribute; paste in the catalog URL
        # with a fresh searchid (see the note about the seed below)
        self.base_catalog_url = ''


if __name__ == '__main__':
    Fspaths()
and this one: ScrapeUrlList.py

import Fspaths
from bs4 import BeautifulSoup
import requests


class ScrapeUrlList:
    def __init__(self):
        self.fpath = Fspaths.Fspaths()
        self.ziplinks = []

    def get_url(self, url):
        # Fetch a page, returning its content, or None on failure
        page = None
        response = requests.get(url)
        if response.status_code == 200:
            page = response.content
        else:
            print(f'Cannot load URL: {url}')
        return page

    def get_catalog(self):
        # Walk the catalog pages, writing a 'name, url' line per zip link
        with self.fpath.links.open('w') as fp:
            baseurl = self.fpath.base_catalog_url
            for pageno in range(1, 254):
                # The paging parameter here is a guess; match it to
                # whatever the site's own page links actually use
                url = f'{baseurl}&page={pageno}'
                print(f'url: {url}')
                page = self.get_url(url)
                if page:
                    soup = BeautifulSoup(page, 'lxml')
                    zip_links = soup.find_all('div', class_="fsc_details")
                    for link in zip_links:
                        fp.write(f"{link.find('a').text}, "
                                 f"{baseurl}/{link.find('a').get('href')}\n")
                else:
                    print(f'No page: {url}')


def main():
    sul = ScrapeUrlList()
    sul.get_catalog()


if __name__ == '__main__':
    main()
The searchid is what changes, and you need to get a new seed (you can change the code to use it as an attribute; the base_catalog_url stub above is meant for that) before creating the download list.
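If it helps, here's a rough, untested sketch of grabbing a fresh seed from the first page. I'm assuming the seed shows up as something like 'searchid=123456' in the page source; adjust the pattern to whatever the site actually embeds:

import re
import requests


def get_searchid(first_page_url):
    # Fetch the first catalog page and pull the first 'searchid=<digits>'
    # out of the raw HTML. The parameter name 'searchid' is my assumption;
    # check the actual links on the page.
    response = requests.get(first_page_url)
    response.raise_for_status()
    match = re.search(r'searchid=(\d+)', response.text)
    return match.group(1) if match else None

You could then splice the returned value into base_catalog_url before running ScrapeUrlList.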
Then (not written) you need to use the created list to download the zip files.
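Something like this (also untested) could serve as a starting point for that last step. It assumes links.txt holds one 'name, url' pair per line, exactly as ScrapeUrlList writes them, and the zip file naming is just a placeholder:

import requests
import Fspaths


def download_zips():
    fpath = Fspaths.Fspaths()
    with fpath.links.open() as fp:
        for line in fp:
            # Split from the right so commas in the name don't eat the URL
            name, url = line.rsplit(', ', 1)
            url = url.strip()
            response = requests.get(url)
            if response.status_code == 200:
                # Placeholder naming; the link text may already end in .zip
                target = fpath.flightsimpath / f'{name}.zip'
                target.write_bytes(response.content)
            else:
                print(f'Cannot download: {url}')


if __name__ == '__main__':
    download_zips()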
This code will build a directory tree named 'data' in whatever directory you put the scripts.
The links file is created in a subdirectory named FlightSimFiles.