I wan't to Download all .zip Files From A Website (Project AI) - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: I wan't to Download all .zip Files From A Website (Project AI) (/thread-12450.html)
RE: I wan't to Download all .zip Files From A Website (Project AI) - snippsat - Aug-26-2018

(Aug-26-2018, 11:18 AM)eddywinch82 Wrote: How do I do that snippsat ? Thanks guys, for all your input.

Are you a member with a working username and password for that site? You can see in @DeaD_EyE's post #3 that he tries to log in. This can be hard to figure out for some sites. I would use Selenium to do the login if there is too much struggle with Requests, then give the page source to BeautifulSoup for parsing.

RE: I wan't to Download all .zip Files From A Website (Project AI) - eddywinch82 - Aug-26-2018

Hi guys. snippsat, I tried logging in with Selenium instead of Requests, i.e. I used import selenium, and I can't log in with that module either; I get the same error message I got when running the code with Requests. Also, I have tried running both your code and Larz60+'s code for getting the file path data etc., and both give a syntax error when I run them in Python. I am assuming the code worked for you both, in both cases? I have checked the code, and I have copied both versions correctly. Also, snippsat, you said "Or write a code that goes through all pages (simple page system 2, 3, 4, etc...) and download." How do I do that?

Hi snippsat, I have attempted to adapt the code you wrote for me a while back, for the Project AI website .zip files I wanted to download, but it hasn't worked. Where am I going wrong?
Here is the adapted code:

from bs4 import BeautifulSoup
import requests
from tqdm import tqdm, trange
from itertools import islice

def all_planes():
    '''Generate url links for all planes'''
    url = 'https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plain_link = [link.find('a').get('href') for link in td]
    for ref in tqdm(plain_link):
        url_file_id = 'https://www.flightsim.com/vbfs/fslib.php?searchid=65857709{}'.format(ref)
        yield url_file_id

def download(all_planes_pages):
    '''Download zip for 1 plain,feed with more url download all planes'''
    # A_300 = next(all_planes())  # Test with first link
    last_253 = islice(all_planes_pages(), 0, 253)
    for plane_page_url in last_253:
        url_get = requests.get(plane_page_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'https://www.flightsim.com/vbfs/fslib.php?do=copyright&fid={}'
        for item in tqdm(td):
            zip_name = item.text
            zip_number = item.find('a').get('href').split('=')[-1]
            with open(zip_name, 'wb') as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)

if __name__ == '__main__':
    download(all_planes_pages)

Eddie

RE: I wan't to Download all .zip Files From A Website (Project AI) - Larz60+ - Aug-26-2018

It worked for me, but today I see that the same URL is no longer accessible without a password, so someone has tightened the security. Scraping is always touchy, and what works today often will not work tomorrow. If you haven't done so already, you should (need to?) run through snippsat's tutorials here: part1 part2

RE: I wan't to Download all .zip Files From A Website (Project AI) - eddywinch82 - Aug-26-2018

Hi Guys, I combined code I found from someone on the Internet for web-scraping ZIP files.
With your code, DeaD_EyE. Here is the combined code:

import sys
import getpass
import hashlib
import requests

BASE_URL = 'https://www.flightsim.com/'

def do_login(credentials):
    session = requests.Session()
    session.get(BASE_URL)
    req = session.post(BASE_URL + LOGIN_PAGE, params={'do': 'login'},
                       data=credentials)
    if req.status_code != 200:
        print('Login not successful')
        sys.exit(1)
    # session is now logged in
    return session

def get_credentials():
    username = input('Username: ')
    password = getpass.getpass()
    password_md5 = hashlib.md5(password.encode()).hexdigest()
    return {
        'cookieuser': 1,
        'do': 'login',
        's': '',
        'securitytoken': 'guest',
        'vb_login_md5_password': password_md5,
        'vb_login_md5_password_utf': password_md5,
        'vb_login_password': '',
        'vb_login_password_hint': 'Password',
        'vb_login_username': username,
    }

credentials = get_credentials()
session = do_login()

import urllib2
from urllib2 import Request, urlopen, URLError
#import urllib
import os
from bs4 import BeautifulSoup

#Create a new directory to put the files into
#Get the current working directory and create a new directory in it named test
cwd = os.getcwd()
newdir = cwd + "\\test"
print "The current Working directory is " + cwd
os.mkdir(newdir, 0777);
print "Created new directory " + newdir
newfile = open('zipfiles.txt', 'w')
print newfile
print "Running script.. "

#Set variable for page to be open and url to be concatenated
url = "http://www.flightsim.com"
page = urllib2.urlopen('https://www.flightsim.com/vbfs/fslib.php?do=search&fsec=62').read()

#File extension to be looked for.
extension = ".zip"

#Use BeautifulSoup to clean up the page
soup = BeautifulSoup(page)
soup.prettify()

#Find all the links on the page that end in .zip
for anchor in soup.findAll('a', href=True):
    links = url + anchor['href']
    if links.endswith(extension):
        newfile.write(links + '\n')
newfile.close()

#Read what is saved in zipfiles.txt and output it to the user
#This is done to create persistent data
newfile = open('zipfiles.txt', 'r')
for line in newfile:
    print line + '/n'
newfile.close()

#Read through the lines in the text file and download the zip files.
#Handle exceptions and print exceptions to the console
with open('zipfiles.txt', 'r') as url:
    for line in url:
        if line:
            try:
                ziplink = line
                #Removes the first 48 characters of the url to get the name of the file
                zipfile = line[48:]
                #Removes the last 4 characters to remove the .zip
                zipfile2 = zipfile[:3]
                print "Trying to reach " + ziplink
                response = urllib2.urlopen(ziplink)
            except URLError as e:
                if hasattr(e, 'reason'):
                    print 'We failed to reach a server.'
                    print 'Reason: ', e.reason
                    continue
                elif hasattr(e, 'code'):
                    print 'The server couldn\'t fulfill the request.'
                    print 'Error code: ', e.code
                    continue
            else:
                zipcontent = response.read()
                completeName = os.path.join(newdir, zipfile2 + ".zip")
                with open(completeName, 'w') as f:
                    print "downloading.. " + zipfile
                    f.write(zipcontent)
                    f.close()

print "Script completed"

But I get a Traceback error. The code runs OK initially, allowing me to type my username, but the error appears after I hit Enter. Any ideas where I am going wrong?

Eddie

RE: I wan't to Download all .zip Files From A Website (Project AI) - Larz60+ - Aug-26-2018

Line 22 is where you input your user name (I removed the actual username):

username = input('Username: ')

This is where the traceback is showing the error; the last line number is usually where the error is encountered. However, I don't see an issue here. One last note.
If you must use Python 2, would you at least put the print statements in parentheses?

RE: I wan't to Download all .zip Files From A Website (Project AI) - eddywinch82 - Aug-26-2018

My username is eddywinch82; where do I type that on line 22? Should I type:

eddywinch82 = input('Username: ')

RE: I wan't to Download all .zip Files From A Website (Project AI) - Larz60+ - Aug-26-2018

You enter it in real time, while running the script.

RE: I wan't to Download all .zip Files From A Website (Project AI) - eddywinch82 - Aug-26-2018

That's what I was doing. Do you have an idea what the issue is here?

RE: I wan't to Download all .zip Files From A Website (Project AI) - Larz60+ - Aug-26-2018

You're using antique Python; it's raw_input.

RE: I wan't to Download all .zip Files From A Website (Project AI) - eddywinch82 - Aug-26-2018

I was using Python 3.4.3 before, and the same problem was occurring then.
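[Editor's note] For reference, the login-then-scrape flow discussed in this thread can be sketched in Python 3, which avoids the input()/raw_input confusion entirely. This is only a sketch: the form fields mirror the vBulletin payload from DeaD_EyE's code earlier in the thread, but the login path ('login.php') and the 'page' query parameter are assumptions about the site's URL scheme, not verified against flightsim.com.

```python
# Sketch: log in with a requests.Session, walk the paged search results,
# and save every .zip link found. Assumptions (not verified against the
# real site): the 'login.php' path and the 'page' query parameter.
import getpass
import hashlib
import os

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.flightsim.com/'
SEARCH_URL = BASE_URL + 'vbfs/fslib.php'


def login(username, password):
    """Return a logged-in Session (vBulletin posts an md5 of the password)."""
    password_md5 = hashlib.md5(password.encode()).hexdigest()
    session = requests.Session()
    session.get(BASE_URL)  # pick up the initial cookies
    resp = session.post(BASE_URL + 'login.php', params={'do': 'login'}, data={
        'cookieuser': 1,
        'do': 'login',
        's': '',
        'securitytoken': 'guest',
        'vb_login_md5_password': password_md5,
        'vb_login_md5_password_utf': password_md5,
        'vb_login_password': '',
        'vb_login_username': username,
    })
    resp.raise_for_status()
    return session


def page_urls(first, last):
    """Build one search-results URL per page (assumed 'page' parameter)."""
    return ['{}?do=search&fsec=62&page={}'.format(SEARCH_URL, n)
            for n in range(first, last + 1)]


def zip_links(html, base=BASE_URL):
    """Extract absolute .zip links from one page of HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return [base.rstrip('/') + '/' + a['href'].lstrip('/')
            for a in soup.find_all('a', href=True)
            if a['href'].endswith('.zip')]


def download_all(session, first, last, dest='zips'):
    """Fetch each results page and save every .zip file it links to."""
    os.makedirs(dest, exist_ok=True)
    for page in page_urls(first, last):
        for link in zip_links(session.get(page).text):
            name = link.rsplit('/', 1)[-1]
            with open(os.path.join(dest, name), 'wb') as f_out:
                f_out.write(session.get(link).content)


# Interactive driver (prompts for credentials; uncomment to run):
#   session = login(input('Username: '), getpass.getpass())
#   download_all(session, first=1, last=3)
```

In Python 3, input() always returns a string, so the NameError that ends this thread cannot occur; the page loop is the "simple page system 2, 3, 4, etc." idea snippsat describes, with the actual parameter name left as an assumption.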