Code Needs finishing Off Help Needed - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Code Needs finishing Off Help Needed (/thread-10427.html)
RE: Code Needs finishing Off Help Needed - snippsat - May-21-2018

(May-21-2018, 12:43 PM)eddywinch82 Wrote: what do I need to type, so that the files download with the proper .zip file name?

The code I posted with concurrent.futures was just a quick test to show how it can be done; you should not try to use concurrent.futures until everything works as it should first. You have to parse the name as I did in your other post #12 with the .utu files. It's not so easy because you are still struggling with basic Python understanding.

RE: Code Needs finishing Off Help Needed - eddywinch82 - May-21-2018

Thanks snippsat, I will look into that. Do you know what the Traceback errors I posted just before mean? And do you know any other programs to increase download speeds in Python? I got Traceback errors when using Axel as well.

RE: Code Needs finishing Off Help Needed - snippsat - May-21-2018

You can try this; I did look at downloading all the .zip files for all planes. I let it run for about 5 minutes and had no errors. So if this is a one-time operation, it may not be worth looking into concurrent.futures as I showed before. Take a break for a couple of hours, and see if you have gotten all the zip files.

Code:
from bs4 import BeautifulSoup
import requests

def all_planes():
    '''Generate url links for all planes'''
    url = 'http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/acfiles.php?cat=6'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plain_link = [link.find('a').get('href') for link in td]
    for ref in plain_link:
        url_file_id = 'http://web.archive.org/web/20041114195147/http://www.projectai.com:80/libraries/{}'.format(ref)
        yield url_file_id

def download(all_planes):
    '''Download .zip files for one plane; feed with more urls to download .zip files for all planes'''
    # A_300 = next(all_planes())  # Test with first link
    for plane_url in all_planes():
        url_get = requests.get(plane_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'http://web.archive.org/web/20041108022719/http://www.projectai.com:80/libraries/download.php?fileid={}'
        for item in td:
            zip_name = item.text
            zip_number = item.find('a').get('href').split('=')[-1]
            with open(zip_name, 'wb') as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)

if __name__ == '__main__':
    download(all_planes)
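A note on zip_name in the code above: it is taken straight from the cell text of the page, so a name containing characters that are not valid in filenames (for example / or :) would make open() fail. A minimal sketch of a guard; the safe_name helper is a hypothetical addition, not part of the thread's code:

Code:
import re

def safe_name(name):
    '''Replace characters that are invalid in filenames and trim whitespace.'''
    # Hypothetical helper, not from the thread; replaces \ / * ? : " < > | with _
    return re.sub(r'[\\/*?:"<>|]', '_', name).strip()

print(safe_name('plane1.zip'))      # unchanged: 'plane1.zip'
print(safe_name('bad/name?.zip'))   # becomes: 'bad_name_.zip'

Inside the download loop it would be used as zip_name = safe_name(item.text).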
RE: Code Needs finishing Off Help Needed - snippsat - May-22-2018

For the code above, a progress bar can be fine to have, as I showed in your other thread. So use tqdm, then plug it into both loops. Example:

Code:
from tqdm import tqdm, trange

# Then in the 2 loops
for ref in tqdm(plain_link):
for item in tqdm(td):

Now you can see that it's 72 planes in total. In plane 3, which is downloading now, there are 21 .zip files. Of course the measure will jump a little, as some planes have more .zip files: plane 2 had 171 .zip files and plane 1 had 4 .zip files.

RE: Code Needs finishing Off Help Needed - eddywinch82 - May-22-2018

Thanks for sorting this out for me snippsat, it's much appreciated. I actually managed to download quite a lot of these .zip files last night, after running one of the codes I already have. But it stopped downloading after a couple of hours; maybe I was being blocked by an internet server? What do I need to do today, i.e. type in a code, to start downloading from the last .zip file downloaded, rather than downloading all of the downloaded .zip files again? I mean, can I put in a code the last .zip file downloaded, and then start downloading from that point?

RE: Code Needs finishing Off Help Needed - snippsat - May-22-2018

(May-22-2018, 06:30 AM)eddywinch82 Wrote: ...to start downloading from the last .zip file downloaded, rather than downloading all of the downloaded .zip files again? I mean can I put in a code, the last .zip file downloaded, and then start downloading from that point?

Start over in a new folder with my code that has the progress bar. Then let's say you got all the .zip files for 69 planes before your connection broke down; now you know that you are missing the last 3 planes. In my code I use yield url_file_id to generate urls for all planes, so you can use itertools.islice to slice out the last 3 that are missing.

Code:
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm, trange
from itertools import islice

def all_planes():
    '''Generate url links for all planes'''
    url = 'http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/acfiles.php?cat=6'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plain_link = [link.find('a').get('href') for link in td]
    for ref in tqdm(plain_link):
        url_file_id = 'http://web.archive.org/web/20041114195147/http://www.projectai.com:80/libraries/{}'.format(ref)
        yield url_file_id

def download(all_planes):
    '''Download .zip files for one plane; feed with more urls to download all planes'''
    # A_300 = next(all_planes())  # Test with first link
    last_3 = islice(all_planes(), 69, 72)
    for plane_url in last_3:
        url_get = requests.get(plane_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'http://web.archive.org/web/20041108022719/http://www.projectai.com:80/libraries/download.php?fileid={}'
        for item in tqdm(td):
            zip_name = item.text
            zip_number = item.find('a').get('href').split('=')[-1]
            with open(zip_name, 'wb') as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)

if __name__ == '__main__':
    download(all_planes)

Now looking at the progress bar: after 1 more plane is downloaded it's at 97%, because we start at 69 and the total is 72.
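Another way to make a rerun resumable, not from the thread but a common pattern, is to skip any .zip that already exists on disk before requesting it again. A minimal sketch; the helper name fetch_if_missing is hypothetical:

Code:
import os
import requests

def fetch_if_missing(zip_name, url):
    '''Download url to zip_name unless that file already exists on disk.

    Returns True if a download happened, False if the file was skipped.
    '''
    if os.path.exists(zip_name):
        return False  # already downloaded on an earlier run
    resp = requests.get(url)
    with open(zip_name, 'wb') as f_out:
        f_out.write(resp.content)
    return True

In the inner loop above it would replace the with open(...) block: fetch_if_missing(zip_name, zip_url.format(zip_number)). With that guard, a crashed run can simply be restarted from the beginning and it will fast-forward past everything it already has.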
RE: Code Needs finishing Off Help Needed - eddywinch82 - May-22-2018

Thanks for that snippsat; how can I find out the total number of planes altogether? Then I can use your new code once I know how many I have downloaded. That part is easy to do by simply selecting all the files in the folder, to find out the number of .zip files I have downloaded. It's the first part I need help with.

RE: Code Needs finishing Off Help Needed - snippsat - May-22-2018

(May-22-2018, 11:55 AM)eddywinch82 Wrote: Thanks for that snippsat, how can I find out the total number of planes altogether?

It's 72; it should be clear from what I posted. Remember that planes can differ in how many .zip files they have: plane 1 has 4 and plane 2 has 171. It should be easy to see with my code: when it says 50/72, it means that you have gotten all the .zip files for the first 50 planes and 22 are remaining.

RE: Code Needs finishing Off Help Needed - snippsat - May-22-2018

Also, with my code you can take it in steps; there is no need to download all 72 planes in one go. Because islice works on the yield-based generator, you can start where you want. The code below takes the first 10 planes.

Code:
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm, trange
from itertools import islice

def all_planes():
    '''Generate url links for all planes'''
    url = 'http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/acfiles.php?cat=6'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plain_link = [link.find('a').get('href') for link in td]
    for ref in tqdm(plain_link):
        url_file_id = 'http://web.archive.org/web/20041114195147/http://www.projectai.com:80/libraries/{}'.format(ref)
        yield url_file_id

def download(all_planes):
    '''Download .zip files for one plane; feed with more urls to download all planes'''
    # A_300 = next(all_planes())  # Test with first link
    how_many_planes = islice(all_planes(), 0, 10)
    for plane_url in how_many_planes:
        url_get = requests.get(plane_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'http://web.archive.org/web/20041108022719/http://www.projectai.com:80/libraries/download.php?fileid={}'
        for item in tqdm(td):
            zip_name = item.text
            zip_number = item.find('a').get('href').split('=')[-1]
            with open(zip_name, 'wb') as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)

if __name__ == '__main__':
    download(all_planes)

As an example, the next 20 planes:

Code:
how_many_planes = islice(all_planes(), 10, 30)

RE: Code Needs finishing Off Help Needed - eddywinch82 - May-23-2018

Thank you so much snippsat, I ran your new Python code last night, and now I have downloaded all the planes and .zip files I need. Your help has been very much appreciated. Eddie
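For reference, the total of 72 planes quoted in the thread can also be confirmed programmatically rather than read off the progress bar. A minimal sketch using the same archived index page and td width="50%" selector as the code above (assuming the Wayback Machine snapshot is still reachable):

Code:
from bs4 import BeautifulSoup
import requests

# Same index page and selector as the thread's all_planes() function.
url = ('http://web.archive.org/web/20041225023002/'
       'http://www.projectai.com:80/libraries/acfiles.php?cat=6')
soup = BeautifulSoup(requests.get(url).content, 'lxml')
planes = soup.find_all('td', width="50%")
print('Total planes:', len(planes))  # the thread reports 72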