Posts: 7,237
Threads: 122
Joined: Sep 2016
(May-21-2018, 12:43 PM)eddywinch82 Wrote: what do I need to type, so that the Files download, with the proper .zip File name?
The code I posted with concurrent.futures was just a quick test to show how it can be done;
you shouldn't try to use concurrent.futures until everything works as it should first.
You have to parse the name as I did in post #12 of your other thread with the .utu files.
It's not so easy because you're still struggling with basic Python understanding.
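For reference, the idea is the same as with the .utu files: the proper file name is the link text, and the file id sits in the href. A minimal sketch with made-up markup (the real tag and attribute names appear in the full code later in this thread):
from bs4 import BeautifulSoup

# Hypothetical snippet of the table markup being scraped
html = '<table><tr><td class="text" colspan="2"><a href="download.php?fileid=42">A300_PAI.zip</a></td></tr></table>'
soup = BeautifulSoup(html, 'lxml')
link = soup.find('a')
zip_name = link.text                        # 'A300_PAI.zip' -> the proper local file name
file_id = link.get('href').split('=')[-1]   # '42' -> plugs into the download url
print(zip_name, file_id)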
Posts: 218
Threads: 27
Joined: May 2018
Thanks snippsat, I will look into that. Do you know what the Traceback errors I posted just before mean? And do you know of any other programs to increase download speeds in Python? I got Traceback errors when using Axel as well.
Posts: 7,237
Threads: 122
Joined: Sep 2016
May-21-2018, 11:25 PM
(This post was last modified: May-21-2018, 11:25 PM by snippsat.)
You can try this; I did look at downloading all the .zip files for all planes.
I let it run for about 5 minutes and had no errors.
So if this is a one-time operation, it may not be worth looking into concurrent.futures as I showed before.
Take a break for a couple of hours, and see if you have gotten all the zip files.
from bs4 import BeautifulSoup
import requests

def all_planes():
    '''Generate url links for all planes'''
    url = 'http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/acfiles.php?cat=6'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plain_link = [link.find('a').get('href') for link in td]
    for ref in plain_link:
        url_file_id = 'http://web.archive.org/web/20041114195147/http://www.projectai.com:80/libraries/{}'.format(ref)
        yield url_file_id

def download(all_planes):
    '''Download .zip files for one plane; feeding more urls downloads .zip files for all planes'''
    # A_300 = next(all_planes())  # Test with first link
    for plane_url in all_planes():
        url_get = requests.get(plane_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'http://web.archive.org/web/20041108022719/http://www.projectai.com:80/libraries/download.php?fileid={}'
        for item in td:
            zip_name = item.text  # link text is the proper .zip file name
            zip_number = item.find('a').get('href').split('=')[-1]  # fileid for the download url
            with open(zip_name, 'wb') as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)

if __name__ == '__main__':
    download(all_planes)  # pass the generator function; download() calls it
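Not needed for the code above to work, but two small hardening ideas for the download step, sketched under the same names: streaming the response in chunks keeps large .zip files out of memory, and skipping files that already exist makes a rerun cheap. The fetch_zip helper here is hypothetical, not part of the code above:
import os
import requests

def fetch_zip(down_url, zip_name):
    '''Hypothetical helper: download one .zip, skipping files already on disk'''
    if os.path.exists(zip_name):
        return  # already downloaded on an earlier run
    with requests.get(down_url, stream=True) as response:
        with open(zip_name, 'wb') as f_out:
            for chunk in response.iter_content(chunk_size=8192):
                f_out.write(chunk)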
Posts: 7,237
Threads: 122
Joined: Sep 2016
May-22-2018, 02:02 AM
(This post was last modified: May-22-2018, 02:03 AM by snippsat.)
For the code above, a progress bar can be nice to have, as I showed in your other thread.
So use tqdm.
Then it can be plugged into both loops.
Example:
from tqdm import tqdm

# Then in the two loops
for ref in tqdm(plain_link):
    ...
for item in tqdm(td):
    ...
Now you can see that it's 72 planes in total.
In plane 3, which is downloading now, there are 21 .zip files.
Of course the progress will jump a little, as some planes have more .zip files than others.
Plane 2 had 171 .zip files and plane 1 had 4 .zip files.
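For anyone who has not used tqdm before: wrapping an iterable is all it takes, and the total is inferred with len() when the iterable supports it. A standalone toy example (the sleep just stands in for download time):
from time import sleep
from tqdm import tqdm

links = ['plane_{}'.format(n) for n in range(72)]
for link in tqdm(links):  # renders e.g. 50/72 with a rate and ETA
    sleep(0.05)           # stand-in for the real download work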
Posts: 218
Threads: 27
Joined: May 2018
May-22-2018, 06:30 AM
(This post was last modified: May-22-2018, 06:55 AM by eddywinch82.)
Thanks for sorting this out for me snippsat, it's much appreciated. I actually managed to download quite a lot of these .zip files last night, after running one of the codes I already have. But it stopped downloading after a couple of hours. Maybe I was being blocked by an internet server? What do I need to do today, i.e. type in a code, to start downloading from the last .zip file downloaded, rather than downloading all of the downloaded .zip files again? I mean, can I put the last .zip file downloaded into the code, and then start downloading from that point?
Posts: 7,237
Threads: 122
Joined: Sep 2016
May-22-2018, 10:52 AM
(This post was last modified: May-22-2018, 10:52 AM by snippsat.)
(May-22-2018, 06:30 AM)eddywinch82 Wrote: ...to start downloading from the last .zip File downloaded, rather than downloading all of the downloaded .zip files again? I mean can I put in a code, the last .zip file downloaded, and then start downloading from that point?
Start over in a new folder with my code that has the progress bar; then let's say you got all the .zip files for 69 planes
before your connection broke down. Now you know that you're missing the last 3 planes.
In my code I am using yield url_file_id to generate urls for all planes,
so you can use itertools.islice to slice out the last 3 that are missing.
Code:
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm
from itertools import islice

def all_planes():
    '''Generate url links for all planes'''
    url = 'http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/acfiles.php?cat=6'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plain_link = [link.find('a').get('href') for link in td]
    for ref in tqdm(plain_link):
        url_file_id = 'http://web.archive.org/web/20041114195147/http://www.projectai.com:80/libraries/{}'.format(ref)
        yield url_file_id

def download(all_planes):
    '''Download .zip files for one plane; feeding more urls downloads all planes'''
    # A_300 = next(all_planes())  # Test with first link
    last_3 = islice(all_planes(), 69, 72)  # skip the 69 planes already downloaded
    for plane_url in last_3:
        url_get = requests.get(plane_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'http://web.archive.org/web/20041108022719/http://www.projectai.com:80/libraries/download.php?fileid={}'
        for item in tqdm(td):
            zip_name = item.text
            zip_number = item.find('a').get('href').split('=')[-1]
            with open(zip_name, 'wb') as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)

if __name__ == '__main__':
    download(all_planes)
Now looking at the progress bar: after 1 plane is downloaded it's at 97%, because we start at 69 and the total is 72.
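If that jump is confusing, tqdm can also be told explicitly where the count starts via its initial and total arguments; this is an optional tweak on the outer loop, not part of the code above. A self-contained sketch with a stand-in generator:
from itertools import islice
from tqdm import tqdm

def all_planes():
    '''Stand-in generator yielding 72 fake plane urls'''
    for n in range(72):
        yield 'plane_{}'.format(n)

# islice() already skipped the first 69 planes, so tell tqdm about them
last_3 = islice(all_planes(), 69, 72)
for plane_url in tqdm(last_3, initial=69, total=72):
    pass  # download as before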
Posts: 218
Threads: 27
Joined: May 2018
Thanks for that snippsat, how can I find out the total number of planes altogether? Then I can use your new code once I know how many I have downloaded. That part is easy: I can simply select all the files in the folder to find out the number of .zip files I have downloaded. It's the first part I need help with.
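Counting the .zip files already in the folder can also be done from Python instead of selecting them by hand; a one-liner with pathlib from the standard library:
from pathlib import Path

# Count .zip files in the current folder (where the script downloads to)
zip_count = len(list(Path('.').glob('*.zip')))
print(zip_count)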
Posts: 7,237
Threads: 122
Joined: Sep 2016
(May-22-2018, 11:55 AM)eddywinch82 Wrote: Thanks for that snippsat, how can I find out the total Number of Planes altogether?
It's 72; that should be clear from what I posted.
Remember that planes can differ in how many .zip files they have:
plane 1 has 4 and plane 2 has 171 .zip files.
It should be easy to see with my code: when it says 50/72,
it means that you have gotten all the .zip files for the first 50 planes and 22 are remaining.
Posts: 7,237
Threads: 122
Joined: Sep 2016
Also, with my code you can take it in steps; there is no need to download all 72 planes in one go.
Because of the islice method on the generator, you can start where you want.
The code below takes the first 10 planes.
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm
from itertools import islice

def all_planes():
    '''Generate url links for all planes'''
    url = 'http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/acfiles.php?cat=6'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    td = soup.find_all('td', width="50%")
    plain_link = [link.find('a').get('href') for link in td]
    for ref in tqdm(plain_link):
        url_file_id = 'http://web.archive.org/web/20041114195147/http://www.projectai.com:80/libraries/{}'.format(ref)
        yield url_file_id

def download(all_planes):
    '''Download .zip files for one plane; feeding more urls downloads all planes'''
    # A_300 = next(all_planes())  # Test with first link
    how_many_planes = islice(all_planes(), 0, 10)  # take the first 10 planes
    for plane_url in how_many_planes:
        url_get = requests.get(plane_url)
        soup = BeautifulSoup(url_get.content, 'lxml')
        td = soup.find_all('td', class_="text", colspan="2")
        zip_url = 'http://web.archive.org/web/20041108022719/http://www.projectai.com:80/libraries/download.php?fileid={}'
        for item in tqdm(td):
            zip_name = item.text
            zip_number = item.find('a').get('href').split('=')[-1]
            with open(zip_name, 'wb') as f_out:
                down_url = requests.get(zip_url.format(zip_number))
                f_out.write(down_url.content)

if __name__ == '__main__':
    download(all_planes)
As an example, the next 20 planes:
how_many_planes = islice(all_planes(), 10, 30)
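Note that islice uses the same half-open start/stop convention as ordinary slicing, so the stop index itself is not taken:
from itertools import islice

nums = range(72)
print(list(islice(nums, 10, 30)))  # items at index 10..29 -> the next 20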
Posts: 218
Threads: 27
Joined: May 2018
May-23-2018, 06:15 AM
(This post was last modified: May-23-2018, 06:22 AM by eddywinch82.)
Thank you so much snippsat, I ran your new Python code last night, and now I have downloaded all the planes and .zip files I need. Your help has been very much appreciated. Eddie