Code Needs finishing Off Help Needed - Printable Version

Python Forum (https://python-forum.io)
Forum: Python Coding (https://python-forum.io/forum-7.html)
Sub-forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
Thread: Code Needs finishing Off Help Needed (/thread-10427.html)
Code Needs finishing Off Help Needed - eddywinch82 - May-20-2018

I need help finishing off the following code. It is not all my own work; I am not this good at programming, but had help with it yesterday:

```python
from bs4 import BeautifulSoup
import requests, wget, re, zipfile, io

def get_zips(link_root, zips_suffix):
    # e.g. 'http://web.archive.org/web/20050315112710/http://www.projectai.com:80/libraries/repaints.php?ac=89&cat=6'
    zips_page = link_root + zips_suffix
    zips_source = requests.get(zips_page).text
    zip_soup = BeautifulSoup(zips_source, "html.parser")
    for zip_file in zip_soup.select("a[href*='download.php?fileid=']"):
        zip_url = link_root + zip_file['href']
        print('downloading', zip_file.text, '...')
        r = requests.get(zip_url)
        with open(zip_file.text, 'wb') as zipFile:
            zipFile.write(r.content)

def download_links(root, cat):
    url = ''.join([root, cat])
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for zips_suffix in soup.select("a[href*='repaints.php?ac=']"):
        get_zips(root, zips_suffix['href'])

link_root = 'http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/'
# Example category; need to read all categories from the first page into a list and iterate them
category = 'acfiles.php?cat=6'
download_links(link_root, category)
```

This is the path for one of the aircraft categories:

http://web.archive.org/web/20050315112710/http://www.projectai.com:80/libraries/repaints.php?ac=89&cat=6

But there are several, and the last part of the path is always /repaints.php?ac=<two-digit number>&cat=6. What do I need to type to download all the .zip files from that page without typing each different ac=<two-digit number>&cat=6 suffix by hand? Any help would be much appreciated.

RE: Code Needs finishing Off Help Needed - snippsat - May-20-2018

Here are some hints.
```python
from bs4 import BeautifulSoup
import requests

url = 'http://web.archive.org/web/20041114195147/http://www.projectai.com:80/libraries/repaints.php?ac=89&cat=6'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
td = soup.find_all('td', class_="text", colspan="2")
```

Look at td:

```python
>>> td
[<td bgcolor="#FFFF99" class="text" colspan="2"><a href="download.php?fileid=6082">Texture.ABR-Air Contractors A30B_PW CONTRACT.zip</a> </td>,
 <td bgcolor="#FFCC99" class="text" colspan="2"><a href="download.php?fileid=6177">Texture.AHK-Air Hong Kong A300B AIR HONG KONG.zip</a> </td>,
 <td bgcolor="#FFFF99" class="text" colspan="2"><a href="download.php?fileid=6084">Texture.FPO-Europe Airpost A30B_GE FRENCH POST.zip</a> </td>,
 <td bgcolor="#FFCC99" class="text" colspan="2"><a href="download.php?fileid=7223">Texture.HDA-Dragonair Cargo A30BGE DRAGONAIR.zip</a> </td>]
>>> for h in td:
...     h.a.get('href')
...
'download.php?fileid=6082'
'download.php?fileid=6177'
'download.php?fileid=6084'
'download.php?fileid=7223'
```

So now you have all the fileids for download. The rest of the URL stays the same, so fileid=6082 is one .zip file and changing it to 6177 gives another.

RE: Code Needs finishing Off Help Needed - eddywinch82 - May-20-2018

I have adapted the working code for a different flight-sim website link, but when I run the code there are no traceback errors, yet on investigation no .zip files appear to be downloading (unless they are?). Have I gone wrong somewhere?
Here is the code:

```python
from bs4 import BeautifulSoup
import requests, wget, re, zipfile, io

def get_zips(link_root, zips_suffix):
    # e.g. 'http://web.archive.org/web/20050315112710/http://www.projectai.com:80/libraries/repaints.php?ac=89&cat=6'
    zips_page = link_root + zips_suffix
    zips_source = requests.get(zips_page).text
    zip_soup = BeautifulSoup(zips_source, "html.parser")
    for zip_file in zip_soup.select("a[href*='download_model.php?fileid=']"):
        zip_url = link_root + zip_file['href']
        print('downloading', zip_file.text, '...')
        r = requests.get(zip_url)
        with open(zip_file.text, 'wb') as zipFile:
            zipFile.write(r.content)

def download_links(root, cat):
    url = ''.join([root, cat])
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    td = soup.find_all('td', class_="text", colspan="2", bgcolour="#FFFF99",
                       href="download_model.php?fileid=")
    for h in td:
        h.a.get('href')

link_root = 'http://web.archive.org/web/20050308033321/http://www.projectai.com:80/packages/fde.php'
```

RE: Code Needs finishing Off Help Needed - snippsat - May-20-2018

You don't call the functions, so nothing will happen. download_links is missing its root and cat arguments, and it doesn't return anything.
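Snippsat's point can be seen in a stripped-down sketch (the scraping body is replaced with a placeholder print and a return value for illustration; the root and category strings are the ones from the posts): a def by itself runs nothing until the function is called.

```python
def download_links(root, cat):
    """Join root and cat into the page URL; the real scraping goes here."""
    url = ''.join([root, cat])
    print('fetching', url)
    return url  # returning something also makes the result inspectable

link_root = 'http://web.archive.org/web/20050308033321/http://www.projectai.com:80/packages/'
category = 'fde.php'

# This call is what the posted code is missing:
page = download_links(link_root, category)
```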
RE: Code Needs finishing Off Help Needed - eddywinch82 - May-20-2018

I see what you mean. With category = 'acfiles.php?cat=6' in the earlier code in mind, I tried category = 'fde.php', but as it wasn't a category it didn't download any .zip files. If cat stands for category, what word should be used for .php page names like fde.php? Also, I have installed the Axel download accelerator into Python, but I am not sure what to type to make it speed up all the .zip file downloads when I run the modules. I found the following on a website:

```python
from axel import axel

# Download http://someurl/file.zip with 500 parallel connections
file_path = axel('http://someurl/file.zip', num_connections=500)
```

RE: Code Needs finishing Off Help Needed - eddywinch82 - May-20-2018

```python
from bs4 import BeautifulSoup
import requests, wget, re, zipfile, io

def get_zips(link_root, zips_suffix):
    zips_page = link_root + zips_suffix
    zips_source = requests.get(zips_page).text
    zip_soup = BeautifulSoup(zips_source, "html.parser")
    for zip_file in zip_soup.select("a[href*='download.php?fileid=']"):
        zip_url = link_root + zip_file['href']
        print('downloading', zip_file.text, '...')
        r = requests.get(zip_url)
        with open(zip_file.text, 'wb') as zipFile:
            zipFile.write(r.content)

def download_links(root, cat):
    url = ''.join([root, cat])
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    td = soup.find_all('td', class_="text", colspan="2", bgcolour="#FFFF99",
                       href="download.php?fileid=")
    for h in td:
        h.a.get('href')
    for zips_suffix in soup.select("a[href*='repaints.php?ac=']"):
        get_zips(root, zips_suffix['href'])

link_root = 'http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/'
# Example category; need to read all categories from the first page into a list and iterate them
category = 'acfiles.php?cat=6'
download_links(link_root, category)
```

RE: Code Needs finishing Off Help Needed - snippsat - May-21-2018

(May-20-2018, 11:31 PM)eddywinch82 Wrote: need to read all categories from first page into a list and iterate categories

Example:

```python
from bs4 import BeautifulSoup
import requests

url = 'http://web.archive.org/web/20041225023002/http://www.projectai.com:80/libraries/acfiles.php?cat=6'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
td = soup.find_all('td', width="50%")
all_plains = [link.find('a').get('href') for link in td]
print(all_plains)
```

eddywinch82 Wrote: Also I have installed the Axel Download accelerator into Python, but I am not sure what to type to make it speed up all the .zip file downloads, when I run the modules ?

I have not heard of Axel. I have written my own tool for this, but that is a more advanced topic. I can show an example with concurrent.futures, which is what I like to use for this:

```python
import requests
import concurrent.futures

def download(number_id):
    a_zip = 'http://web.archive.org/web/20041205075703/http://www.projectai.com:80/packages/download_model.php?eula=1&fileid={}'.format(number_id)
    with open('{}.zip'.format(number_id), 'wb') as f:
        f.write(requests.get(a_zip).content)

if __name__ == '__main__':
    file_id = list(range(1, 50))
    with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
        for number_id in file_id:
            executor.submit(download, number_id)
```

Without it the downloads take 4-5 minutes; with it they take about 30 seconds.

RE: Code Needs finishing Off Help Needed - eddywinch82 - May-21-2018

I get that traceback, and then the .zip files start downloading at the normal speed, not quicker. What does the traceback text mean?
Eddie

Also, for one of the website links, I have the following code:

```python
from bs4 import BeautifulSoup
import requests, zipfile, io, concurrent.futures

def download(number_id):
    a_zip = 'http://web.archive.org/web/20050301025710//http://www.projectai.com:80/packages/download_model.php?eula=1&fileid={}'.format(number_id)
    with open('{}.zip'.format(number_id), 'wb') as f:
        f.write(requests.get(a_zip).content)

if __name__ == '__main__':
    file_id = list(range(1, 50))
    with concurrent.futures.ProcessPoolExecutor(max_workers=10) as executor:
        for number_id in file_id:
            executor.submit(download, number_id)

def get_zips(link_root, zips_suffix):
    zips_page = link_root + zips_suffix
    zips_source = requests.get(zips_page).text
    zip_soup = BeautifulSoup(zips_source, "html.parser")
    for zip_file in zip_soup.select("a[href*='download.php?fileid=']"):
        zip_url = link_root + zip_file['href']
        print('downloading', zip_file.text, '...')
        r = requests.get(zip_url)
        with open(zip_file.text, 'wb') as zipFile:
            zipFile.write(r.content)

def download_links(root, cat):
    url = ''.join([root, cat])
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")

link_root = 'http://web.archive.org/web/20050301025710/http://www.projectai.com:80/packages/'
category = 'fde.php'
download_links(link_root, category)
```

But the .zip files are not being saved with the proper .zip file names; they are being saved as 49.zip, 50.zip, 51.zip etc., and they say 0 bytes. Or is that because they haven't finished downloading?

Eddie
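One way to tell whether those 0-byte files come from failed requests is to gate the write on the response status and body size. A sketch (should_save and save_zip are hypothetical helper names, not from the thread):

```python
def should_save(status_code, content):
    """Only keep a download when the request succeeded and the body is non-empty."""
    return status_code == 200 and len(content) > 0

def save_zip(number_id, status_code, content):
    """Write fileid-named zips only for good responses; report the rest."""
    if not should_save(status_code, content):
        print('skipping', number_id, '- status', status_code, 'size', len(content))
        return False
    with open('{}.zip'.format(number_id), 'wb') as f:
        f.write(content)
    return True

# In the posted download() this would replace the unconditional f.write():
#     r = requests.get(a_zip)
#     save_zip(number_id, r.status_code, r.content)
```

With this in place, 0-byte files stop appearing silently; each skipped fileid is reported instead.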
RE: Code Needs finishing Off Help Needed - buran - May-21-2018

(May-21-2018, 12:34 PM)eddywinch82 Wrote: but .zip files are not being saved with the proper .zip File name, they are being saved as 49.zip 50.zip 51.zip etc and they say 0 bytes.
Or is that because they haven't finished downloading ? Eddie

No, it's because that is how you construct the file name in the download function (see line #6, the open('{}.zip'.format(number_id), 'wb') call).

RE: Code Needs finishing Off Help Needed - eddywinch82 - May-21-2018

What do I need to type so that the files download with the proper .zip file names?
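Following on from buran's answer: the saved name is whatever string you pass to open(), so the real names have to come from the page rather than from the fileid counter. A sketch, assuming (as in the get_zips code earlier in the thread) that the link text is the actual .zip file name; zip_name_from_link is a hypothetical helper:

```python
import re

def zip_name_from_link(link_text, fileid):
    """Prefer the page's link text as the file name; fall back to the fileid."""
    name = link_text.strip()
    if not name.lower().endswith('.zip'):
        # Link text is not a file name (e.g. just 'download'), use the id
        return '{}.zip'.format(fileid)
    # Replace characters that are unsafe in file names on common systems
    return re.sub(r'[\\/:*?"<>|]', '_', name)

print(zip_name_from_link('Texture.ABR-Air Contractors A30B_PW CONTRACT.zip ', 6082))
print(zip_name_from_link('download', 6177))
```

In the get_zips loop the first argument would be zip_file.text; in the fileid-based download() there is no link text available, which is exactly why those files come out as 49.zip, 50.zip, and so on.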