 Cannot make this image downloader work
#1
Hi,
I am trying to make a map downloader. I have everything working except the last for loop, and I have been working on this for a long time. Here is the code:


import requests
import bs4 as bs
import urllib.request

url = str(input('URL: '))

opener = urllib.request.build_opener()
opener.add_headers = [{'User-Agent' : 'Mozilla'}]
urllib.request.install_opener(opener)

raw = requests.get(url).text
soup = bs.BeautifulSoup(raw, 'html.parser')
imgs = soup.find_all ('img')
links = []
for img in imgs:
    link = img.get('scr')
    #if 'http://' not in link:
        #link = url + link
    links.append(link)

print('Image detected: ' + str(len(links)))

for i in range(len(links)):
    filename = str(img.jpg) .format(i)
    urllib.request.urlretrieve(links[i], filename)
    print('Done!')


Here is the error:



URL: http://legacy.lib.utexas.edu/maps/topo/indiana/
Image detected: 35
Traceback (most recent call last):
File "C:\Users\Kite\Desktop\scraping Images\TUT_7\test_1.py", line 25, in <module>
urllib.request.urlretrieve(links[i], filename)
File "C:\Python36\lib\urllib\request.py", line 246, in urlretrieve
url_type, path = splittype(url)
File "C:\Python36\lib\urllib\parse.py", line 954, in splittype
match = _typeprog.match(url)
TypeError: expected string or bytes-like object

It looks like I need to turn something into a string. If anyone can give me a hand, that would be nice.
#2
There are several problems here, so it's not even close to working.
Before writing more code, you should test that what you get back is actually usable.
print() always works as a quick test; here I use pprint() instead, which makes the content easier to read.
import requests
import bs4 as bs
import urllib.request
from pprint import pprint

url = 'http://legacy.lib.utexas.edu/maps/topo/indiana/'
opener = urllib.request.build_opener()
opener.add_headers = [{'User-Agent' : 'Mozilla'}]
urllib.request.install_opener(opener)
raw = requests.get(url).text
soup = bs.BeautifulSoup(raw, 'html.parser')
imgs = soup.find_all ('img')
pprint(imgs)
Output:
[<img alt="The University of Texas" src="/images/globalHeaderFooter/university_seal_informal.png"/>, <img alt="The University of Texas" src="/images/globalHeaderFooter/UT_Libraries_RGB_inf_brand_b2-ac.svg"/>, <img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>, <img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>, <img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>, <img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>, <img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>, ..... ect
As you can see, these are not the map image links that you want.

So to help, I'll rewrite the start, since this is not usable.
I have also thrown urllib away, as it should not be used for this anyway.
import requests
from bs4 import BeautifulSoup

url = 'http://legacy.lib.utexas.edu/maps/topo/indiana/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
maps = soup.select_one('#actualcontent > ul')
map_link = maps.find_all('a')
for link in map_link:
    print(link.get('href'))
Output:
http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg http://legacy.lib.utexas.edu/maps/topo/indiana/txu-pclmaps-topo-in-bedford-1934.jpg http://legacy.lib.utexas.edu/maps/topo/illinois/txu-pclmaps-topo-il-birds-1914.jpg http://legacy.lib.utexas.edu/maps/topo/indiana/txu-pclmaps-topo-in-bloomington-1908.jpg ..... ect
So now you can try to figure out how to download these image links, and you do not need to import urllib for this.
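As a hint for the download part: requests can fetch the image bytes too, and `.content` is what you write to disk. A minimal sketch for a single image (the helper name is mine; the URL is just the first map link from the output above):

```python
import requests

def save_image(img_url, filename):
    """Fetch one image with requests and write the raw bytes to disk."""
    response = requests.get(img_url)
    response.raise_for_status()  # stop early on a bad status code
    with open(filename, 'wb') as f_out:
        f_out.write(response.content)  # .content is the response body as bytes

# For example, with the first map link from the output above:
# save_image('http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg',
#            'txu-pclmaps-topo-in-index-1925.jpg')
```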
#3
Thank you snippsat. Before I start working on the download part, I want to understand the code you gave me. I will be back.
#4
Ok, I have been working on this half the night and just can't get it to work. I am not sure how to download all the maps. There will be many files; I can download one map, but that is it. Here is what I have been working with, the last of many attempts.

import requests
from bs4 import BeautifulSoup
 
url = 'http://legacy.lib.utexas.edu/maps/topo/indiana/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
maps = soup.select_one('#actualcontent > ul')
map_link = maps.find_all('a')
for link in map_link:
    print(link.get('href'))
for map in maps:
    with open("maps.jpg", "wb") as file:
        file.write(response.content)
file.close    


I have the script in its own folder, so the maps should be saved in the folder the script is in. Maybe a while loop would work better. I am just lost on saving the images.
#5
(Jun-23-2020, 11:50 AM)Blue Dog Wrote: I have the script in its own folder, so the maps should be saved in the folder the script is in. Maybe a while loop would work better. I am just lost on saving the images.
The loop is already done in my code, so inside it you can use os.path.basename to get the correct names for the images when saving.
Then, to get the content (bytes) of the images, you also need to open the links with Requests; then you can save them.
Here I have also put in a progress bar with tqdm.
import requests
from bs4 import BeautifulSoup
import os
# pip install tqdm
from tqdm import tqdm

url = 'http://legacy.lib.utexas.edu/maps/topo/indiana/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
maps = soup.select_one('#actualcontent > ul')
map_link = maps.find_all('a')[:-1]
for link in tqdm(map_link):
    img_name = os.path.basename(link.get('href'))
    #print(img_name)
    img = requests.get(link.get('href'))
    with open(img_name, 'wb') as f_out:
        f_out.write(img.content)
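A side note: for very large map files you can let requests stream the body instead of holding the whole image in memory at once. A sketch of the same save step with streaming, using the same basename idea (the helper name and chunk size are my own choices):

```python
import os
import requests

def save_image_streamed(img_url, chunk_size=8192):
    """Stream one image to disk in chunks instead of loading it all at once."""
    img_name = os.path.basename(img_url)
    with requests.get(img_url, stream=True) as response:
        response.raise_for_status()
        with open(img_name, 'wb') as f_out:
            # iter_content yields the response body piece by piece
            for chunk in response.iter_content(chunk_size=chunk_size):
                f_out.write(chunk)
    return img_name
```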
#6
WoW, it works great. I did not have to install tqdm, so it must have already been installed.
I looked up the docs for os.path.basename: "Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split()." What does that mean?
*****************************************
img = requests.get(link.get('href'))
This is a get request for all the links with 'href'.
****************************************************
with open(img_name, 'wb') as f_out:
I think this opens a file that you can put img_name in.
**************************************************************
f_out.write(img.content)
This writes the img to the file.

If I am wrong on any of the lines, let me know. I see how you named the files to be downloaded; that was one of the big problems I had. I was thinking each file needed a new name.

Thank you so much. I do a lot of metal detecting, and I have been making small programs to help me get stuff off the net.

snippsat, I just downloaded your tutorial on scraping. Will read it tonight.
Thanks
#7
(Jun-23-2020, 07:12 PM)Blue Dog Wrote: I looked up the docs for os.path.basename: "Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split()." What does that mean?
It helps to use the interactive shell to test stuff like this; a better REPL like ptpython or IPython also helps.
>>> import os

>>> help(os.path.basename)
Help on function basename in module ntpath:

basename(p)
    Returns the final component of a pathname

>>> url = 'http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg'
>>> os.path.basename(url)
'txu-pclmaps-topo-in-index-1925.jpg'
So it's simple functionality; it's not hard to write this yourself.
>>> url = 'http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg'
>>> url.split('/')[-1]
'txu-pclmaps-topo-in-index-1925.jpg'
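pathlib in the standard library gives the same result; PurePosixPath fits URLs, since they always use forward slashes:

```python
from pathlib import PurePosixPath

url = 'http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg'
# .name is the final component of the path, i.e. the file name
print(PurePosixPath(url).name)
# txu-pclmaps-topo-in-index-1925.jpg
```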
Blue Dog Wrote: img = requests.get(link.get('href'))
this is a get request for all links with 'href'
No, the links are already found with find_all('a').
href gets the bare image link out of the link tag that was found.
>>> map_link[0]
<a href="http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg">Indiana - Topographic Map Index 1925</a>
>>> map_link[0].get('href')
'http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg'
Blue Dog Wrote:with open(img_name, 'wb') as f_out:
I think this opens a file that you can put img_name in.
**************************************************************
f_out.write(img.content)
This writes the img to the file
Thumbs Up
