Can not make this image downloader work

Can not make this image downloader work - Blue Dog - Jun-22-2020

Hi,
I am trying to make a map downloader. I got everything working except for the last for loop. I have been working on this for a long time. Here is the code:

import requests
import bs4 as bs
import urllib.request

url = str(input('URL: '))

opener = urllib.request.build_opener()
opener.add_headers = [{'User-Agent' : 'Mozilla'}]
urllib.request.install_opener(opener)

raw = requests.get(url).text
soup = bs.BeautifulSoup(raw, 'html.parser')
imgs = soup.find_all ('img')
links = []
for img in imgs:
    link = img.get('scr')
    #if 'http://' not in link:
        #link = url + link
    links.append(link)

print('Image detected: ' + str(len(links)))

for i in range(len(links)):
    filename = str(img.jpg) .format(i)
    urllib.request.urlretrieve(links[i], filename)
    print('Done!')

Here is the error:

URL: http://legacy.lib.utexas.edu/maps/topo/indiana/
Image detected: 35
Traceback (most recent call last):
File "C:\Users\Kite\Desktop\scraping Images\TUT_7\test_1.py", line 25, in <module>
urllib.request.urlretrieve(links[i], filename)
File "C:\Python36\lib\urllib\request.py", line 246, in urlretrieve
url_type, path = splittype(url)
File "C:\Python36\lib\urllib\parse.py", line 954, in splittype
match = _typeprog.match(url)
TypeError: expected string or bytes-like object

It looks like I need to turn something into a string. If anyone can give me a hand, that would be nice.

RE: Can not make this image downloader work - snippsat - Jun-22-2020

There are several problems here, so it is not even close to working.

Before writing more code, you must test that what you get back is actually usable.
print() always works as a fast test; here I use pprint(), which makes the content easier to look at.

import requests
import bs4 as bs
import urllib.request
from pprint import pprint

url = 'http://legacy.lib.utexas.edu/maps/topo/indiana/'
opener = urllib.request.build_opener()
opener.add_headers = [{'User-Agent' : 'Mozilla'}]
urllib.request.install_opener(opener)
raw = requests.get(url).text
soup = bs.BeautifulSoup(raw, 'html.parser')
imgs = soup.find_all ('img')
pprint(imgs)

Output:
[<img alt="The University of Texas" src="/images/globalHeaderFooter/university_seal_informal.png"/>,
 <img alt="The University of Texas" src="/images/globalHeaderFooter/UT_Libraries_RGB_inf_brand_b2-ac.svg"/>,
 <img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>,
 <img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>,
 <img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>,
 <img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>,
 <img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>,
..... etc.

As you can see, these are not the image links of the maps that you want.
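
A side note on the TypeError from the first post: the tags in the output above carry a src attribute, while the original loop asked for 'scr'. Tag.get() returns None for an attribute that does not exist, so the links list was filled with None values, and urlretrieve() was then handed None instead of a string. A minimal sketch using one tag copied from the output above (the variable names wrong and right are only for illustration):

```python
from bs4 import BeautifulSoup

# One tag copied from the pprint output above.
html = '<img alt="" height="3" src="http://legacy.lib.utexas.edu/graphics/orange.gif" width="5"/>'
img = BeautifulSoup(html, 'html.parser').find('img')

wrong = img.get('scr')  # misspelled attribute name: no such attribute, so None
right = img.get('src')  # the real attribute name: returns the link as a string

print(wrong)
print(right)
```

So the fix for that particular error is the spelling of the attribute, not a conversion to string.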

So I can help write the start, as what you have is not usable.
I am going to throw away urllib, as it should not be used here anyway.

import requests
from bs4 import BeautifulSoup

url = 'http://legacy.lib.utexas.edu/maps/topo/indiana/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
maps = soup.select_one('#actualcontent > ul')
map_link = maps.find_all('a')
for link in map_link:
    print(link.get('href'))

Output:
http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg
http://legacy.lib.utexas.edu/maps/topo/indiana/txu-pclmaps-topo-in-bedford-1934.jpg
http://legacy.lib.utexas.edu/maps/topo/illinois/txu-pclmaps-topo-il-birds-1914.jpg
http://legacy.lib.utexas.edu/maps/topo/indiana/txu-pclmaps-topo-in-bloomington-1908.jpg
..... etc.

So now you can try to figure out how to download these image links, and you do not need to import urllib for this.
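
The map links printed above are already absolute, but some src values in the earlier img output were relative (for example /images/globalHeaderFooter/university_seal_informal.png). If a page hands back relative links, one possible way to complete them is urljoin, which is safer than gluing strings together with +. This sketch only uses urllib.parse to build URLs; the downloading itself would still be done with Requests. The names page, relative_result and absolute_result are only for illustration.

```python
from urllib.parse import urljoin

page = 'http://legacy.lib.utexas.edu/maps/topo/indiana/'

# A relative link is resolved against the page address.
relative_result = urljoin(page, '/images/globalHeaderFooter/university_seal_informal.png')

# An absolute link is returned unchanged.
absolute_result = urljoin(
    page,
    'http://legacy.lib.utexas.edu/maps/topo/indiana/txu-pclmaps-topo-in-bedford-1934.jpg',
)

print(relative_result)
print(absolute_result)
```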

RE: Can not make this image downloader work - Blue Dog - Jun-22-2020

Thank you, snippsat. Before I start working on the download part, I want to understand the code you gave me. I will be back.

RE: Can not make this image downloader work - Blue Dog - Jun-23-2020

OK, I have been working on this half the night and just can't get it to work. I am not sure how to download all the maps. There will be many files; I can download one map, but that is it. Here is what I have been working with, the latest of many attempts.

import requests
from bs4 import BeautifulSoup
 
url = 'http://legacy.lib.utexas.edu/maps/topo/indiana/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
maps = soup.select_one('#actualcontent > ul')
map_link = maps.find_all('a')
for link in map_link:
    print(link.get('href'))
for map in maps:
    with open("maps.jpg", "wb") as file:
        file.write(response.content)
file.close

I have the script in its own folder so the maps should be saved in the folder that the script is in. Maybe a while loop will work better. I am just lost on saving the images.

RE: Can not make this image downloader work - snippsat - Jun-23-2020

(Jun-23-2020, 11:50 AM)Blue Dog Wrote: I have the script in its own folder so the maps should be saved in the folder that the script is in. Maybe a while loop will work better. I am just lost on saving the images.

The loop is already done in my code, so inside that loop you can use os.path.basename to get the correct image names when saving.
To get the content (bytes) of each image, you also need to open its link with Requests; then you can save it.
Here I also put in a progress bar with tqdm.

import requests
from bs4 import BeautifulSoup
import os
# pip install tqdm
from tqdm import tqdm

url = 'http://legacy.lib.utexas.edu/maps/topo/indiana/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
maps = soup.select_one('#actualcontent > ul')
map_link = maps.find_all('a')[:-1]
for link in tqdm(map_link):
    img_name = os.path.basename(link.get('href'))
    #print(img_name)
    img = requests.get(link.get('href'))
    with open(img_name, 'wb') as f_out:
        f_out.write(img.content)
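
For comparison with the earlier attempt: that loop opened "maps.jpg" on every pass, so each write replaced the one before it, and it wrote response.content, which is the HTML of the index page rather than image bytes. The with block also closes the file by itself, so no separate close call is needed. The sketch below shows only the naming difference, without any network access; it writes placeholder bytes into temporary folders, and the three links are copied from the output earlier in the thread.

```python
import os
import tempfile

links = [
    'http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg',
    'http://legacy.lib.utexas.edu/maps/topo/indiana/txu-pclmaps-topo-in-bedford-1934.jpg',
    'http://legacy.lib.utexas.edu/maps/topo/illinois/txu-pclmaps-topo-il-birds-1914.jpg',
]

with tempfile.TemporaryDirectory() as folder:
    # Same file name on every pass: each write replaces the previous file.
    for link in links:
        with open(os.path.join(folder, 'maps.jpg'), 'wb') as f_out:
            f_out.write(b'placeholder bytes')
    same_name_count = len(os.listdir(folder))

with tempfile.TemporaryDirectory() as folder:
    # File name taken from the link: one file per link.
    for link in links:
        with open(os.path.join(folder, os.path.basename(link)), 'wb') as f_out:
            f_out.write(b'placeholder bytes')
    per_link_count = len(os.listdir(folder))

print(same_name_count, per_link_count)
```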

RE: Can not make this image downloader work - Blue Dog - Jun-23-2020

Wow, works great. I did not have to install tqdm, so it must have been installed already.
I looked up the docs for os.path.basename: "Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split()." What does that mean?
*****************************************
img = requests.get(link.get('href'))
This is a GET request for all links with 'href'.
****************************************************
with open(img_name, 'wb') as f_out:
I think this opens a file that you can put img_name in.
**************************************************************
f_out.write(img.content)
This writes the image to the file.

If I am wrong on any of these lines, let me know. I see how you name the files to be downloaded. That was one of the big problems I had: I was thinking each file needed a new name.

Thank you so much. I do a lot of metal detecting, and I have been making small programs to help me get stuff off the net.

snippsat, I just downloaded your tutorial on scraping. I will read it tonight.
Thanks

RE: Can not make this image downloader work - snippsat - Jun-23-2020

(Jun-23-2020, 07:12 PM)Blue Dog Wrote: I looked up the docs for os.path.basename: "Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split()." What does that mean?

It helps to use an interactive shell to test things like this; a better REPL like ptpython or IPython also helps.

>>> import os

>>> help(os.path.basename)
Help on function basename in module ntpath:

basename(p)
    Returns the final component of a pathname

>>> url = 'http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg'
>>> os.path.basename(url)
'txu-pclmaps-topo-in-index-1925.jpg'

So it is simple functionality, and it is not hard to write this yourself.

>>> url = 'http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg'
>>> url.split('/')[-1]
'txu-pclmaps-topo-in-index-1925.jpg'
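
Both approaches above work for these map links. As a hedged aside, a link that carries a query string would keep that query in the name with a plain split('/'). One possible variant parses the URL first; filename_from_url is a hypothetical helper name, and the second URL below (with ?size=large) is made up for illustration and is not a real link from the site.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path component of a URL, ignoring query and fragment."""
    path = urlsplit(url).path
    return posixpath.basename(unquote(path))


plain = 'http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg'
with_query = plain + '?size=large'  # made-up query string, for illustration only

print(filename_from_url(plain))
print(filename_from_url(with_query))
print(with_query.split('/')[-1])  # the plain split keeps the query string
```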

Blue Dog Wrote: img = requests.get(link.get('href'))
This is a GET request for all links with 'href'.

No, the links are already found with find_all('a').
href is used to get the bare image link out of each link tag found.

>>> map_link[0]
<a href="http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg">Indiana - Topographic Map Index 1925</a>
>>> map_link[0].get('href')
'http://legacy.lib.utexas.edu/maps/topo/indexes/txu-pclmaps-topo-in-index-1925.jpg'

Blue Dog Wrote: with open(img_name, 'wb') as f_out:
I think this opens a file that you can put img_name in.
**************************************************************
f_out.write(img.content)
This writes the image to the file.