Python Forum
HTTPError: Forbidden when trying to download an image
#1
I want to download pictures from wallhaven.cc. I can get the picture URL, but the image does not download; it gives an error.

My code is:

import urllib.request
import requests
from bs4 import BeautifulSoup

imdbUrl="htt"+"ps:"+"//alpha.wallhaven.cc"+"/random?page=4"
r=requests.get(imdbUrl)

soup=BeautifulSoup(r.content,"html.parser")

kelimeler=soup.find_all("img",{"class":"lazyload"})

say=0
for i in kelimeler:
    say +=1
    # build the full-size image URL from the thumbnail URL
    url=str(i['data-src'])
    url=url.replace("alpha","wallpapers")
    url=url.replace("/thumb/small/th-","/full/wallhaven-")
    url=url.replace("https","http")
    yeniad=str(say)+".jpg"
    url=url.strip()
    print(url)
    urllib.request.urlretrieve(url,yeniad)  # this is the line that raises the error
But it gives an error like this:
Error:
HTTPError: Forbidden
#2
What gets printed from:
print(url)
And what the heck is this?
imdbUrl="htt"+"ps:"+"//alpha.wallhaven.cc"+"/random?page=4"
Why not:
imdbUrl = 'https://alpha.wallhaven.cc/random?page=4'
#3
According to the documentation there is no urllib.request.get() method; I didn't find it. There is urllib.request.urlopen().
It's better to use Requests.
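For comparison, here is a minimal sketch of both approaches (the URL is just the one from this thread, used as an example):

import urllib.request
import requests

url = 'https://alpha.wallhaven.cc/random?page=4'

# standard library: urllib.request has urlopen(), not get()
with urllib.request.urlopen(url) as response:
    html = response.read()

# third-party Requests: the simpler API
r = requests.get(url)
html = r.content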
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#4
(Jan-20-2017, 07:55 PM)Larz60+ Wrote: What gets printed from:
print(url)
It prints the image URL.

Larz60+ Wrote: And what the heck is this?
imdbUrl="htt"+"ps:"+"//alpha.wallhaven.cc"+"/random?page=4"
Why not:
imdbUrl = 'https://alpha.wallhaven.cc/random?page=4'
I wrote it that way because the forum won't let me post the thread when it contains a link, so I split the URL up as you see.

(Jan-20-2017, 08:43 PM)wavic Wrote: According to the documentation there is no urllib.request.get() method; I didn't find it. There is urllib.request.urlopen(). It's better to use Requests.

Please write it for me; I can't understand anything.
#5
They are blocking urllib, but it works with Requests (which you should use anyway).
Quote: Please write it for me; I can't understand anything.
You should try yourself, but to be nice, here is how to download one image.
Always do a test like this before you write a loop.
from bs4 import BeautifulSoup
import requests
import os

page = 4
url = 'https://alpha.wallhaven.cc/random?page={}'.format(page)
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
# Parse the first lazyload thumbnail and derive the full-size image URL
kelimeler = soup.find("img", {"class":"lazyload"})
img_nr = os.path.basename(kelimeler['data-src'])
img_nr = img_nr.split('-')[-1]
img_large = 'https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-{}'.format(img_nr)
# Download the image bytes and write them to disk
down_link = requests.get(img_large)
with open(img_nr, "wb") as img_obj:
    img_obj.write(down_link.content)
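Once that single-image test works, a loop over all the thumbnails follows the same pattern; a sketch along those lines (untested against the live site):

from bs4 import BeautifulSoup
import requests
import os

page = 4
url = 'https://alpha.wallhaven.cc/random?page={}'.format(page)
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
# Loop over every lazyload thumbnail on the page
for thumb in soup.find_all("img", {"class": "lazyload"}):
    img_nr = os.path.basename(thumb['data-src']).split('-')[-1]
    img_large = 'https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-{}'.format(img_nr)
    down_link = requests.get(img_large)
    with open(img_nr, "wb") as img_obj:
        img_obj.write(down_link.content)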
#6
Hey there! I'm about to try your script out, but I'm almost 100% sure I know what's going on. I have to admit I gave BeautifulSoup a once-over years ago and have been married to Scrapy (we have a special connection... lol)... BUT when you're doing your parsing, in Scrapy's case (as with BeautifulSoup) there's a default header or "User-Agent" profile. Hmmm... can't be much different...


Just google "adding a user-agent header to BeautifulSoup" and tada!

# After importing what you need... you can list multiple header profiles:
user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11)'
    'Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19'
]

# Then, when calling the start or base URL, you pass headers=... Using choice() to randomize, you can rotate this list and be... well, not sneaky, because unless you're proxying it's not necessary.

# For each URL entry of a row in the text file, get
# lead info from yelp related to that URL...
from random import choice

for dat in linksandsuch:
    version = choice(user_agents)
    headers = {'User-Agent': version}


##### What would I do? Just a single agent defined by a header value:
# for dat in linksandsuch:
#     headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'}
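Put together, the idea above might look like this with Requests (the URL list is a placeholder, just to show where headers= goes):

from random import choice
import requests

user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19',
    'Opera/9.25 (Windows NT 5.1; U; en)',
]

# placeholder list of image URLs, just for illustration
image_urls = ['https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-293122.jpg']

for n, link in enumerate(image_urls, 1):
    headers = {'User-Agent': choice(user_agents)}  # pick a random agent per request
    response = requests.get(link, headers=headers)
    with open('{}.jpg'.format(n), 'wb') as img_obj:
        img_obj.write(response.content)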
If I'm wrong, shoot me a message, yes? I'm having an issue with Scrapy's image download function (specifically the renaming of the image, not the download) and I could script something real quick for ya... but teach a man to fish, right? lol

Wait... I'm noticing your download method... are you writing the image?

One Google search and 30 seconds later...

In Python 3.x, urllib.request.urlretrieve can be used to download files from any remote URL.

Not sure where you got that download method; I'm guessing it works if you're writing directly from the URL you called it from... here you're trying to get the img to respond like it was a page... but forbidden? w/e lol. Try urlretrieve for your download function; google what you must.

----
# Edit update!

So I went ahead and ran your script... downloaded one image... lol, but no 403...? Maybe your IP got blocked? Try adding delays to your script and lowering your throttle.
#7
You are missing a comma after the first user-agent string.
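Without the comma, Python silently joins the two adjacent string literals into one list entry; a quick illustration:

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11)'
    'Gecko/20071127 Firefox/2.0.0.11',  # these two literals become ONE string
    'Opera/9.25 (Windows NT 5.1; U; en)',
]
print(len(user_agents))  # 2, not 3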
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#8
@scriptso you write a little messy ;)
You are right that setting a user-agent header can solve it for urllib.
But the clear message is that urllib should not be used when we have Requests.

You can fix urlretrieve() by using opener.retrieve(),
which can take a user-agent header.
>>> import urllib.request
>>> img = 'https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-293122.jpg'
>>> urllib.request.urlretrieve(img, '1.jpg')
Traceback (most recent call last):  
urllib.error.HTTPError: HTTP Error 403: Forbidden

>>> # Fix it
>>> opener = urllib.request.FancyURLopener({}) 
>>> opener.version = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.69 Safari/537.36'
>>> opener.retrieve(img, '1.jpg')
('1.jpg', <http.client.HTTPMessage object at 0x038F2210>)
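Note that FancyURLopener is deprecated in Python 3; if you have to stay with the standard library, the same user-agent fix can be done with a Request object (a minimal sketch of the equivalent approach):

import urllib.request

img = 'https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-293122.jpg'
# attach the user-agent header to the request itself
req = urllib.request.Request(
    img, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64)'})
with urllib.request.urlopen(req) as response, open('1.jpg', 'wb') as img_obj:
    img_obj.write(response.read())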
#9
(Jan-21-2017, 10:03 AM)snippsat Wrote: @scriptso you write a little messy ;) You are right that setting a user-agent header can solve it for urllib. But the clear message is that urllib should not be used when we have Requests. You can fix urlretrieve() by using opener.retrieve(), which can take a user-agent header.

LMAO! I get that a lot =( ... product of insomnia + scatter-brain... good stuff! I totally mixed up your fix and the original poster... maybe it is time for sleep X_x I 'derped' that up. *sigh*

