Hey there! I'm about to try your script out, but I'm almost 100% sure I know what's going on. I have to admit I only gave BeautifulSoup a once-over years ago and have been married to Scrapy (we have a special connection... lol)... BUT when you're doing your parsing, in Scrapy's case (same as with BeautifulSoup, or rather whatever HTTP library you fetch with), there's a default header, a "User-Agent" profile. Hmmm... can't be much different...
Just google "adding a user agent header to beautifulsoup" and tada! But here's the gist:
# After importing what you need... you can either list multiple header profiles...
from random import choice

user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) '
    'Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19'
]
# Then when calling the start or base URL you pass "headers=". Using random.choice you can rotate through this list and be... well, not sneaky exactly, because unless you're proxying it's not really necessary...
# for each url entry of a row in the text file, get
# lead info from yelp related to that url...
for dat in linksandsuch:
    version = choice(user_agents)
    headers = {'User-Agent': version}
##### What would I do? Just a single agent defined as the header value
##### (note it has to be a dict with a 'User-Agent' key, not a bare set):
#
# for dat in linksandsuch:
#     headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'}
#
....
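Putting the rotation idea together, here's a minimal sketch. Since you never named your HTTP library, I'm assuming stdlib urllib here; `fetch` and `make_headers` are just names I made up, and the UA strings are a shortened version of the list above:

```python
from random import choice
import urllib.request

# trimmed-down example list; use the full list from above
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) '
    'Chrome/18.0.1025.151 Safari/535.19',
    'Opera/9.25 (Windows NT 5.1; U; en)',
]

def make_headers():
    # pick a random profile for each request
    return {'User-Agent': choice(user_agents)}

def fetch(url):
    # send the request with our chosen User-Agent instead of urllib's default
    req = urllib.request.Request(url, headers=make_headers())
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Every call to `fetch` gets a fresh random agent, so repeated hits don't all look identical.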
If I'm wrong, shoot me a message, yes? I'm having an issue with Scrapy's image download function (specifically the renaming of the image, not the download itself) and I can script something real quick for ya... but teach a man to fish, right? lol
Wait... I'm noticing your download method... are you writing the image bytes out yourself?
One google search and 30 seconds later...
In Python 3.x, urllib.request.urlretrieve can be used to download files from any
remote URL:
Not sure where you got that download method; I'm guessing it works if you're writing directly from the URL you called it from... here you're trying to get the img to respond like it was a page... but forbidden? w.e lol. Try urlretrieve for your download function... google what you must.
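Something like this sketch (untested against your site; the UA string and filename are placeholders). One gotcha: `urlretrieve` doesn't take a `headers=` argument, so you install a global opener that carries the User-Agent:

```python
import urllib.request

# urlretrieve uses the globally installed opener, so attach the
# User-Agent there (urlretrieve itself has no headers= parameter)
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 '
                      '(KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19')]
urllib.request.install_opener(opener)

def download_image(url, filename):
    # fetches the URL and writes the bytes straight to disk
    path, response_headers = urllib.request.urlretrieve(url, filename)
    return path
```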
----
#Edit Update!
So I went ahead and ran your script... downloaded one image... lol, but no 505...??? Maybe your IP got blocked...??? Try adding delays to your script and lowering your throttle?
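For the delays, a minimal sketch you could drop into your loop. The 1-3 second range is an arbitrary guess (tune it to the site), and `polite_pause` is just a name I made up:

```python
import time
from random import uniform

def polite_pause(lo=1.0, hi=3.0):
    # sleep a random interval so requests don't arrive at a fixed rhythm,
    # which looks less bot-like than a constant delay
    delay = uniform(lo, hi)
    time.sleep(delay)
    return delay

# in your loop it would look something like:
# for dat in linksandsuch:
#     ...fetch / download...
#     polite_pause()
```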