Image Scraper (BeautifulSoup) stopped working, need help seeing why
#1
I wrote a little script 6 months or so ago with some help from a friend. It looked at a website and got the images from it. It used to work, but stopped working a week or so ago, on any machine I have. I'm really new to all this, and had to piece together the first one I wrote before we got it cleaned up.

I don't get any error message at all. So it's hard to troubleshoot what could have changed.

Here's the website I'm trying to get images from:
https://archive.4plebs.org/hr/thread/2866456/

Here is the code I've been using. I went through lots of iterations but this was the final one I had.

##########################################
#######    This section is for the main imports
import requests
import wget
import os

from bs4 import BeautifulSoup
from tqdm import tqdm
from urllib.parse import urljoin, urlparse
from time import time
from multiprocessing.pool import ThreadPool
from concurrent.futures import ThreadPoolExecutor
from time import sleep

##########################################
#######    This section is for choosing the site and save folder
url = ''
folder = ''

url = input("Website:")
folder = input("Folder:")

##########################################
#######    This section I have NO idea what it does.  :)  Sets parser for sure
#######    (It downloads the page HTML and parses it into a BeautifulSoup tree.)
r = requests.get(url, stream=True)
data = r.text
soup = BeautifulSoup(data, features="lxml")

##########################################
#######    This section grabs all pictures tagged download and makes folders
for tag in soup.select('a.parent[download]'):
    dlthis = ('https:' + tag['href'])
    path = os.path.join(folder, tag['download'])
    myfile = requests.get(dlthis, allow_redirects=True, stream = True)
    if not os.path.isdir(folder):
        os.makedirs(folder)

##########################################
#######    Section for Saving Files, both work    
#    with open(path, 'wb') as f:
#        f.write(myfile.content)
    open(path, 'wb').write(myfile.content)
    
##########################################
I have iterations that do multi-threading, and basic ones that just print out the links. But I can't seem to get it to show anything at all. I'm sure it has something to do with the request and the BeautifulSoup parse.

Any help you can give would be awesome. Thank You!

So, before I posted this, I wanted to make sure I tested everything I knew to test, and I played around with it a little more. It looks like there is a security feature installed now, probably to block exactly what I'm trying to do... So, is there any way around it? Or a better way to pull pictures? Here's what I'm seeing:

<h1>Access denied</h1>
  <p>This website is using a security service to protect itself from online attacks.</p>
  <ul class="cferror_details">
    <li>Ray ID: 60c8a5d2cc2b3a02</li>
    <li>Timestamp: 2021-01-04 23:13:01 UTC</li>
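For anyone else hitting this: a quick sketch of how to surface the block, since the script itself fails silently. It reuses the same requests call as my script; I'm assuming the block comes back as a 403 here.

import requests

url = 'https://archive.4plebs.org/hr/thread/2866456/'
r = requests.get(url, stream=True)
print(r.status_code)   # likely 403 once the block kicks in, instead of 200
print(r.text[:300])    # start of the body shows the "Access denied" page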
Thanks
#2
Try setting a User-Agent.
Here's a quick test.
import requests
from bs4 import BeautifulSoup

url = 'https://archive.4plebs.org/hr/thread/2866456/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
img_test = soup.select_one('div.thread_image_box > a > img')
print(img_test)
print(img_test.get('src'))
Output:
/hr/ - High Resolution » Thread #2866456
<img class="thread_image" data-md5="6SrRaBpXnbWSj3IZ9bTr+g==" height="176" loading="lazy" src="https://i.4pcdn.org/hr/1487896415236s.jpg" width="250"/>
https://i.4pcdn.org/hr/1487896415236s.jpg
#3
That's interesting. I'll have to figure out how to work that into the script. It looks like it might work. Thanks for the response, and I'll follow up if I can't figure out how it all works. :)
#4
Actually, that works great, but it doesn't get the full-sized image behind the smaller image. So, here is the link for the small one:

https://i.4pcdn.org/hr/1487896415236s.jpg

but here is the file I want:

https://i.4pcdn.org/hr/1487896415236.jpg

So, I'll have to figure that out unless you know a quick way to manipulate it. I haven't fully tested the script you gave me yet, but I did notice it wasn't getting the full-sized one.

<a href="https://i.4pcdn.org/hr/1487896415236.jpg" target="_blank" rel="noreferrer" class="thread_image_link">
    <img loading="lazy" src="https://i.4pcdn.org/hr/1487896415236s.jpg" width="250" height="176" class="thread_image" data-md5="6SrRaBpXnbWSj3IZ9bTr+g==" />
That's how it shows in the source. Unfortunately, I'm not good enough to do much with that.
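One idea I had, though I haven't tested it: since the thumbnail name just adds an "s" before ".jpg", maybe the full-size URL can be rebuilt from the img src, something like:

# Untested guess: drop the trailing "s" from the thumbnail name,
# assuming every thumbnail URL ends in "s.jpg".
thumb = 'https://i.4pcdn.org/hr/1487896415236s.jpg'
if thumb.endswith('s.jpg'):
    full = thumb[:-len('s.jpg')] + '.jpg'
    print(full)   # https://i.4pcdn.org/hr/1487896415236.jpg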
#5
(Jan-05-2021, 12:00 AM)woodmister Wrote: So, I'll have to figure that out unless you know a quick way to manipulate it. I haven't fully tested the script you gave me yet, but I did notice it wasn't getting the full-sized one.
My code was just a quick test to confirm that the User-Agent works.
The full-sized images are under the a tag: href is the large image, and the img src inside it is the small thumbnail.
So, adjusting a little:
import requests
from bs4 import BeautifulSoup

url = 'https://archive.4plebs.org/hr/thread/2866456/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
img_all = soup.select('div.thread_image_box > a')
for img in img_all:
    print(img.get('href'))
Output:
https://i.4pcdn.org/hr/1487896415236.jpg
https://i.4pcdn.org/hr/1487896485361.jpg
https://i.4pcdn.org/hr/1487896543620.jpg
https://i.4pcdn.org/hr/1487896605850.jpg
https://i.4pcdn.org/hr/1487896666111.jpg
https://i.4pcdn.org/hr/1487896726234.jpg
.....
#6
Hey man, thanks so much for your help. Seeing how you're doing stuff is helping me pick up some of the things I've been missing. There's one part I'm not getting: I'm trying to grab a name for the download file, and I can't see how to grab the "download" attribute from the link. It's not in the "img". I'd be fine ID'ing it with something different, except I can't seem to grab any other attribute like "href" from the "img". Here's what I'm seeing when I print it:

<a class="thread_image_link" href="https://i.4pcdn.org/hr/1487897017177.jpg" rel="noreferrer" target="_blank"> <img class="lazyload post_image" data-md5="eZ7TVjlctRtTdBUtec1WKQ==" height="94" loading="lazy" src="https://i.4pcdn.org/hr/1487897017177s.jpg" width="124"/> </a>
When I look at the source file, I see that there is a "download" attribute with the image name:

target="_blank" class="btnr parent">SauceNAO</a><a href="https://trace.moe/?url=https://i.4pcdn.org/hr/1487896485361.jpg" target="_blank" class="btnr parent">Trace</a><a href="//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg" download="baboo.jpg" class="btnr parent"><i class="icon-download-alt">
But I can't find a way to grab that one. I'm sure it has to do with the initial
img_all = soup.select('div.thread_image_box > a')
Here's the link again:
https://archive.4plebs.org/hr/thread/2866456/

But I can't see how to change it to work. Is there any way to grab that download attribute?
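The only idea I've had: my old script selected these buttons with "a.parent[download]", so maybe that selector still works once the User-Agent header is set. An untested sketch, reusing your setup:

import requests
from bs4 import BeautifulSoup

url = 'https://archive.4plebs.org/hr/thread/2866456/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
# Each download button should carry the file URL (href, scheme-relative)
# and the suggested filename (the "download" attribute).
for tag in soup.select('a.parent[download]'):
    print(tag['download'], 'https:' + tag['href'])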

How would I remove the first part of the link, if I wanted to just use the existing number.jpg name?
like:

//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg

becomes:

1487896485361.jpg
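Maybe just split on the last slash? Another untested guess:

# Untested: keep only the part after the last "/"
link = '//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg'
print(link.rsplit('/', 1)[-1])   # 1487896485361.jpg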


I keep tinkering, and I'm able to run it now; I just add a name and a number and have it increment. One issue, though: if the file isn't a .jpg, my method would totally bork it up! So being able to just grab the file name and extension would be super helpful. But here's what I have now.

The code:

##########################################
#######    This section is for the main imports
import requests
import wget
import os

from bs4 import BeautifulSoup
from tqdm import tqdm
from urllib.parse import urljoin, urlparse
from time import time
from multiprocessing.pool import ThreadPool
from concurrent.futures import ThreadPoolExecutor
from time import sleep

##########################################
#######    This section is for choosing the site and save folder
url = ''
folder = ''
name = ''
number = 1

url = input("Website:")
folder = input("Folder:")
name = input("Name:")



headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
##########################################
#######    Fetch the page (now with the User-Agent header) and parse it
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
img_all = soup.select('div.thread_image_box > a')



##########################################
#######    This section loops the image links, names them, and downloads each one
for img in img_all:
    dlthis = img.get('href')
    # build the save name as "<name> <number>.jpg"
    namestr = name + ' ' + str(number) + '.jpg'
    path = os.path.join(folder, namestr)
    myfile = requests.get(dlthis, allow_redirects=True, stream=True)
    if not os.path.isdir(folder):
        os.makedirs(folder)

##########################################
#######    Section for Saving Files, both work    
#    with open(path, 'wb') as f:
#        f.write(myfile.content)
    open(path, 'wb').write(myfile.content)
    
    number = number + 1

##########################################
#7
I can continue my code a little so you can see how I would download the images.
import requests
from bs4 import BeautifulSoup
from os import path

url = 'https://archive.4plebs.org/hr/thread/2866456/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
img_all = soup.select('div.thread_image_box > a')
for img in img_all:
    ref = img.get('href')
    img_down = requests.get(ref)
    print(f'Download --> {path.basename(ref)}')
    with open(path.basename(ref), 'wb') as f_out:
        f_out.write(img_down.content)
Output:
Download --> 1487896415236.jpg
Download --> 1487896485361.jpg
Download --> 1487896543620.jpg
Download --> 1487896605850.jpg
Download --> 1487896666111.jpg
.....
#8
Holy crap dude. haha. seriously, holy crap. hahaha. It's beautiful. You must just laugh at how easy that is. You're good man. Mind Blown.

Since I got the other one to work-ish, I started looking at fixing the multi-threaded one of the same model. I wasn't as much help in this one. Take a look?

Original code:

##########################################
#######    This section is for the main imports

import requests
import os
from bs4 import BeautifulSoup
from tqdm import tqdm
from multiprocessing.pool import ThreadPool

def save_image(tag):
    dlthis = ('https:' + tag['href'])
    print(dlthis)
    path = os.path.join(folder, tag['download'])
    myfile = requests.get(dlthis, allow_redirects=True, stream = True)
    ##########################################
    #######    Section for Saving Files, both work  
    #    with open(path, 'wb') as f:
    #        f.write(myfile.content)
    open(path, 'wb').write(myfile.content)
    ##########################################


if __name__ == '__main__':
    ##########################################
#######    This section is for choosing the site and save folder
    url = ''
    folder = ''

    url = input("Website:")
    folder = input("Folder:")

    if not os.path.isdir(folder):
        os.makedirs(folder)

    ##########################################
    #######    This section I have NO idea what it does.  :)  Sets parser for sure
    r  = requests.get(url, stream = True)
    data = r.text
    soup = BeautifulSoup(data, features = "lxml")

    ##########################################
    #######    This section grabs all pictures tagged download and makes folders

    images = soup.select('a.parent[download]')
    ThreadPool().map(save_image, images)



And my bastardized way of trying to get your fix to work on it.


##########################################
#######    This section is for the main imports

import requests
import os
from bs4 import BeautifulSoup
from tqdm import tqdm
from multiprocessing.pool import ThreadPool

def save_image(tag):
    dlthis = (img.get('href'))
    strnum = str(number)
    newnum = " " + strnum
    namestr = name + newnum + ".jpg"
    path = os.path.join(folder, namestr)
    myfile=requests.get(dlthis, allow_redirects=True, stream = True)
    ##########################################
    #######    Section for Saving Files, both work  
    #    with open(path, 'wb') as f:
    #        f.write(myfile.content)
    open(path, 'wb').write(myfile.content)
    ##########################################


if __name__ == '__main__':
    ##########################################
#######    This section is for choosing the site and save folder
    url = ''
    folder = ''
    name = ''
    number = 1
    
    url = input("Website:")
    folder = input("Folder:")
    name = input("Name:")
    
    
    
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}

    if not os.path.isdir(folder):
        os.makedirs(folder)

    ##########################################
    #######    This section I have NO idea what it does.  :)  Sets parser for sure
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')

    ##########################################
    #######    This section grabs all pictures tagged download and makes folders

    images = soup.select('div.thread_image_box > a')
    ThreadPool().map(save_image, images)
I think there is an issue with the ".map" and then the "img.get" in the function.
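My best guess at a minimal fix, untested: make save_image use the tag argument that .map actually passes in, and take the file name from the URL itself so the threads don't need a shared counter:

def save_image(tag):
    # use the mapped-in tag, not a nonexistent global "img"
    dlthis = tag.get('href')
    namestr = dlthis.rsplit('/', 1)[-1]   # e.g. 1487896485361.jpg
    path = os.path.join(folder, namestr)
    myfile = requests.get(dlthis, allow_redirects=True, stream=True)
    with open(path, 'wb') as f:
        f.write(myfile.content)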
#9
I can show an example, and I'll use my code ;)
Here I use concurrent.futures:

import requests
from bs4 import BeautifulSoup
from os import path
import concurrent.futures

def read_url(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')
    img_all = soup.select('div.thread_image_box > a')
    return img_all

def img_multi(img_link):
    print(f'Download --> {path.basename(img_link)}')
    with open(path.basename(img_link), 'wb') as f_out:
        f_out.write(requests.get(img_link).content)

if __name__ == '__main__':
    url = 'https://archive.4plebs.org/hr/thread/2866456/'
    img_all = read_url(url)
    # ThreadPoolExecutor | ProcessPoolExecutor
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        for img in img_all:
            img_link = img.get('href')
            executor.submit(img_multi, img_link)
#10
Man, that's awesome. I love the concurrent.futures part! I went ahead and bastardized your code too, to ask for a folder name and then change the name a titch. This is what I ended up with.

import requests
from bs4 import BeautifulSoup
from os import path
import os
import concurrent.futures
 
url = input("Website:")
folder = input("Folder:")

def read_url(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')
    img_all = soup.select('div.thread_image_box > a')
    return img_all
 
def img_multi(img_link):
    name = folder + ' - ' + path.basename(img_link)
    print(f'Download --> {name}')
    dlpath = os.path.join(folder, name)
    with open(dlpath, 'wb') as f_out:
        f_out.write(requests.get(img_link).content)
 
if __name__ == '__main__':
    img_all = read_url(url)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    # ThreadPoolExecutor | ProcessPoolExecutor
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        for img in img_all:
            img_link = img.get('href')
            executor.submit(img_multi, img_link)
It took me far longer than you could probably imagine to figure out how to actually get it to ask for a name and then use it. But it's there! Thanks for sharing your code!