Python Forum
Image Scraper (beautifulsoup), stopped working, need to help see why
#6
Hey man, thanks so much for your help. Seeing how you do things is helping me pick up some of what I'm missing. There's one part I'm not getting. I'm trying to grab a name for the downloaded file, but I can't see how to get the "download" attribute from the link. It isn't on the "img" element. I'd be fine identifying the file some other way, except I can't seem to read any other attribute, like "href", from the "img" either. Here's what I see when I print it:

<a class="thread_image_link" href="https://i.4pcdn.org/hr/1487897017177.jpg" rel="noreferrer" target="_blank"> <img class="lazyload post_image" data-md5="eZ7TVjlctRtTdBUtec1WKQ==" height="94" loading="lazy" src="https://i.4pcdn.org/hr/1487897017177s.jpg" width="124"/> </a>
When I look at the page source, I see that there is a "download" attribute holding the image name:

target="_blank" class="btnr parent">SauceNAO</a><a href="https://trace.moe/?url=https://i.4pcdn.org/hr/1487896485361.jpg" target="_blank" class="btnr parent">Trace</a><a href="//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg" download="baboo.jpg" class="btnr parent"><i class="icon-download-alt">
But I can't find a way to grab that one. I'm sure it has to do with the initial selector:
img_all = soup.select('div.thread_image_box > a')
Here's the link again:
https://archive.4plebs.org/hr/thread/2866456/

But I can't see how to change it to work. Is there any way to grab that download attribute?
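
For reference, here is a minimal sketch of one way the "download" value could be read. It assumes the download links are plain `<a>` elements carrying a `download` attribute, as in the fragment quoted above. `SAMPLE_HTML` and `find_downloads` are made-up names for illustration, and the markup is a trimmed copy of that fragment rather than the live page.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the markup quoted above (the real page has many more links).
SAMPLE_HTML = """
<div>
  <a href="https://trace.moe/?url=https://i.4pcdn.org/hr/1487896485361.jpg" target="_blank" class="btnr parent">Trace</a>
  <a href="//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg" download="baboo.jpg" class="btnr parent"><i class="icon-download-alt"></i></a>
</div>
"""


def find_downloads(html):
    """Return (href, download_name) pairs for every <a> with a download attribute."""
    # 'html.parser' is the stdlib backend; 'lxml' should behave the same way here.
    soup = BeautifulSoup(html, "html.parser")
    # [download] and [href] are CSS attribute selectors: they match when the
    # attribute is present, whatever its value is.
    links = soup.select("a[download][href]")
    return [(a["href"], a["download"]) for a in links]


for href, filename in find_downloads(SAMPLE_HTML):
    print(href, "->", filename)
```

The `href` on these links is protocol-relative (it starts with `//`), so it would probably need `urljoin(page_url, href)` before being handed to `requests.get`.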

How would I remove the first part of the link if I just wanted to use the existing number.jpg name?
Like:

//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg

becomes:

1487896485361.jpg
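
A minimal sketch of that trimming step, using only the standard library. `filename_from_url` is a hypothetical helper name, and the sketch assumes the file name is always the last segment of the URL path.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, e.g. '1487896485361.jpg'."""
    # urlparse separates the path from the host and any query string,
    # and it also copes with protocol-relative links that start with '//'.
    path = urlparse(url).path
    # URL paths always use '/', so posixpath is used rather than os.path.
    return posixpath.basename(path)


print(filename_from_url("//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg"))
# prints: 1487896485361.jpg
```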


I keep tinkering, and I'm able to run it now by just adding a name and a number and having it increment. One issue, though: if the file isn't a .jpg, my method would totally bork it up! So being able to grab just the file name and extension would be super helpful. Here's what I have now.

The code:

##########################################
#######    This is section for the main imports
import requests
import wget
import os

from bs4 import BeautifulSoup
from tqdm import tqdm
from urllib.parse import urljoin, urlparse
from time import time
from multiprocessing.pool import ThreadPool
from concurrent.futures import ThreadPoolExecutor
from time import sleep

##########################################
#######    This is section for choosing site and save folder
number = 1

url = input("Website:")
folder = input("Folder:")
name = input("Name:")

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
##########################################
#######    This is section for fetching and parsing the page
# Fetch the page once (the second, identical request was redundant).
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
img_all = soup.select('div.thread_image_box > a')



##########################################
#######    This section grabs all pictures tagged download and makes folders
#for img in img_all:
    #print(img.get('href'))

#for tag in soup.select('a.parent[download]'):
# Create the save folder once, before the loop.
if not os.path.isdir(folder):
    os.makedirs(folder)

for img in img_all:
    dlthis = img.get('href')
    #path = os.path.join(folder, tag['download'])
    # Build "<name> <number>.jpg" (the extension is hard-coded for now).
    namestr = name + " " + str(number) + ".jpg"
    path = os.path.join(folder, namestr)
    myfile = requests.get(dlthis, allow_redirects=True, stream=True)

##########################################
#######    Section for Saving Files
    # The with-block closes the file even if the write fails.
    with open(path, 'wb') as f:
        f.write(myfile.content)

    number = number + 1

##########################################
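
One way the hard-coded ".jpg" in the loop above might be avoided is to take the extension from the image URL itself. This is only a sketch: `numbered_name` and its `default_ext` fallback are hypothetical, and the `.png` URL is invented to show a non-jpg case.

```python
import posixpath
from urllib.parse import urlparse


def numbered_name(name, number, url, default_ext=".jpg"):
    """Build '<name> <number><ext>', taking the extension from the URL."""
    # splitext splits a path such as '/hr/1487897017177.jpg' into
    # ('/hr/1487897017177', '.jpg'); the extension is '' when there is none.
    ext = posixpath.splitext(urlparse(url).path)[1]
    # default_ext is an assumed fallback for links with no extension at all.
    return f"{name} {number}{ext or default_ext}"


print(numbered_name("baboo", 1, "https://i.4pcdn.org/hr/1487897017177.jpg"))
print(numbered_name("baboo", 2, "https://i.4pcdn.org/hr/1487897017177.png"))
```

In the loop, this would stand in for the `namestr = ...` line, called as `numbered_name(name, number, dlthis)`.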
Messages In This Thread
RE: Image Scraper (beautifulsoup), stopped working, need to help see why - by woodmister - Jan-05-2021, 04:10 PM
