Jan-05-2021, 04:10 PM
(This post was last modified: Jan-05-2021, 04:10 PM by woodmister.)
Hey man, thanks so much for your help. Seeing how you're doing things is helping me pick up some of what I'm missing. There's one part I'm not getting. I'm trying to grab a name for the downloaded file, but I can't see how to get the "download" attribute from the link. It's not on the "img". I'd be fine identifying the file some other way, except I can't seem to grab any other attribute like "href" from the "img" either. Here's what I'm seeing when I print it:
<a class="thread_image_link" href="https://i.4pcdn.org/hr/1487897017177.jpg" rel="noreferrer" target="_blank">
 <img class="lazyload post_image" data-md5="eZ7TVjlctRtTdBUtec1WKQ==" height="94" loading="lazy" src="https://i.4pcdn.org/hr/1487897017177s.jpg" width="124"/>
</a>
When I look at the page source, I see that there is a "download" attribute with the image name:
target="_blank" class="btnr parent">SauceNAO</a>
<a href="https://trace.moe/?url=https://i.4pcdn.org/hr/1487896485361.jpg" target="_blank" class="btnr parent">Trace</a>
<a href="//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg" download="baboo.jpg" class="btnr parent"><i class="icon-download-alt">
But I can't find a way to grab that one. I'm sure it has to do with the initial
img_all = soup.select('div.thread_image_box > a')
Here's the link again:
https://archive.4plebs.org/hr/thread/2866456/
But I can't see how to change it to work. Is there any way to grab that download tag?
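For reference, here is a rough sketch of one way the download attribute might be read. It assumes the download link is an `<a>` element that carries a `download` attribute, as in the source snippet quoted above. The sample HTML is trimmed down from that snippet, the real page may nest things differently, and the built-in `html.parser` is used only so the sketch runs without lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Page the links came from; used to complete scheme-relative hrefs ("//host/...").
page_url = "https://archive.4plebs.org/hr/thread/2866456/"

# Trimmed-down sample based on the page source quoted above (an assumption:
# the live page may contain more markup around these links).
html = """
<a href="https://trace.moe/?url=https://i.4pcdn.org/hr/1487896485361.jpg" target="_blank" class="btnr parent">Trace</a>
<a href="//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg" download="baboo.jpg" class="btnr parent"><i class="icon-download-alt"></i></a>
"""

soup = BeautifulSoup(html, "html.parser")

# The CSS attribute selector a[download] matches only <a> tags that have a
# download attribute, so the Trace link above is skipped.
downloads = [
    (urljoin(page_url, a["href"]), a["download"])
    for a in soup.select("a[download]")
]

for full_url, filename in downloads:
    print(full_url, "->", filename)
```

The `urljoin` call is there because the `href` on the download link starts with `//` and has no scheme, which `requests.get` would not accept as-is.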
How would I remove the first part of the link, if I wanted to just use the existing number.jpg name?
like:
//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg
becomes:
1487896485361.jpg
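A minimal sketch of that trimming step, using only the standard library. The helper name `filename_from_url` is made up for illustration; it simply takes the last segment of the URL's path.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of a URL (hypothetical helper)."""
    # urlparse also copes with scheme-relative links that start with "//".
    path = urlparse(url).path
    # URL paths always use forward slashes, so posixpath is used instead of os.path.
    return posixpath.basename(path)


print(filename_from_url("//archive.4plebs.org/dl/hr/image/1487/89/1487896485361.jpg"))
```

Because the extension is part of the last path segment, this keeps whatever extension the file already has.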
I keep tinkering, and I'm able to run it now by just adding a name and a number and having it increment. One issue, though: if the file isn't a .jpg, my method would totally bork it up! So being able to grab just the file name and extension would be super helpful. Here's what I have now.
The code:
##########################################
####### This is section for the main imports
import requests
import wget
import os
from bs4 import BeautifulSoup
from tqdm import tqdm
from urllib.parse import urljoin, urlparse
from time import time
from multiprocessing.pool import ThreadPool
from concurrent.futures import ThreadPoolExecutor
from time import sleep
##########################################
####### This is section for choosing site and save folder
url = ''
folder = ''
name = ''
number = 1
url = input("Website:")
folder = input("Folder:")
name = input("Name:")
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
##########################################
#######
r = requests.get(url, headers=headers)
response = requests.get(url, headers=headers)
data = r.text
#soup = BeautifulSoup(data, features = "lxml")
soup = BeautifulSoup(response.content, 'lxml')
img_all = soup.select('div.thread_image_box > a')
##########################################
####### This section grabs all pictures tagged download and makes folders
#for img in img_all:
    #print(img.get('href'))
#for tag in soup.select('a.parent[download]'):
for img in img_all:
    dlthis = (img.get('href'))
    dlid = img
    #path = os.path.join(folder, tag['download'])
    strnum = str(number)
    newnum = " " + strnum
    namestr = name + newnum + ".jpg"
    #print(namestr)
    path = os.path.join(folder, namestr)
    #myfile = requests.get(dlthis, allow_redirects=True, stream = True)
    myfile=requests.get(dlthis, allow_redirects=True, stream = True)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    ##########################################
    ####### Section for Saving Files, both work
    # with open(path, 'wb') as f:
    #     f.write(myfile.content)
    open(path, 'wb').write(myfile.content)
    number = number + 1
##########################################
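The hard-coded ".jpg" in the naming step is what would break on other file types. One possible way around it, sketched here under assumptions, is to take the extension from the link itself and keep the name-plus-number scheme. The helper `numbered_name` and its `default_ext` fallback are hypothetical, and the second URL in the demo is a made-up example, not a link from the thread.

```python
import posixpath
from urllib.parse import urlparse


def numbered_name(name, number, url, default_ext=".jpg"):
    """Build 'name number.ext', taking the extension from the URL (hypothetical helper)."""
    # splitext returns (root, ext); ext includes the leading dot, or is '' if there is none.
    ext = posixpath.splitext(urlparse(url).path)[1]
    # Fall back to default_ext only when the URL has no extension at all.
    return f"{name} {number}{ext or default_ext}"


urls = [
    "https://i.4pcdn.org/hr/1487897017177.jpg",
    "https://example.com/pics/sample.png",  # made-up URL to show a non-.jpg case
]

for number, url in enumerate(urls, start=1):
    print(numbered_name("baboo", number, url))
```

In the loop above, the result would stand in for `namestr` before the `os.path.join(folder, namestr)` call.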