Python Forum

Full Version: Two part scraping?
Hi,

I'm trying to figure out two things:
1st) Scrape a page for all of its news links.
2nd) Using those scraped links, open each one and scrape that page for the title/article/image.

I'm totally new to Python and trying to pick up little things here and there.

I have some code.

This bit of code will get the title and content, but I can't figure out how to get the image URL, so I've not put any code for that into this example.

from bs4 import BeautifulSoup
import requests, openpyxl

excel = openpyxl.Workbook()
print(excel.sheetnames)
sheet = excel.active
sheet.title = "News"
print(excel.sheetnames)
sheet.append(['title', 'body', 'image'])

source = requests.get('https://portswigger.net/daily-swig/couple-charged-with-laundering-proceeds-from-4-5bn-bitfinex-cryptocurrency-hack')
source.raise_for_status()

soup = BeautifulSoup(source.text, 'html.parser')
text = soup.find_all(class_="post-card")
for news in text:
    title = news.find('h1').text
    body = news.find(class_="post-content").text
    image = ''  # placeholder - this is the part I can't figure out

print(title,body, image)
sheet.append([title, body, image])

excel.save('news1.xlsx')
As for scraping the URLs, I'm not having much luck.

The site is https://portswigger.net/daily-swig/dark-web

The code is this, though nothing is working:
from bs4 import BeautifulSoup
import requests, openpyxl


source = requests.get('https://portswigger.net/daily-swig/dark-web')
source.raise_for_status()

soup = BeautifulSoup(source.text, 'html.parser')
text = soup.find_all('div', class_="tile-container is-absolute dailyswig size0 style1 textstyle7")

print(text)
But even if this were working, I still don't know how I can take those links, feed them into the script to scrape the title/content/image, and put all of that into an xlsx file.

Any help would be great.
Thanks.
(Feb-22-2022, 02:36 PM)never5000 Wrote: but I can't figure out how to get the image URL
Don't mess with saving to Excel or loops before you know how to get the content.
Under the post-card class there is one image; get its src back. That is a relative URL, so it needs to be joined with the base URL to get a working link.
Example:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://portswigger.net/daily-swig/couple-charged-with-laundering-proceeds-from-4-5bn-bitfinex-cryptocurrency-hack')
soup = BeautifulSoup(source.content, 'html.parser')
card_class = soup.find_all(class_="post-card")
>>> img = card_class[0].select_one('img')
>>> img
<img alt="Husband and wife charged with laundering proceeds from $4.5bn Bitfinex cryptocurrency hack " src="/cms/images/fa/5e/29b8-article-220209-bitfinex.png" title="Vladimir Kazakov / Shutterstock"/>
>>> img = card_class[0].select_one('img').get('src')
>>> img
'/cms/images/fa/5e/29b8-article-220209-bitfinex.png'

>>> base_url = 'https://portswigger.net'
>>> img_link = f'{base_url}{img}'
>>> img_link
'https://portswigger.net/cms/images/fa/5e/29b8-article-220209-bitfinex.png'
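As a side note, instead of joining with an f-string, the standard library's urllib.parse.urljoin does the same job and also copes with relative paths that lack a leading slash. A minimal sketch with the src value from above:

```python
from urllib.parse import urljoin

base_url = 'https://portswigger.net'
img = '/cms/images/fa/5e/29b8-article-220209-bitfinex.png'

# urljoin resolves the relative path against the base URL
img_link = urljoin(base_url, img)
print(img_link)  # https://portswigger.net/cms/images/fa/5e/29b8-article-220209-bitfinex.png
```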
Quote:But even if this was working I still don't know how I can take those links and feed them into the script to scrape the title/content/image and put that into a xlsx file.
As mentioned, start by testing small, and try turning off JavaScript in your browser for this site: https://portswigger.net/daily-swig/dark-web
Error:
This page requires JavaScript for an enhanced user experience.
Then you need another tool, like Selenium.
An example:
Web-scraping part-2.
Or search the site; there are hundreds of posts on how to use Selenium, either alone or handing the page content to BeautifulSoup for scraping.
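To tie the two parts together once you have the list of article links (however you extract them), the overall loop looks something like the sketch below. The post-card/post-content class names come from the snippets earlier in the thread; scrape_article, ARTICLE_HTML, and the example link are made-up names so the sketch runs offline — in the real script you would fetch each link with requests (or Selenium for the JavaScript-rendered index page) instead of using the inline HTML stand-in.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import openpyxl

BASE_URL = 'https://portswigger.net'

# Stand-in for the HTML of one article page, normally requests.get(link).text
ARTICLE_HTML = """
<div class="post-card">
  <h1>Example headline</h1>
  <img src="/cms/images/example.png">
  <div class="post-content">Example article body.</div>
</div>
"""

def scrape_article(html):
    # Pull title/body/image out of one article page
    card = BeautifulSoup(html, 'html.parser').find(class_='post-card')
    title = card.find('h1').get_text(strip=True)
    body = card.find(class_='post-content').get_text(strip=True)
    # img src is relative, so join it with the base URL
    image = urljoin(BASE_URL, card.find('img').get('src'))
    return title, body, image

excel = openpyxl.Workbook()
sheet = excel.active
sheet.title = 'News'
sheet.append(['title', 'body', 'image'])

# In the real script this list comes from scraping the index page
links = ['https://portswigger.net/daily-swig/example-article']

for link in links:
    # html = requests.get(link).text   # real fetch, one request per link
    html = ARTICLE_HTML                # offline stand-in for this sketch
    sheet.append(list(scrape_article(html)))

excel.save('news1.xlsx')
```

Each article becomes one row in the sheet, so the Excel part of the question is just an append inside the loop.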