Python Forum
Two part scraping?
#1
Hi,

I'm trying to figure out two things:
1st) Scrape a page for all its news links.
2nd) Using those scraped links, open each link and scrape that page for title/article/image.

I'm totally new to Python and trying to pick up little things here and there.

I have some code.

This bit of code will get the title and content, but I can't figure out how to get the image URL; I've not put any working code for that into this example.

from bs4 import BeautifulSoup
import requests, openpyxl

excel = openpyxl.Workbook()
print(excel.sheetnames)
sheet = excel.active
sheet.title = "News"
print(excel.sheetnames)
sheet.append(['title', 'body', 'image'])

source = requests.get('https://portswigger.net/daily-swig/couple-charged-with-laundering-proceeds-from-4-5bn-bitfinex-cryptocurrency-hack')
source.raise_for_status()

soup = BeautifulSoup(source.text, 'html.parser')
text = soup.find_all(class_="post-card")
for news in text:
    title = news.find('h1').text
    body = news.find(class_="post-content").text
    image = ''  # this is the part I can't figure out

    print(title, body, image)
    sheet.append([title, body, image])

excel.save('news1.xlsx')
As for scraping the URLs, I'm not having much luck.

The site is https://portswigger.net/daily-swig/dark-web

The code is this, though it isn't working:
from bs4 import BeautifulSoup
import requests, openpyxl


source = requests.get('https://portswigger.net/daily-swig/dark-web')
source.raise_for_status()

soup = BeautifulSoup(source.text, 'html.parser')
text = soup.find_all('div', class_="tile-container is-absolute dailyswig size0 style1 textstyle7")

print(text)
But even if this were working, I still don't know how I can take those links and feed them into the script to scrape the title/content/image and put that into an xlsx file.

Any help would be great.
Thanks.
#2
(Feb-22-2022, 02:36 PM)never5000 Wrote: but I can't figure out how to get the image URL
Don't mess with saving to Excel or looping before you know how to get the content.
Under the post-card class there is one image. Getting its `src` back gives a relative URL, so you need to join it with the base URL to get a working link.
Example:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://portswigger.net/daily-swig/couple-charged-with-laundering-proceeds-from-4-5bn-bitfinex-cryptocurrency-hack')
soup = BeautifulSoup(source.content, 'html.parser')
card_class = soup.find_all(class_="post-card")
>>> img = card_class[0].select_one('img')
>>> img
<img alt="Husband and wife charged with laundering proceeds from $4.5bn Bitfinex cryptocurrency hack " src="/cms/images/fa/5e/29b8-article-220209-bitfinex.png" title="Vladimir Kazakov / Shutterstock"/>
>>> img = card_class[0].select_one('img').get('src')
>>> img
'/cms/images/fa/5e/29b8-article-220209-bitfinex.png'

>>> base_url = 'https://portswigger.net'
>>> img_link = f'{base_url}{img}'
>>> img_link
'https://portswigger.net/cms/images/fa/5e/29b8-article-220209-bitfinex.png'
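Pulling those steps back into a single place, here is a minimal sketch of a parse helper. The function name `parse_article` and the sample HTML are my own for illustration, and I use `urllib.parse.urljoin` instead of an f-string so it also copes with `src` values that are already absolute:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://portswigger.net'

def parse_article(html, base_url=BASE_URL):
    """Return (title, body, image_link) from one article page."""
    soup = BeautifulSoup(html, 'html.parser')
    card = soup.find(class_='post-card')
    title = card.find('h1').get_text(strip=True)
    body = card.find(class_='post-content').get_text(strip=True)
    img = card.select_one('img')
    # img['src'] is relative, e.g. '/cms/images/...'; join it with the base URL
    image_link = urljoin(base_url, img.get('src')) if img else ''
    return title, body, image_link

# quick check against a stripped-down sample of the page structure
sample = '''
<div class="post-card">
  <h1>Bitfinex hack</h1>
  <div class="post-content">Couple charged.</div>
  <img src="/cms/images/fa/5e/29b8-article-220209-bitfinex.png">
</div>
'''
print(parse_article(sample))
```

Testing against a small HTML string like this, with no network call, is exactly the "test small first" idea above.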
Quote:But even if this were working I still don't know how I can take those links and feed them into the script to scrape the title/content/image and put that into an xlsx file.
As mentioned, you start by testing small. Also try turning off JavaScript in the browser for this site https://portswigger.net/daily-swig/dark-web
Error:
This page requires JavaScript for an enhanced user experience.
So the listing page is rendered with JavaScript, and requests/BS alone won't see the links; then you need another tool like Selenium.
An example: Web-scraping part-2.
Or search the site; there are hundreds of posts on how to use Selenium alone, or how to hand the page source over to BS for parsing.