Python Forum
Downloading Page Source From URL List
Some advice: use Requests rather than urllib.
Also, saving the full HTML source of a site is usually not that useful; have a look at this recent thread, which also uses BeautifulSoup.

Something like this can do it. The code uses pathlib's stem to get the folder name from the URL, which avoids stuff like myfold = myurl[9:11].
>>> import pathlib
>>> 
>>> url = 'https://abc.com'
>>> folder_name = pathlib.Path(url).stem
>>> folder_name
'abc'
>>> url = 'https://python-forum.io/'
>>> folder_name = pathlib.Path(url).stem
>>> folder_name
'python-forum'
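One caveat with stem: it works on the last component of the URL, so it only gives the site name when nothing follows the domain. For a URL such as https://books.toscrape.com/catalogue/page-2.html the stem is 'page-2'. A minimal sketch of an alternative, assuming one folder per host is what you want, is to take the hostname with urllib.parse.urlsplit from the standard library. The helper name folder_name_from_url is made up for this example and is not part of the script in this post.

```python
from urllib.parse import urlsplit


def folder_name_from_url(url):
    """Return the host part of a URL for use as a folder name.

    Hypothetical helper for illustration, not part of the original script.
    """
    host = urlsplit(url).hostname
    if host is None:
        # Happens when the scheme is missing, e.g. 'abc.com'
        raise ValueError(f"No hostname found in {url!r}")
    return host


print(folder_name_from_url('https://python-forum.io/'))
print(folder_name_from_url('https://books.toscrape.com/catalogue/page-2.html'))
```

Note that this keeps the full host ('python-forum.io' rather than 'python-forum'), so the folder names differ from the ones stem produces.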
import requests
from pathlib import Path

class URLSiteSaver:
    def __init__(self, file_path, save_path):
        self.file_path = Path(file_path)
        self.save_path = Path(save_path)
        # Ensure the base save directory exists
        self.save_path.mkdir(parents=True, exist_ok=True)

    def fetch_save(self, url):
        try:
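            # A timeout keeps one unresponsive site from hanging the whole run
            response = requests.get(url, timeout=10)
            # Raise requests.exceptions.HTTPError for 4xx/5xx responses
            response.raise_for_status()
            # Create a directory named after the site
            folder_name = Path(url).stem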
            site_folder = self.save_path / folder_name
            site_folder.mkdir(parents=True, exist_ok=True)
            html_file_path = site_folder / 'index.html'
            # Save the HTML content to a file in the site's directory
            with html_file_path.open('wb') as html_file:
                html_file.write(response.content)
            print(f"Saved content to {site_folder}")
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {url}: {e}")

    def process_urls(self):
        with self.file_path.open() as fp:
            for url in fp:
                url = url.strip()
                # Skip blank lines in the URL file
                if not url:
                    continue
                print(f"Processing URL: {url}")
                self.fetch_save(url)

if __name__ == '__main__':
    file_path = r'G:\div_code\myurls.txt'
    save_path = r'G:\div_code\urlsites'
    url_saver = URLSiteSaver(file_path, save_path)
    url_saver.process_urls()
Output:
Processing URL: https://python-forum.io/
Saved content to G:\div_code\urlsites\python-forum
Processing URL: https://books.toscrape.com/
Saved content to G:\div_code\urlsites\books.toscrape

Messages In This Thread
RE: Downloading Page Source From URL List - by snippsat - Jun-06-2024, 07:24 PM
