Python Forum
Fetching html files from local directories
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Fetching html files from local directories
#1
import os
import urllib.request

# the path where the html is located
path = r"C:\Users\The Capricorn\Documents\Html"   


for filename in os.listdir(path):
    # Now we have to find the full path name of the files
    subpath = os.path.join(path,filename)
    if subpath.endswith('.html'):
            print(subpath)
            print('Reading',filename,'....')
            html = open(subpath,'r').read()
            if html:
                print('Successfully fetched Html')
    else:
        for file in os.listdir(subpath):
            # getting the full path of html file
            fullpath = os.path.join(subpath,file)
            if fullpath.endswith('.html'):
                print(fullpath)
                print('Reading',file,'....')
                html = open(fullpath,'r').read()
                if html:
                    print('Successfully fetched Html')
                
This code is to fetch local HTML files in the directory. This works fine when path contains only a single sub-folder inside it or no sub-folders but not when there are folders inside sub-folders as well and gives an error if files with different extension instead of Html are present inside path. What should I do to correct this?
Reply
#2
You use os.walk() which is recursive walking the whole tree.
Example:
import os

for root, dirs, files in os.walk(r'E:\1\web_title'):
    for file in files:
        if file.endswith('.html'):
            print(file)
Output:
Pipfile_1.html Pipfile_2.html Pipfile_3.html
join together then see that these files are in nested sub-folders.
import os

for root, dirs, files in os.walk(r'E:\1\web_title'):
    for file in files:
        if file.endswith('.html'):
            print(os.path.join(root, file))
Output:
E:\1\web_title\Pipfile_1.html E:\1\web_title\New folder\Pipfile_2.html E:\1\web_title\New folder\New folder\New folder\Pipfile_3.html
Reply
#3
The pathlib module exists since 3.4.
It gives you a better abstraction.
Written as a generator:

def find_by_ext(root, suffix):
    for root, dirs, files in os.walk(root):
        for file in files:
            path = pathlib.Path(root, file)
            if path.suffix == suffix:
                yield path
The argument root is the start point.
Suffix should be '.html' in your case.
The generator returns for each iteration a Path object.

To get the same behaviour, you can write a second function, which is
iterating over the generator:

def open_all_html(root):
    for file in find_by_ext(root, '.html'):
        try:
            data = file.open('r', encoding='utf-8', errors='ignore')
        except Exception as error:
            print('Could not open file {}. Error: {}'.format(file, error))
        else:
            print('Successfully opened file {}.'.format(file))
            # normally you do something with the data
            # this can also be put into a extra function
Calling it:

open_all_html('Downloads/')
Output:
Successfully opened file Downloads/asterisk-15.2.2/asterisk-15.2.2-summary.html. Successfully opened file Downloads/asterisk-15.2.2/third-party/pjproject/source/pjmedia/docs/footer.html. Successfully opened file Downloads/asterisk-15.2.2/third-party/pjproject/source/pjmedia/docs/header.html. Successfully opened file Downloads/asterisk-15.2.2/third-party/pjproject/source/pjnath/docs/footer.html. Successfully opened file Downloads/asterisk-15.2.2/third-party/pjproject/source/pjnath/docs/header.html. Successfully opened file Downloads/asterisk-15.2.2/third-party/pjproject/source/pjlib-util/docs/footer.html. Successfully opened file Downloads/asterisk-15.2.2/third-party/pjproject/source/pjlib-util/docs/header.html. Successfully opened file Downloads/asterisk-15.2.2/third-party/pjproject/source/pjsip/docs/footer.html. Successfully opened file Downloads/asterisk-15.2.2/third-party/pjproject/source/pjsip/docs/header.html. Successfully opened file Downloads/asterisk-15.2.2/third-party/pjproject/source/pjlib/docs/footer.html. Successfully opened file Downloads/asterisk-15.2.2/third-party/pjproject/source/pjlib/docs/header.html. Successfully opened file Downloads/asterisk-15.2.2/static-http/mantest.html. Successfully opened file Downloads/asterisk-15.2.2/static-http/ajamdemo.html. Successfully opened file Downloads/skyradar-gui/ui/index.html.
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Reply
#4
from glob import glob

htmls = glob('*.htm*', recursive=True)
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Organization of project directories wotoko 3 434 Mar-02-2024, 03:34 PM
Last Post: Larz60+
  uploading files from a ubuntu local directory to Minio storage container dchilambo 0 461 Dec-22-2023, 07:17 AM
Last Post: dchilambo
  Listing directories (as a text file) kiwi99 1 841 Feb-17-2023, 12:58 PM
Last Post: Larz60+
  Find duplicate files in multiple directories Pavel_47 9 3,125 Dec-27-2022, 04:47 PM
Last Post: deanhystad
  Tkinterweb (Browser Module) Appending/Adding Additional HTML to a HTML Table Row AaronCatolico1 0 931 Dec-25-2022, 06:28 PM
Last Post: AaronCatolico1
  fetching exit status hangs in paramiko saisankalpj 3 1,179 Dec-04-2022, 12:21 AM
Last Post: nilamo
  rename same file names in different directories elnk 0 715 Nov-04-2022, 05:23 PM
Last Post: elnk
  Fetching the port number using asyncio gary 0 948 Nov-01-2022, 02:53 AM
Last Post: gary
  I need to copy all the directories that do not match the pattern tester_V 7 2,444 Feb-04-2022, 06:26 PM
Last Post: tester_V
  Functions to consider for file renaming and moving around directories cubangt 2 1,760 Jan-07-2022, 02:16 PM
Last Post: cubangt

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020