Python Forum
Listing files with glob.
#8
Here is an example of how a program that starts with 2 lines grows once you say: ah, let's improve it, enhance it, ...

But the easiest approach is: First filter coarsely, then filter finely.

Use Path.glob("*.txt") or glob.glob("*.txt") to filter coarsely.
Use Path.rglob("*.txt") or glob.glob("**/*.txt", recursive=True) to filter coarsely and recursively. Without the ** in the pattern, recursive=True has no effect.

Path.glob and Path.rglob return a generator, which yields pathlib.Path objects.
glob.glob returns a list of str. It's not a generator, just a normal function.

The iterator variant in the glob module is glob.iglob.
I am not sure how much of fnmatch they implemented, but glob patterns are often not powerful enough.
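To make the difference between the return types visible, here is a small sketch. The temporary directory and the file names a.txt, b.log and sub/c.txt are made up for the demonstration; only the glob calls themselves come from the text above.

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    # Hypothetical sample files, created only for this demonstration
    for name in ("a.txt", "b.log", "sub/c.txt"):
        (root / name).touch()

    # pathlib: lazy generators which yield Path objects
    flat = sorted(p.name for p in root.glob("*.txt"))
    deep = sorted(p.name for p in root.rglob("*.txt"))

    # glob module: glob.glob returns a list of str,
    # glob.iglob returns an iterator of str
    flat_str = glob.glob(str(root / "*.txt"))
    deep_str = sorted(glob.iglob(str(root / "**" / "*.txt"), recursive=True))

print(flat)            # ['a.txt']
print(deep)            # ['a.txt', 'c.txt']
print(type(flat_str))  # <class 'list'>
print(len(deep_str))   # 2
```

The results are sorted here because the order in which glob returns entries depends on the file system.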

Then use a regex to filter each result finely.
Visit https://regex101.com/ to test your regex against a string or multiline string.
There is also an explanation of what the individual characters do.
For example, \w matches a-zA-Z0-9_ (for str patterns in Python 3 it also matches other Unicode word characters) and \w+ matches one or more from this set.
\d matches the digits 0-9 and \d{6} matches exactly 6 digits.
Putting parentheses around a part of the pattern captures it as a group.

Using the regex (\w+)(\d{6}) will also match: anything99999999123456.txt
The first group will be: anything99999999
The second group will be: 123456
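A quick check of this behaviour with the re module. The variable names are only for illustration:

```python
import re

# The loose pattern from above: \w also matches digits, so the first
# group swallows every digit except the last six.
loose = re.compile(r"(\w+)(\d{6})")

match = loose.search("anything99999999123456.txt")
print(match.group(1))  # anything99999999
print(match.group(2))  # 123456
```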

If you want to prevent this, you can use a range of characters in square brackets. Here too, the + after the bracket means that it matches one or more of these characters. The ^ at the start marks the start of the str. This prevents the regex from sliding to the right until it matches.
The $ marks the end of the str. This prevents anything from following the 6 digits.
The regex: ^([a-zA-Z]+)(\d{6})$
The + must be inside the parentheses. With ([a-zA-Z])+ the first group captures only the last letter.
The \.txt is not included. Instead, you can match against Path.stem, which is the file name without its last suffix, as a str.
glob.glob returns plain str, so there is no stem attribute and you have to split off the suffix yourself.
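Here is a small sketch of the strict pattern applied to Path.stem. The file names are invented for the demonstration. It also shows why the position of the + relative to the parentheses matters: a repeated group keeps only its last repetition.

```python
import re
from pathlib import Path

# + inside the group: group 1 captures the whole name
strict = re.compile(r"^([a-zA-Z]+)(\d{6})$")
# + outside the group: group 1 keeps only the last repetition
last_letter_only = re.compile(r"^([a-zA-Z])+(\d{6})$")

# Hypothetical file names
names = ["dog000001.txt", "anything99999999123456.txt", "mouse123456.txt"]

matched = {}
for name in names:
    stem = Path(name).stem  # file name without the last suffix
    if m := strict.search(stem):
        matched[name] = (m.group(1), int(m.group(2)))

print(matched)
# {'dog000001.txt': ('dog', 1), 'mouse123456.txt': ('mouse', 123456)}

print(last_letter_only.search("dog000001").group(1))  # g
```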

The classic, safest way to split the last suffix from a path with the low-level API:
import os

path, file = os.path.split("/home/deadeye/file.txt.foo.bar.gz.py")
name, suffix = os.path.splitext(file)
print(path)    # /home/deadeye
print(file)    # file.txt.foo.bar.gz.py
print(name)    # file.txt.foo.bar.gz
print(suffix)  # .py
And with pathlib:
from pathlib import Path


my_path = Path("/home/deadeye/file.txt.foo.bar.gz.py")
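print(my_path.name)      # file.txt.foo.bar.gz.py
print(my_path.suffix)    # .py
print(my_path.suffixes)  # ['.txt', '.foo', '.bar', '.gz', '.py']
print(my_path.parent)    # /home/deadeye (on POSIX systems)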
A pattern like this:
\w+.txt
also matches some123Word&txt, because an unescaped . matches any character.
To prevent this, the . must be escaped: \.
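The effect of the unescaped dot is easy to verify. This is a minimal check with names chosen only for illustration:

```python
import re

unescaped = re.compile(r"\w+.txt")   # . matches any character
escaped = re.compile(r"\w+\.txt")    # \. matches only a literal dot

hit_unescaped = unescaped.fullmatch("some123Word&txt")
hit_escaped = escaped.fullmatch("some123Word&txt")
hit_real_file = escaped.fullmatch("some123Word.txt")

print(bool(hit_unescaped))  # True
print(bool(hit_escaped))    # False
print(bool(hit_real_file))  # True
```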

But the simple rule is: use glob to filter by extension and use a regex for complex tasks. Constructs like [0-9][0-9][0-9][0-9][0-9] in a glob pattern won't help you.
The wildcard next to them matches anything, which makes the pattern as weak as the loose regex. A bad regex often leads to security issues.
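A sketch of why the digit classes do not help once a wildcard is involved. The glob pattern below (with six digit classes) and the file names are assumptions for the demonstration. fnmatch.fnmatchcase is used so that the result does not depend on the operating system.

```python
import fnmatch
import re

# Assumed glob pattern: anything, then six digits, then .txt
glob_pattern = "*[0-9][0-9][0-9][0-9][0-9][0-9].txt"
strict = re.compile(r"^([a-zA-Z]+)(\d{6})\.txt$")

# Hypothetical file names
names = ["dog000001.txt", "anything99999999123456.txt", "dog&000001.txt"]

by_glob = [n for n in names if fnmatch.fnmatchcase(n, glob_pattern)]
by_regex = [n for n in names if strict.search(n)]

print(by_glob)   # all three names, the * swallows everything before the digits
print(by_regex)  # ['dog000001.txt']
```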


Here is an example. Hopefully, the regex is right now.
You don't need the type annotations to understand the code.
The type annotations have no influence at runtime, but they can help linters or IDEs to display problems with types.
Python is strongly typed. Knowing the argument types a function takes and its return type is half the work.

#!/usr/bin/env python3

"""
This program finds files matching the following pattern:
    > dog000001.txt
    > cat000054.txt
    > lion010101.txt
    > mouse123456.txt
Visit for more info: https://python-forum.io/Thread-Listing-files-with-glob
The files are returned in sorted order by name and then by integer.
By the way, it's totally overengineered and you should not use this.
"""

import re
import sys
import json
from argparse import ArgumentParser, Namespace
from pathlib import Path
from typing import Union, Optional, Generator, Callable, List, Tuple, Any


PATH = Union[str, Path]
PATH_GLOB = Generator[Path, None, None]
FIND_RESULT = Tuple[str, int, Path]
FIND_RESULTS = List[FIND_RESULT]
SORT_FUNC = Optional[Callable[[FIND_RESULT], Tuple[Any, ...]]]

# The + is inside the first group, so group 1 captures the whole name
REGEX = re.compile(r"^([a-zA-Z]+)(\d{6})$")


def get_txt(root: PATH) -> PATH_GLOB:
    """
    :param root: Path where to find the txt files
    
    :return: A generator which yields Path objects
    """
    yield from Path(root).glob("*.txt")


def find(root: PATH, sort_func: SORT_FUNC = None) -> FIND_RESULTS:
    """
    :param root: Path where to find the txt files
    :param sort_func: Sort Function which takes a FIND_RESULT
    :return: A sorted list of FIND_RESULT tuples (name, number, path)
    """
    regex = REGEX
    files: FIND_RESULTS = []
    for file in get_txt(root):
        if match := regex.search(file.stem):
            name = match.group(1)
            number = int(match.group(2))
            files.append((name, number, file))
    files.sort(key=sort_func)
    return files


def by_number(result: FIND_RESULT) -> Tuple[int, str]:
    """
    Key function to sort results by number and then by name
    :param result:
    :return:
    """
    return result[1], result[0]


def print_results(results: FIND_RESULTS, count: bool = False) -> None:
    """
    Print results to stdout
    :param results: The results from find
    :param count: Print the count of matches at the end
    :return:
    """
    for *_, file in results:
        print(file)
    if count:
        print(f"{len(results)} matching files found.")


# def print_json_stream(results: FIND_RESULTS) -> None:
#     """
#     DoNotUseThisCode
#     otherwise implement escaping
#     """
#     count = len(results)
#     header = f'{{"count":{count},"results":['
#     print(header, end="", flush=True)
#     for index, (*_, file) in enumerate(results, 1):
#         time.sleep(2)
#         print(f'"{file}"', flush=True, end="")
#         if index < count:
#             print(",", flush=True, end="")
#     print("]}")


def print_json(results: FIND_RESULTS) -> None:
    """
    Print the results as json to stdout
    :param results:
    :return:
    """
    files = [str(file) for *_, file in results]
    result = {
        "count": len(files),
        "results": files,
    }
    print(json.dumps(result))


def get_args() -> Namespace:
    """
    Get arguments
    :return: parsed arguments
    """
    desc = f"""
        This example program finds txt files
        and uses the internal specified regex to match
        them.
        The following regex is used: {REGEX.pattern}
    """
    parser = ArgumentParser(description=desc)
    # noinspection PyTypeChecker
    parser.add_argument("root", type=Path, help="Root path where to search the files.")
    parser.add_argument(
        "-n",
        action="store_const",
        const=by_number,
        default=None,
        help="Sort first by numeric value and then by name.",
    )
    parser.add_argument(
        "-c", action="store_true", help="Show at the end the file count."
    )
    parser.add_argument("-j", action="store_true", help="Json stream output")
    arguments = parser.parse_args()
    if not arguments.root.exists():
        raise FileNotFoundError("Path does not exist.")
    if not arguments.root.is_dir():
        raise FileNotFoundError("Path is not a directory.")
    return arguments


if __name__ == "__main__":
    try:
        args = get_args()
    except FileNotFoundError as e:
        print(e, file=sys.stderr)
        sys.exit(1)
    if args.j:
        print_json(find(args.root, args.n))
    else:
        print_results(results=find(args.root, args.n), count=args.c)


Messages In This Thread
Listing files with glob. - by MathCommander - Aug-10-2020, 02:47 PM
RE: Listing files with glob. - by deanhystad - Aug-10-2020, 02:55 PM
RE: Listing files with glob. - by MathCommander - Aug-10-2020, 03:39 PM
RE: Listing files with glob. - by deanhystad - Aug-10-2020, 04:43 PM
RE: Listing files with glob. - by MathCommander - Aug-10-2020, 04:59 PM
RE: Listing files with glob. - by bowlofred - Aug-10-2020, 06:22 PM
RE: Listing files with glob. - by deanhystad - Aug-10-2020, 06:44 PM
RE: Listing files with glob. - by DeaD_EyE - Aug-10-2020, 11:05 PM
RE: Listing files with glob. - by snippsat - Aug-12-2020, 08:20 AM
RE: Listing files with glob. - by MathCommander - Oct-26-2020, 02:04 AM
