pathlib does not handle URLs correctly. You can deconstruct a URL with the function urllib.parse.urlparse and construct a URL with urllib.parse.urlunparse. pathlib can handle the ParseResult.path returned by urlparse().

Helper functions without pathlib and urllib:

    def get_file(url):
        # Everything after the last "/" is treated as the file name.
        return url.rpartition("/")[2]


    def is_pdf(file):
        # Compare the text after the last "."; a bare name "pdf" without a dot would also match.
        return file.rpartition(".")[2] == "pdf"


    paths = (
        'http://www.123.com/file.pdf',
        'http://www.123.com/pdfhello',
        'http://www.456.com/hello/one.file.pdf',
        'http://www.123.com',
        'http://www.456.com/hello/one.file.pdf',
    )

    for url in paths:
        file = get_file(url)
        if is_pdf(file):
            print(file)

Doing the same with urllib and pathlib, preserving the absolute path:

    from pathlib import Path
    from urllib.parse import urlparse


    def convert_to_paths(urls):
        for url in urls:
            # Only the path component of the URL is handed to pathlib.
            path = Path(urlparse(url).path)
            if path.suffix == ".pdf":
                yield path


    urls = (
        'http://www.123.com/file.pdf',
        'http://www.123.com/pdfhello',
        'http://www.456.com/hello/one.file.pdf',
        'http://www.123.com',
        'http://www.456.com/hello/one.file.pdf',
    )

    paths = list(convert_to_paths(urls))
    paths_filenames = [file.name for file in paths]
    paths_filenames_stem = [file.stem for file in paths]

    print(paths, paths_filenames, paths_filenames_stem, sep="\n\n")

Output:

    [PosixPath('/file.pdf'), PosixPath('/hello/one.file.pdf'), PosixPath('/hello/one.file.pdf')]
    ['file.pdf', 'one.file.pdf', 'one.file.pdf']
    ['file', 'one.file', 'one.file']

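If the full URL is needed again after working on the path, the parts returned by urlparse() can be put back together with urlunparse(). The following is only a sketch: the helper name replace_suffix and the example URL with a query string are made up for illustration, and PurePosixPath is used instead of Path so the result does not depend on the operating system. The last line shows what happens when the whole URL is passed to pathlib.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse, urlunparse


def replace_suffix(url, new_suffix):
    """Return url with the suffix of its path replaced (hypothetical helper)."""
    parts = urlparse(url)
    # Only the path goes through pathlib; scheme, host and query stay untouched.
    new_path = PurePosixPath(parts.path).with_suffix(new_suffix)
    # ParseResult is a named tuple, so _replace() returns a modified copy.
    return urlunparse(parts._replace(path=str(new_path)))


url = "http://www.456.com/hello/one.file.pdf?page=2"  # example value
print(replace_suffix(url, ".txt"))

# Passing the whole URL to pathlib collapses the "//" after the scheme.
print(PurePosixPath(url))
```

The query string survives because it is kept in its own field of the ParseResult and never reaches pathlib.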
Depending on what you want to do with your data later, you can decide whether or not to use Path and urlparse. The function urlparse() can also handle relative URLs.
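A small illustration of the relative case, assuming the same ".pdf" check as above. The function name is_pdf_url and the example URLs are hypothetical and not part of the original data:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_pdf_url(url):
    """True if the path component of url ends in .pdf (hypothetical helper)."""
    return PurePosixPath(urlparse(url).path).suffix == ".pdf"


# Example values: relative URLs without scheme or host.
relative_urls = ("hello/one.file.pdf", "../docs/file.pdf?page=2", "/pdfhello")

for url in relative_urls:
    parts = urlparse(url)
    # scheme and netloc are empty strings for relative URLs; the query is split off the path.
    print(repr(parts.scheme), repr(parts.netloc), parts.path, is_pdf_url(url))
```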