Get the filename from a path

12237ee1 · (This post was last modified: Jul-12-2020, 07:03 AM by 12237ee1.)

I have a large text file containing many links. I need a Python script that extracts the names of all files ending in .pdf and outputs them sorted, without duplicates.

Example:

http://www.123.com/file.pdf http://www.123.com/pdfhello
http://www.456.com/hello/one.file.pdf http://www.123.com
http://www.456.com/hello/one.file.pdf

I need the final result to look like this:

file.pdf
one.file.pdf

ndc85430 · Jul-12-2020, 07:08 AM

What have you tried?

DeaD_EyE · (This post was last modified: Jul-12-2020, 10:11 AM by DeaD_EyE.)

Iterate over the open file object to get its content line by line.
Instead of a regex, you could use str.rsplit() or str.rpartition().

pdf_files = set()
# https://docs.python.org/3/library/stdtypes.html#set

with open(file) as fd:  # file: the path to your text file
    for line in fd:  # iterate over the file object, not over the path
        # rpartition() always returns a 3-tuple, even if the line has no "/"
        # https://docs.python.org/3/library/stdtypes.html#str.rpartition
        last = line.strip().rpartition("/")[2]
        print(last)
        # then check if last.endswith(".pdf")
        # if true, add it to the set


# Now sort pdf_files with sorted
# https://docs.python.org/3/library/functions.html#sorted
sorted_pdfs = sorted(pdf_files)

PS: If a line contains more than one address, you can separate them with the split() method.
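A minimal sketch of that idea, assuming the addresses on a line are separated by whitespace. The sample text, the function name extract_pdf_names, the placeholder path in the comment, and the use of io.StringIO in place of a real file are all illustrative assumptions.

```python
import io

# Hypothetical sample data standing in for the real text file.
SAMPLE = """\
http://www.123.com/file.pdf http://www.123.com/pdfhello
http://www.456.com/hello/one.file.pdf http://www.123.com
http://www.456.com/hello/one.file.pdf
"""


def extract_pdf_names(lines):
    """Return the sorted, unique .pdf file names found in an iterable of lines."""
    names = set()
    for line in lines:
        # split() without arguments separates on any whitespace,
        # so several addresses on one line are handled.
        for url in line.split():
            # rpartition() never raises, even if the text contains no "/".
            name = url.rpartition("/")[2]
            if name.endswith(".pdf"):
                names.add(name)
    return sorted(names)


# With a real file (placeholder path):
#     with open("/path/to/links.txt") as fd:
#         pdf_names = extract_pdf_names(fd)
pdf_names = extract_pdf_names(io.StringIO(SAMPLE))

# One name per line, without list brackets.
print("\n".join(pdf_names))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly.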

Yoriz · Jul-12-2020, 10:33 AM

Using pathlib to get the suffix and name, and a set to avoid repeated results:

import pathlib

paths = ('http://www.123.com/file.pdf', 'http://www.123.com/pdfhello',
         'http://www.456.com/hello/one.file.pdf', 'http://www.123.com',
         'http://www.456.com/hello/one.file.pdf')

result_set = set()
for path in paths:
    path = pathlib.Path(path)
    if path.suffix == '.pdf':
        result_set.add(path.name)

results = sorted(result_set)
print(results)

Output:
['file.pdf', 'one.file.pdf']

DeaD_EyE · (This post was last modified: Jul-12-2020, 11:41 AM by DeaD_EyE.)

pathlib does not handle URLs correctly. You can deconstruct a URL with urllib.parse.urlparse and reassemble it with urllib.parse.urlunparse. pathlib can handle the ParseResult.path returned by urlparse().
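A small sketch of where the difference shows up, assuming a URL that carries a query string. The URL itself is a made-up example, and PurePosixPath is used here instead of Path so the result does not depend on the operating system.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse, urlunparse

# Hypothetical URL with a query string.
url = "http://www.123.com/file.pdf?download=1"

# pathlib treats the whole string as a file path,
# so the query string ends up inside the suffix.
naive_suffix = PurePosixPath(url).suffix

# urlparse() separates the components; only .path is handed to pathlib.
parts = urlparse(url)
clean = PurePosixPath(parts.path)

# urlunparse() reassembles the components into a URL again.
rebuilt = urlunparse(parts)

print(naive_suffix, clean.name, clean.suffix, rebuilt, sep="\n")
```

With the plain pathlib approach, a check such as suffix == ".pdf" would miss this file, while the urlparse() variant still finds it.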

Helper functions without pathlib and urllib:

def get_file(url):
    return url.rpartition("/")[2]

def is_pdf(file):
    # endswith() also rejects a name that is just "pdf" without a dot
    return file.endswith(".pdf")


paths = ('http://www.123.com/file.pdf', 'http://www.123.com/pdfhello',
         'http://www.456.com/hello/one.file.pdf', 'http://www.123.com',
         'http://www.456.com/hello/one.file.pdf')

for url in paths:
    file = get_file(url)
    if is_pdf(file):
        print(file)

Doing this while preserving the absolute path, using urllib and pathlib:

from pathlib import Path
from urllib.parse import urlparse


def convert_to_paths(urls):
    for url in urls:
        path = Path(urlparse(url).path)
        if path.suffix == ".pdf":
            yield path

urls = ('http://www.123.com/file.pdf', 'http://www.123.com/pdfhello',
        'http://www.456.com/hello/one.file.pdf', 'http://www.123.com',
        'http://www.456.com/hello/one.file.pdf')

paths = list(convert_to_paths(urls))
paths_filenames = [file.name for file in paths]
paths_filenames_stem = [file.stem for file in paths]

print(paths, paths_filenames, paths_filenames_stem, sep="\n\n")

Output:
[PosixPath('/file.pdf'), PosixPath('/hello/one.file.pdf'), PosixPath('/hello/one.file.pdf')]

['file.pdf', 'one.file.pdf', 'one.file.pdf']

['file', 'one.file', 'one.file']

Depending on what you want to do with your data later, you can decide whether to use Path and urlparse.
The function urlparse() can also handle relative URLs.
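As a short illustration of the relative case, here is a sketch with made-up relative URLs (no scheme, no host). PurePosixPath is used so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical relative URLs.
relative_urls = ["hello/one.file.pdf", "../docs/file.pdf?page=2", "pdfhello"]

names = []
for url in relative_urls:
    # For a relative URL, scheme and netloc are empty; the path is still parsed
    # and the query string is split off.
    parts = urlparse(url)
    path = PurePosixPath(parts.path)
    if path.suffix == ".pdf":
        names.append(path.name)

print(names)
```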

12237ee1 · (This post was last modified: Jul-13-2020, 02:59 PM by 12237ee1.)

I am using Linux, so where should I put the path of the file in the script?
I tried it like this:
a = "/home/file.txt"
pdf_files = set(a)
but it didn't work. Also, I need the result without brackets, with only one file name on each line.
Thank you.

DeaD_EyE · Jul-13-2020, 04:10 PM

(Jul-13-2020, 02:58 PM)12237ee1 Wrote: I am using linux so where should I put the path of the file in the script
I tried it like this
a= "/home/file.txt"
pdf_files = set(a)
but it didn't work .. also I need the result without brackets only the name of one file in each line
thank you

Usually you save your data in your home directory.

If your username is 12237ee1, then the path to your home directory is /home/12237ee1/. Put your file into your home directory. When you open a terminal window, you are already logged in as your user. Use pwd to show the directory you are currently in.

Applying set() to a plain str does not do what you want, because it splits the string into its characters:

In [1]: set("/home/file.txt")
Out[1]: {'.', '/', 'e', 'f', 'h', 'i', 'l', 'm', 'o', 't', 'x'}

You need the lines as an iterable, which a set can then consume.

with open("/home/12237ee1/file.txt") as fd:
    unique_lines = set(fd)

fd is an iterator that yields the lines (including their line endings).
set() consumes those elements. Identical lines are removed and the order is not preserved.
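To make the line endings visible, here is a small sketch in which io.StringIO stands in for the opened file (an assumption for this example, as is the sample content).

```python
import io

# io.StringIO behaves like an opened text file for iteration purposes.
fd = io.StringIO("b.pdf\na.pdf\nb.pdf\n")

# Iterating the file object yields each line including its "\n".
unique_lines = set(fd)

# Strip the line endings and sort to get a stable, readable order.
cleaned = sorted(line.strip() for line in unique_lines)
print(cleaned)
```

Stripping before deduplicating is the safer order in general: if the last line of a file has no trailing newline, "b.pdf" and "b.pdf\n" would otherwise count as two different elements.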

But these are the basics. You need to know what a file object is, what iterables are, and what the different data types do with them.
Otherwise, you will not understand how Python works. What you are doing is what I call brute-force programming.

12237ee1 · Jul-13-2020, 06:01 PM

How do I do it with split()? Yes, there is more than one link per line. In other words, how do I print only the name between the last "/" and ".pdf", including the ".pdf"?
If you could put the whole script in one box, I would understand it better. Thank you again.
