Python Forum
Get the filename from a path
#1
I have a large text file containing many links. I need a Python script that extracts the names of all files ending in .pdf, sorted and without repeated results.

Example:

Quote:
http://www.123.com/file.pdf http://www.123.com/pdfhello
http://www.456.com/hello/one.file.pdf http://www.123.com
http://www.456.com/hello/one.file.pdf

I need the final result to look like this:

file.pdf
one.file.pdf
Reply
#2
What have you tried?
Reply
#3
Iterate over the open file object to read it line by line.
Instead of a regex you could use str.rsplit() or str.rpartition().

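pdf_files = set()
# https://docs.python.org/3/library/stdtypes.html#set

file = "file.txt"  # placeholder: path to your text file with the links

with open(file) as fd:
    for line in fd:  # iterate over the file object, not the path string
        # [-1] takes the part after the last "/" and never raises,
        # even if the line contains no "/" at all
        last = line.strip().rsplit("/", 1)[-1]
        # https://docs.python.org/3/library/stdtypes.html#str.rsplit
        print(last)
        # then check if last.endswith(".pdf")
        # if true, add it to the set


# Now sort pdf_files with sorted
# https://docs.python.org/3/library/functions.html#sorted
sorted_pdfs = sorted(pdf_files)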
PS: If a line contains more than one address, you can split them with the split() method.
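As a rough sketch of how the pieces above might fit together when a line holds several addresses (the function name extract_pdf_names is made up for illustration, and the sample lines are the ones from the question):

```python
def extract_pdf_names(lines):
    """Return the sorted, de-duplicated PDF file names found in lines."""
    pdf_files = set()
    for line in lines:
        # split() with no argument splits on any whitespace, so a line
        # holding several addresses yields one item per address
        for url in line.split():
            last = url.rsplit("/", 1)[-1]
            if last.endswith(".pdf"):
                pdf_files.add(last)
    return sorted(pdf_files)


# Sample lines taken from the question
lines = [
    "http://www.123.com/file.pdf http://www.123.com/pdfhello\n",
    "http://www.456.com/hello/one.file.pdf http://www.123.com\n",
    "http://www.456.com/hello/one.file.pdf\n",
]

# Print one name per line instead of printing the list itself
for name in extract_pdf_names(lines):
    print(name)
```

Because the function only iterates over its argument, an open file object can be passed in place of the list, since iterating a file object yields its lines.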
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Reply
#4
Use pathlib to get the suffix and name, and a set to avoid repeated results:
import pathlib

paths = ('http://www.123.com/file.pdf', 'http://www.123.com/pdfhello',
         'http://www.456.com/hello/one.file.pdf', 'http://www.123.com',
         'http://www.456.com/hello/one.file.pdf')

result_set = set()
for path in paths:
    path = pathlib.Path(path)
    if path.suffix == '.pdf':
        result_set.add(path.name)

results = sorted(result_set)
print(results)
Output:
['file.pdf', 'one.file.pdf']
Reply
#5
Pathlib does not handle URLs correctly. You can deconstruct a URL with urllib.parse.urlparse and reconstruct it with urllib.parse.urlunparse. Pathlib can handle the ParseResult.path from urlparse().
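A small sketch of where this goes wrong, assuming a URL with a query string (that URL is made up for illustration, it is not from the thread). PurePosixPath is used instead of Path so the result is the same on every operating system:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical URL with a query string, not from the thread
url = "http://www.123.com/hello/one.file.pdf?download=1"

# Treating the whole URL as a path: the query string ends up in the suffix
naive = PurePosixPath(url)
print(naive.suffix)  # '.pdf?download=1', so a check for '.pdf' fails

# Treating a bare host as a path: the host name is taken for a file name
host_only = PurePosixPath("http://www.123.com")
print(host_only.name)  # 'www.123.com'

# Parsing first: only the path component is handed to pathlib
clean = PurePosixPath(urlparse(url).path)
print(clean.name, clean.suffix)  # one.file.pdf .pdf
```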

Helper functions without pathlib and urllib:
def get_file(url):
    return url.rpartition("/")[2]

def is_pdf(file):
    # endswith avoids a false positive for a name that is exactly "pdf"
    return file.endswith(".pdf")


paths = ('http://www.123.com/file.pdf', 'http://www.123.com/pdfhello',
         'http://www.456.com/hello/one.file.pdf', 'http://www.123.com',
         'http://www.456.com/hello/one.file.pdf')

for url in paths:
    file = get_file(url)
    if is_pdf(file):
        print(file)
Doing this while preserving the absolute path, with urllib and pathlib:
from pathlib import Path
from urllib.parse import urlparse


def convert_to_paths(urls):
    for url in urls:
        path = Path(urlparse(url).path)
        if path.suffix == ".pdf":
            yield path

urls = ('http://www.123.com/file.pdf', 'http://www.123.com/pdfhello',
        'http://www.456.com/hello/one.file.pdf', 'http://www.123.com',
        'http://www.456.com/hello/one.file.pdf')

paths = list(convert_to_paths(urls))
paths_filenames = [file.name for file in paths]
paths_filenames_stem = [file.stem for file in paths]

print(paths, paths_filenames, paths_filenames_stem, sep="\n\n")
Output:
[PosixPath('/file.pdf'), PosixPath('/hello/one.file.pdf'), PosixPath('/hello/one.file.pdf')]

['file.pdf', 'one.file.pdf', 'one.file.pdf']

['file', 'one.file', 'one.file']
Depending on what you want to do with your data later, you can decide whether or not to use Path and urlparse.
The function urlparse() can also handle relative URLs.
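For instance, a relative URL might be handled like this (the URL is a made-up example, not from the thread):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical relative URL, not from the thread
relative = "../hello/one.file.pdf?page=2"

parsed = urlparse(relative)
# No scheme or host is present, so both come back as empty strings
print(repr(parsed.scheme), repr(parsed.netloc))
# The path and the query are still separated
print(parsed.path)
print(PurePosixPath(parsed.path).name)
```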
Reply
#6
(Jul-12-2020, 10:11 AM)DeaD_EyE Wrote: Iterate over the open file, then you get line by line.
Instead of regex you could use str.rsplit() or str.rpartition().


I am using Linux, so where should I put the path of the file in the script?
I tried it like this:
a= "/home/file.txt"
pdf_files = set(a)
but it didn't work. Also, I need the result without brackets, with only one file name on each line.
Thank you.
Reply
#7
(Jul-13-2020, 02:58 PM)12237ee1 Wrote: I am using linux so where should I put the path of the file in the script

Usually you save your data in your home directory.

If your username is 12237ee1, then the path to your home directory is /home/12237ee1/. Put your file into your home directory. When you open a terminal window, you are already logged in as your user. Use pwd to show the directory you are currently in.

Applying set() to a regular str does not do what you want:
In [1]: set("/home/file.txt")
Out[1]: {'.', '/', 'e', 'f', 'h', 'i', 'l', 'm', 'o', 't', 'x'}
You need the lines as a sequence, and then you can consume them with a set.

with open("/home/12237ee1/file.txt") as fd:
    unique_lines = set(fd)
fd is an iterator, which yields lines (with line ending).
The set() takes those elements. Identical lines are removed and the order is not preserved.
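Since the lines still carry their line endings and a set has no order, a minimal sketch for getting clean, sorted lines printed without brackets could look like this. The function name unique_lines is hypothetical, and a temporary file stands in for the real one:

```python
import os
import tempfile


def unique_lines(path):
    """Return the sorted unique non-empty lines of a text file, without line endings."""
    with open(path) as fd:
        # strip() removes the line ending; blank lines are skipped
        return sorted({line.strip() for line in fd if line.strip()})


# Demo file with a repeated line; a temporary file stands in for the real one
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("b.pdf\na.pdf\nb.pdf\n")
    demo_path = tmp.name

names = unique_lines(demo_path)
os.remove(demo_path)

# print(names) would show the list with brackets;
# joining with newlines prints one name per line instead
print("\n".join(names))
```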

But these are the basics. You need to know what a file object is, what iterables are, and what the different data types do with them.
Otherwise, you will not understand how Python works. What you are doing is what I call brute-force programming.
Reply
#8
(Jul-13-2020, 04:10 PM)DeaD_EyE Wrote:

How do I do it with split()? Because yes, there is more than one link in one line. In other words, how do I print only the name between the last "/" and ".pdf", including the ".pdf"?
If you can put the whole script in one box, I can understand it better. Thank you again.
Reply

