Posts: 3
Threads: 1
Joined: Jul 2020
Jul-12-2020, 07:03 AM
(This post was last modified: Jul-12-2020, 07:03 AM by 12237ee1.)
I have a large text file that contains a lot of links. I need a Python script that extracts the names of all files ending in .pdf, sorted and without duplicates.
Example:
Quote:http://www.123.com/file.pdf http://www.123.com/pdfhello
http://www.456.com/hello/one.file.pdf http://www.123.com
http://www.456.com/hello/one.file.pdf
I need the final result to look like this:
file.pdf
one.file.pdf
Posts: 2,121
Threads: 10
Joined: May 2017
Jul-12-2020, 10:11 AM
(This post was last modified: Jul-12-2020, 10:11 AM by DeaD_EyE.)
Iterate over the open file object to get it line by line.
Instead of a regex you could use str.rsplit() or str.rpartition().
# file holds the path to your text file
pdf_files = set()
# https://docs.python.org/3/library/stdtypes.html#set
with open(file) as fd:
    for line in fd:
        url, last = line.strip().rsplit("/", 1)
        # https://docs.python.org/3/library/stdtypes.html#str.rsplit
        print(last)
        # then check if last.endswith(".pdf")
        # if true, add it to the set
# Now sort pdf_files with sorted
# https://docs.python.org/3/library/functions.html#sorted
sorted_pdfs = sorted(pdf_files)
PS: If one line contains more than one address, you can split them with the split() method.
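A small sketch of how the two suggested methods differ, using one of the example URLs from the first post. The variable names are only illustrative, and the empty string stands in for a blank line in the text file.

```python
url = "http://www.456.com/hello/one.file.pdf"

# rsplit("/", 1) splits once, starting from the right, and returns a list.
parts = url.rsplit("/", 1)
print(parts)  # ['http://www.456.com/hello', 'one.file.pdf']

# rpartition("/") always returns a 3-tuple: (head, separator, tail).
triple = url.rpartition("/")
print(triple)  # ('http://www.456.com/hello', '/', 'one.file.pdf')

# Text without a "/" (here: an empty line) gives a one-item list with
# rsplit, so unpacking it into two names raises ValueError ...
try:
    head, last = "".rsplit("/", 1)
except ValueError:
    last = None

# ... while rpartition still returns three items.
head, sep, tail = "".rpartition("/")
```

Because of this, rpartition() is the more forgiving choice when the file may contain blank lines or text without any slash.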
Posts: 2,168
Threads: 35
Joined: Sep 2016
Using pathlib to get the suffix and the name, and a set to avoid repeated results:
import pathlib

paths = ('http://www.123.com/file.pdf', 'http://www.123.com/pdfhello',
         'http://www.456.com/hello/one.file.pdf', 'http://www.123.com',
         'http://www.456.com/hello/one.file.pdf')
result_set = set()
for path in paths:
    path = pathlib.Path(path)
    if path.suffix == '.pdf':
        result_set.add(path.name)
results = sorted(result_set)
print(results)
Output: ['file.pdf', 'one.file.pdf']
Posts: 2,121
Threads: 10
Joined: May 2017
Jul-12-2020, 11:41 AM
(This post was last modified: Jul-12-2020, 11:41 AM by DeaD_EyE.)
Pathlib does not handle URLs correctly. You can deconstruct a URL with urllib.parse.urlparse and construct a URL with urllib.parse.urlunparse. Pathlib can handle the ParseResult.path returned by urlparse().
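A short sketch of what goes wrong, and of the urlparse()/urlunparse() round trip. The "?page=2" query string is a made-up addition to one of the example URLs, and PurePosixPath is used instead of Path only so that the output is the same on every operating system.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse, urlunparse

# Hypothetical URL: an example from this thread plus an invented query string.
url = "http://www.456.com/hello/one.file.pdf?page=2"

# pathlib treats the URL as a file path: it collapses "//" into "/"
# and counts the query string as part of the suffix.
as_path = PurePosixPath(url)
print(as_path)         # http:/www.456.com/hello/one.file.pdf?page=2
print(as_path.suffix)  # .pdf?page=2

# urlparse() separates the components first ...
parts = urlparse(url)
print(parts.netloc)  # www.456.com
print(parts.path)    # /hello/one.file.pdf
print(parts.query)   # page=2

# ... so pathlib only sees the path component.
name = PurePosixPath(parts.path).name
print(name)  # one.file.pdf

# urlunparse() builds the URL again from the six components.
rebuilt = urlunparse(parts)
print(rebuilt == url)  # True
```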
Helper functions without pathlib and urllib:
def get_file(url):
    # Everything after the last "/"
    return url.rpartition("/")[2]

def is_pdf(file):
    # Everything after the last "." must be "pdf"
    return file.rpartition(".")[2] == "pdf"

paths = ('http://www.123.com/file.pdf', 'http://www.123.com/pdfhello',
         'http://www.456.com/hello/one.file.pdf', 'http://www.123.com',
         'http://www.456.com/hello/one.file.pdf')
for url in paths:
    file = get_file(url)
    if is_pdf(file):
        print(file)
Doing the same while preserving the absolute path, with urllib and pathlib:
from pathlib import Path
from urllib.parse import urlparse

def convert_to_paths(urls):
    for url in urls:
        path = Path(urlparse(url).path)
        if path.suffix == ".pdf":
            yield path

urls = ('http://www.123.com/file.pdf', 'http://www.123.com/pdfhello',
        'http://www.456.com/hello/one.file.pdf', 'http://www.123.com',
        'http://www.456.com/hello/one.file.pdf')
paths = list(convert_to_paths(urls))
paths_filenames = [file.name for file in paths]
paths_filenames_stem = [file.stem for file in paths]
print(paths, paths_filenames, paths_filenames_stem, sep="\n\n")
Output:
[PosixPath('/file.pdf'), PosixPath('/hello/one.file.pdf'), PosixPath('/hello/one.file.pdf')]

['file.pdf', 'one.file.pdf', 'one.file.pdf']

['file', 'one.file', 'one.file']
Depending on what you want to do with your data later, you can decide whether or not to use Path and urlparse.
The function urlparse() can also handle relative URLs.
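A brief sketch of the relative-URL case. The relative URL is derived from the thread's examples, and the use of urljoin() is an addition that was not mentioned in the post.

```python
from urllib.parse import urljoin, urlparse

# A relative URL has no scheme and no host, only a path.
relative = "hello/one.file.pdf"
parts = urlparse(relative)
print(repr(parts.scheme))  # ''
print(repr(parts.netloc))  # ''
print(parts.path)          # hello/one.file.pdf

# urljoin() resolves a relative URL against a base URL.
absolute = urljoin("http://www.456.com/", relative)
print(absolute)  # http://www.456.com/hello/one.file.pdf
```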
Posts: 3
Threads: 1
Joined: Jul 2020
Jul-13-2020, 02:58 PM
(This post was last modified: Jul-13-2020, 02:59 PM by 12237ee1.)
(Jul-12-2020, 10:11 AM)DeaD_EyE Wrote: Iterate over the open file, then you get line by line.
I am using Linux, so where should I put the path of the file in the script?
I tried it like this:
a = "/home/file.txt"
pdf_files = set(a)
but it didn't work. Also, I need the result without brackets, with only one file name on each line.
Thank you.
Posts: 2,121
Threads: 10
Joined: May 2017
(Jul-13-2020, 02:58 PM)12237ee1 Wrote: I am using linux so where should I put the path of the file in the script
Usually you save your data in your home directory.
If your username is 12237ee1, then the path to your home directory is /home/12237ee1/. Put your file into your home directory. When you open a terminal window, you're already logged in as your user. Use pwd to show the path of the directory you are in.
Applying set() to a regular str does not do what you want:
In [1]: set("/home/file.txt")
Out[1]: {'.', '/', 'e', 'f', 'h', 'i', 'l', 'm', 'o', 't', 'x'}
You need an iterable of lines, which a set can then consume.
with open("/home/12237ee1/file.txt") as fd:
    unique_lines = set(fd)
fd is an iterator that yields lines (including the line ending).
set() takes those elements. Identical lines are removed and the order is not preserved.
But these are the basics. You need to know what a file object is, what iterables are, and what the different data types do with them.
Otherwise, you won't understand how Python works. What you are doing is what I call brute-force programming.
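A small sketch of the points above, including the "without brackets" part of the question. io.StringIO stands in for the opened file so that the example runs without a real file, and the three lines of text are invented sample data.

```python
import io

# Stand-in for open("/home/12237ee1/file.txt"); the content is made up.
fd = io.StringIO("b.pdf\na.pdf\nb.pdf\n")

# Strip the line ending from every line before it goes into the set.
unique_lines = {line.strip() for line in fd}

# sorted() returns a list; printing the list itself shows the brackets.
names = sorted(unique_lines)
print(names)  # ['a.pdf', 'b.pdf']

# Printing the items one by one gives one name per line, no brackets.
for name in names:
    print(name)

# The same output with a single print() call.
joined = "\n".join(names)
print(joined)
```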
Posts: 3
Threads: 1
Joined: Jul 2020
(Jul-13-2020, 04:10 PM)DeaD_EyE Wrote: (Jul-13-2020, 02:58 PM)12237ee1 Wrote: I am using linux so where should I put the path of the file in the script
How do I do it with split()? Yes, some lines contain more than one link. In other words, how do I print only the name between the last "/" and ".pdf", including the ".pdf"?
If you could put the whole script in one box, I would understand it better. Thank you again.
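A minimal sketch of one complete script that combines the hints from this thread: split() for several links per line, rpartition() for the file name, a set against duplicates, and sorted() for the order. The function name extract_pdf_names is made up for this example, and the check is case-sensitive, so a name ending in ".PDF" would be skipped.

```python
def extract_pdf_names(lines):
    """Return the sorted, unique file names that end with .pdf."""
    pdf_files = set()
    for line in lines:
        # split() without arguments splits on any whitespace, so several
        # links on one line are handled and the line ending is dropped.
        for url in line.split():
            # Everything after the last "/" is the file name.
            name = url.rpartition("/")[2]
            if name.endswith(".pdf"):
                pdf_files.add(name)
    return sorted(pdf_files)


# Example lines from the first post.
example = [
    "http://www.123.com/file.pdf http://www.123.com/pdfhello\n",
    "http://www.456.com/hello/one.file.pdf http://www.123.com\n",
    "http://www.456.com/hello/one.file.pdf\n",
]
names = extract_pdf_names(example)
for name in names:
    print(name)
# file.pdf
# one.file.pdf

# With the real file, pass the open file object instead of the list:
# with open("/home/12237ee1/file.txt") as fd:
#     for name in extract_pdf_names(fd):
#         print(name)
```

The function accepts any iterable of lines, which is why both the example list and an open file object work.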