Python Forum
List of pathlib.Paths Not Ordered As Same List of Same String Filenames
#1
While testing a module, I found some odd behaviour in the pathlib package. I have a list of pathlib.Paths and I call sorted() on it. I assumed that sorting a list of Paths would give the same order as sorting a list of their (string) filenames. But that is not the case.

Let me explain.

I have a list of filenames such as:

filenames_for_testing = (
    '/spam/spams.txt',
    '/spam/spam.txt',
    '/spam/another.txt',
    '/spam/binary.bin',
    '/spam/spams/spam.ttt',
    '/spam/spams/spam01.txt',
    '/spam/spams/spam02.txt',
    '/spam/spams/spam03.ppp',
    '/spam/spams/spam04.doc',
)
If I run the following:

sorted_filenames = sorted(filenames_for_testing)
print()
for element in sorted_filenames:
    print(element)
print()
the alphabetical (string) order of this list will be:
  • /spam/another.txt
  • /spam/binary.bin
  • /spam/spam.txt
  • /spam/spams.txt
  • /spam/spams/spam.ttt
  • /spam/spams/spam01.txt
  • /spam/spams/spam02.txt
  • /spam/spams/spam03.ppp
  • /spam/spams/spam04.doc

But when I try to order the same list as pathlib.Paths using:

from pathlib import Path

paths_for_testing = [
    Path(filename)
    for filename in filenames_for_testing
]
sorted_paths = sorted(paths_for_testing)
The list returned is (just showing filenames of the pathlib.Paths):
  • /spam/another.txt
  • /spam/binary.bin
  • /spam/spam.txt
  • /spam/spams/spam.ttt
  • /spam/spams/spam01.txt
  • /spam/spams/spam02.txt
  • /spam/spams/spam03.ppp
  • /spam/spams/spam04.doc
  • /spam/spams.txt

which is different from the previous list, because '/spam/spams.txt' does not come after '/spam/spam.txt' and before all the '/spam/spams/*' files (instead, it goes to the end of the list).

You can check it using:

sorted_filenames == [str(path) for path in sorted_paths]
which returns False.

I am not sure whether this is a bug. Maybe it is intended. However, I think it is odd behaviour. Unless I am missing something, I can hardly understand why a list of pathlib.Paths and a list of the same string filenames are not ordered in the same way.

A crafted script to test this:
from pathlib import Path

# order string filenames

filenames_for_testing = (
    '/spam/spams.txt',
    '/spam/spam.txt',
    '/spam/another.txt',
    '/spam/binary.bin',
    '/spam/spams/spam.ttt',
    '/spam/spams/spam01.txt',
    '/spam/spams/spam02.txt',
    '/spam/spams/spam03.ppp',
    '/spam/spams/spam04.doc',
)

sorted_filenames = sorted(filenames_for_testing)

# output ordered list of string filenames

print()
print("Ordered list of string filenames:")
print()
for element in sorted_filenames:
    print(f'\t{element}')
print()

# order paths (build from same string filenames)

paths_for_testing = [
    Path(filename)
    for filename in filenames_for_testing
]
sorted_paths = sorted(paths_for_testing)

# output ordered list of pathlib.Paths

print()
print("Ordered list of pathlib.Paths:")
print()
for element in sorted_paths:
    print(f'\t{element}')
print()

# compare

print()

if sorted_filenames == [str(path) for path in sorted_paths]:
    print('Ordered lists of string filenames and pathlib.Paths are EQUAL.')
    
else:
    print('Ordered lists of string filenames and pathlib.Paths are DIFFERENT.')

    for element in range(0, len(sorted_filenames)):
        
        if sorted_filenames[element] != str(sorted_paths[element]):
            
            print()
            print('First different element:')
            print(f'\tElement #{element}')
            print(f'\t{sorted_filenames[element]} != {sorted_paths[element]}')
            break

print()
I am running Python 3.6.3 on macOS 10.12.6.

Thanks.
Reply
#2
Path(filename) is an object and not a string.
If you take the string instead, the sort is identical.

paths_for_testing = [
    str(Path(filename))      # <== take str(object)
    for filename in filenames_for_testing
]
Windows 7 with Python 3.6.2:

Output:
Ordered list of string filenames:
    /spam/another.txt
    /spam/binary.bin
    /spam/spam.txt
    /spam/spams.txt
    /spam/spams/spam.ttt
    /spam/spams/spam01.txt
    /spam/spams/spam02.txt
    /spam/spams/spam03.ppp
    /spam/spams/spam04.doc
Ordered list of pathlib.Paths:
    \spam\another.txt
    \spam\binary.bin
    \spam\spam.txt
    \spam\spams.txt
    \spam\spams\spam.ttt
    \spam\spams\spam01.txt
    \spam\spams\spam02.txt
    \spam\spams\spam03.ppp
    \spam\spams\spam04.doc
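If the Path objects are still needed after sorting, one possible alternative to building a list of strings is to pass key=str to sorted(), which orders by the string form while keeping the objects. A minimal sketch, assuming PurePosixPath so that the separators are the same on every platform, and using a shortened list of the filenames from the first post:

```python
from pathlib import PurePosixPath

filenames = [
    '/spam/spams.txt',
    '/spam/spam.txt',
    '/spam/spams/spam.ttt',
]
paths = [PurePosixPath(name) for name in filenames]

# Default ordering of path objects (the order the first post observed).
by_default = sorted(paths)

# key=str orders by the string form but still returns path objects.
by_string = sorted(paths, key=str)

print([str(p) for p in by_default])
print([str(p) for p in by_string])
```

With key=str the result matches sorted(filenames), while the default ordering places '/spam/spams.txt' last.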
Reply
#3
Of course, pathlib.Path is an object, not a string.

But pathlib.Paths are (or should be) compared using their filenames (which are strings). (Roughly speaking: in reality, pathlib.Paths are compared in some other way that I do not really understand from reading Python's source code.)

So I think ordering (or comparing) pathlib.Paths and ordering (or comparing) their string filenames should render the same result, not a different one.
Reply
#4
You should look at: https://pymotw.com/3/pathlib/
A pathlib path is much easier to construct if the path nodes are contained
within a list:
from pathlib import Path

mylocation  = ['..', 'data', 'fipsCodes', 'GNIScodesForNamedPopulatedPlaces-etc', 'CountryNames', 'geonames_20171023', 'Countries.txt']

home = Path('.')
print('\n-- home --')
print(f'{home}')

print(f'{home.name}')
print(f'{home.resolve()}')

print('\n-- mydatapath --')
mydatapath = home.joinpath(*mylocation)
print(f'(\n{mydatapath}')
print(f'({mydatapath.name}')
print(f'{mydatapath.resolve()}')

print('\n-- newdatapath --')
# you can also create a path like
newdatapath = home / 'data'
print(f'\n{newdatapath}')
print(f'{newdatapath.name}')
print(f'{newdatapath.resolve()}')

print('\n-- filelist --')
filelist = [x.name for x in newdatapath.iterdir() if x.is_file()]
print(f'\n{filelist}')

print('\n-- opening files --')
fips_text_file = newdatapath / 'fips.txt'

with fips_text_file.open() as f:
    count = 0
    for line in f:
        line = line.strip()
        count += 1
        print(line)
        if count > 10:
            break
results (part of resolved path removed for security, replaced with ...):
Output:
-- home --
.
... \Tiger\src
-- mydatapath --
(
..\data\fipsCodes\GNIScodesForNamedPopulatedPlaces-etc\CountryNames\geonames_20171023\Countries.txt
(Countries.txt
... \Tiger\data\fipsCodes\GNIScodesForNamedPopulatedPlaces-etc\CountryNames\geonames_20171023\Countries.txt
-- newdatapath --
data
data
... \Tiger\src\data
-- filelist --
['fips.json', 'fips.txt', 'fipsdata.db', 'fipsdataBackup.db', 'FIPSFormat.json', 'FIPSFormat.txt', 'GNIS_CountryFormat.json', 'GNIS_CountryFormat.txt', 'GNIS_DomesticFormat.json', 'GNIS_DomesticFormat.txt']
-- opening files --
{
"AmericanIndianAreas": {
"data": {
"0010": [
"0010",
"Acoma Pueblo and Off-Reservation Trust Land"
],
"0020": [
"0020",
"Agua Caliente Indian Reservation and Off-Reservation Trust Land"
],
Reply
#5
Sorry, Larz60+, but I cannot see what the relationship is between your answer and my question. Am I missing something? (I am asking this with all due respect, of course. Just trying to learn.)
Reply
#6
My theory is:
Internally the path is not saved as a string but as a list of path components, as Larz60+ mentioned.
So when you sort Path objects, the lists are compared internally and not the strings.
This is faster than converting the list to a string each time.
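This theory can be checked with a short sketch. It uses PurePosixPath (an assumption, so the example behaves the same on any platform) and the public parts attribute; the private attributes that pathlib actually compares differ between Python versions, so this only illustrates the idea. As plain strings, the two names first differ where '.' meets '/', and '.' sorts first. As component tuples, they first differ where 'spams.txt' meets 'spams', and the shorter 'spams' sorts first.

```python
from pathlib import PurePosixPath

file_name = '/spam/spams.txt'
nested_name = '/spam/spams/spam.ttt'

# String comparison: the first differing characters are '.' and '/'.
print(ord('.'), ord('/'))        # 46 47
print(file_name < nested_name)   # True

file_path = PurePosixPath(file_name)
nested_path = PurePosixPath(nested_name)

# Component comparison: 'spams.txt' is compared with 'spams'.
print(file_path.parts)    # ('/', 'spam', 'spams.txt')
print(nested_path.parts)  # ('/', 'spam', 'spams', 'spam.ttt')
print(file_path.parts < nested_path.parts)  # False
print(file_path < nested_path)              # False, same as the tuples
```

Seen this way, the path ordering keeps everything inside /spam/spams/ ahead of the sibling file /spam/spams.txt, which is the order reported in the first post.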
Reply
#7
If the items in the tuple are truly pathlib objects, they must be resolved before sorting;
otherwise you are sorting the object addresses.
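A quick check, offered as a sketch rather than a definitive answer, suggests that resolving is not required for sorting. Path objects define their own ordering methods, and Python 3 does not fall back to comparing object addresses: sorting instances of a class that has no ordering methods raises TypeError. Plain is a hypothetical class used only for contrast, and PurePosixPath is an assumption made so the example never touches the filesystem.

```python
from pathlib import PurePosixPath


class Plain:
    """Hypothetical class with no ordering methods, for contrast only."""


paths = [PurePosixPath('/spam/b.txt'), PurePosixPath('/spam/a.txt')]

# Paths define __lt__ themselves, so sorting needs no resolve() call.
sorted_names = [p.name for p in sorted(paths)]

# Objects without ordering methods cannot be sorted at all in Python 3.
try:
    sorted([Plain(), Plain()])
    plain_sortable = True
except TypeError:
    plain_sortable = False

print(sorted_names, plain_sortable)
```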
Reply
#8
I see. I suspected something like that looking at source.

However, this behaviour can create some inconsistencies (though, I concede, only in a few cases). It could be faster, but... is it really worth it?

If you need to convert Paths to strings "manually" to get a "properly" ordered list of pathlib.Paths (instead of the package doing it "automatically"), I do not see any real gain. I just see a point of inconsistency, odd behaviour, and potential programmer errors.

Because I cannot fully understand the algorithm behind pathlib comparisons, I cannot say, as you and Larz60+ pointed out, whether it is really necessary to convert to a string in order to compare two Paths internally. Is it not possible to get an alphabetical order (as with strings) using the internal lists in the pathlib implementation itself?

Of course, this is a minor annoyance. I have been working with pathlib since its inception and it is the first time I have encountered this.

Thanks.
Reply
#9
All of the support that comes with pathlib is very well worth it.
I have been using it in applications and it saves a great deal of time.

Consider the following snippet of code:
 
            for key, entry in ffmt.items():
                filelist = []
                filepath = self.fips.homepath.joinpath(*entry['location'])
                print(f'\n{filepath.resolve()}')
                if entry['filename'] == '..multi..':
                    filelist = [x for x in filepath.iterdir() if x.is_file()]
                else:
                    filelist.append(filepath)
                for file in filelist:
                    with file.open(encoding=encode) as f:
                        for rec in f:
                            fields = self.prepare_rec(rec.strip(), entry, gethead)
ffmt is a dictionary containing information on about twenty files, all located in separate
directories. Each of this dictionary's items is a nested dictionary containing information pertaining
to one file. The sub-dictionary contains entries for file location, delimiter, and field information,
part of which is shown here:

                    'location': ['..', 'data', 'fipsCodes', 'GNIScodesForNamedPopulatedPlaces-etc',
                                 'CountryNames', 'geonames_20171023', 'Countries.txt'],
                    'delim': '    ',
from which filepath can be constructed.
This entire structure allows for a very simple interface that is easy to understand, and does oh so much!
A descriptor dictionary such as this can be easily stored in a json file, for use by all programs in the
application.
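
As a rough illustration of that last point, the sketch below round-trips such a descriptor through JSON and rebuilds a path from it. The 'countries' key, the shortened 'location' list, and the descriptor name are hypothetical; PurePosixPath is an assumption made so the result is the same on any platform; and a JSON string stands in for the file on disk.

```python
import json
from pathlib import PurePosixPath

# Hypothetical descriptor with one entry, shaped like the fragment above.
descriptor = {
    'countries': {
        'location': ['..', 'data', 'fipsCodes', 'Countries.txt'],
    },
}

# A JSON string stands in for a descriptor file on disk.
text = json.dumps(descriptor)
loaded = json.loads(text)

home = PurePosixPath('.')
built = {}
for key, entry in loaded.items():
    # Unpack the list of path nodes into joinpath, as in the snippet above.
    built[key] = home.joinpath(*entry['location'])
    print(key, built[key])
```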

In my book, pathlib is well worth the effort required to become comfortable with it. Take a look at the docs.
Reply
#10
@Larz60+
What is the name of your book, and where can I find it?
Reply

