A more intelligent .sort()?

DeaD_EyE · Jul-08-2020, 07:52 AM

I haven't looked into the code of natsort. They could have implemented a key function like this:

import re

# better: use this package instead
# https://pypi.org/project/natsort/

def natkey(text):
    result = []
    for element in re.split(r"(\d+)", text):
        if element.isdecimal():
            result.append(int(element))
        else:
            result.extend(map(ord, element))
    return tuple(result)

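re.split separates the numbers from the rest of the text.
The parentheses around \d+ form a capturing group, so the matched numbers are kept in the result. Without the group, the numbers would be dropped, because re.split normally discards whatever it splits on.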
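
A quick way to see the effect of the capturing group is to split the same name with and without it. This is just a sketch; the variable names are made up for the illustration:

```python
import re

name = "CE3_1_page_29.pdf"

# With the capturing group, the numbers stay in the result.
with_group = re.split(r"(\d+)", name)
print(with_group)     # ['CE', '3', '_', '1', '_page_', '29', '.pdf']

# Without it, the numbers are used as separators and thrown away.
without_group = re.split(r"\d+", name)
print(without_group)  # ['CE', '_', '_page_', '.pdf']
```
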
Each character has a code point, which you get with ord().

A = ord("A")
a = ord("a")
print(A, hex(A), sep=", ")
print(a, hex(a), sep=", ")

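Plain text is sorted in lexicographical order: the strings are compared character by character, by code point. Usually, a string consists of more than one character.
Tuples are compared element by element as well, but elements of different types (for example int and str) cannot be compared with each other. The resulting tuple must have only int as elements (or another single data type that is comparable). To convert a string into a tuple of code points: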

greeting = "Greetings and salvation."
result = tuple(map(ord, greeting))
print(result)

Then you get a tuple with numbers back:

Output:
(71, 114, 101, 101, 116, 105, 110, 103, 115, 32, 97, 110, 100, 32, 115, 97, 108, 118, 97, 116, 105, 111, 110, 46)
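
The remark about mixed types can be checked directly. The sketch below shows the TypeError you get when an int and a str meet at the same position, and also a side effect of the all-int key that is worth knowing: a number and a character with the same code point end up as the same element. natkey is repeated here only so the block runs on its own; mixed_fails is a name made up for the demo.

```python
import re


def natkey(text):
    # Same function as above, repeated so this block is self-contained.
    result = []
    for element in re.split(r"(\d+)", text):
        if element.isdecimal():
            result.append(int(element))
        else:
            result.extend(map(ord, element))
    return tuple(result)


# An int and a str at the same position cannot be ordered.
mixed_fails = False
try:
    sorted([("page", 28), ("page", "x")])
except TypeError as exc:
    mixed_fails = True
    print("TypeError:", exc)

# Side effect of using only ints: ord("d") is 100, so the text "d"
# and the number 100 produce the same key.
print(natkey("100"), natkey("d"))    # (100,) (100,)
print(natkey("100") == natkey("d"))  # True
```

For filenames like the ones in this thread this should rarely matter, but it is one reason to prefer the natsort package for general use.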

Now the problem is that you want to convert the numbers in the str to the data type int.
A string has many methods for such checks. For example, str.isdecimal tells you whether the str consists only of decimal digits. There are many more methods.
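
To get a feeling for str.isdecimal, here is a small sketch comparing it with str.isdigit on a few inputs. The sample strings are my own choice; the point is that isdecimal only accepts text that int() can actually convert, while isdigit also accepts characters such as the superscript two.

```python
samples = ["28", "page", "", "²"]

# Collect (isdecimal, isdigit) for each sample.
checks = {text: (text.isdecimal(), text.isdigit()) for text in samples}
for text, (is_decimal, is_digit) in checks.items():
    print(repr(text), is_decimal, is_digit)

# int() rejects the superscript two, although isdigit() accepts it.
try:
    int("²")
    superscript_converts = True
except ValueError:
    superscript_converts = False
print(superscript_converts)  # False
```
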

In the example function natkey I convert the str to an int with the built-in function int(). This is appended to the list.

If the str is not decimal, the else-block is executed. The method list.extend() takes an iterable and extends the list with the elements from that iterable.
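
The difference between append and extend can be traced by hand. This sketch replays the first few pieces that re.split returns for 'CE3_1_page_29.pdf'; the piece-by-piece calls are written out only for illustration.

```python
result = []

result.extend(map(ord, "CE"))  # text piece: one int per character
print(result)                  # [67, 69]

result.append(int("3"))        # number piece: a single int
print(result)                  # [67, 69, 3]

result.extend(map(ord, "_"))
print(result)                  # [67, 69, 3, 95]

result.extend(map(ord, ""))    # an empty piece adds nothing
print(result)                  # [67, 69, 3, 95]
```
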

Applying the function to your example:

In [37]: for t in ('CE3_1_page_29.pdf', 'CE3_1_page_28.pdf'):
    ...:     print(natkey(t))
    ...:
(67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 29, 46, 112, 100, 102)
(67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 28, 46, 112, 100, 102)

The first result contains the number 29 and the second the number 28.
The rest is identical. Sorting these tuples now:

In [38]: tuple1 = natkey('CE3_1_page_29.pdf')
    ...: tuple2 = natkey('CE3_1_page_28.pdf')
    ...: sorted([tuple1, tuple2])
Out[38]:
[(67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 28, 46, 112, 100, 102),
 (67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 29, 46, 112, 100, 102)]

The 28 is smaller and comes first. You can reverse the order.

In [39]: tuple1 = natkey('CE3_1_page_29.pdf')
    ...: tuple2 = natkey('CE3_1_page_28.pdf')
    ...: sorted([tuple1, tuple2], reverse=True)
Out[39]:
[(67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 29, 46, 112, 100, 102),
 (67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 28, 46, 112, 100, 102)]

Finally you apply the function to your filenames:

result = sorted(['CE3_1_page_29.pdf', 'CE3_1_page_41.pdf', 'CE3_1_page_28.pdf', 'CE3_11_page_14.pdf'], key=natkey)
print(result)

So instead of letting sorted create the key for the comparison itself, you pass your own key function, which returns these tuples (they can also be lists).

Output:
['CE3_1_page_28.pdf', 'CE3_1_page_29.pdf', 'CE3_1_page_41.pdf', 'CE3_11_page_14.pdf']
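
The difference to the default ordering is easier to see when the numbers have different lengths. The following sketch uses made-up filenames and also shows that the same key works with list.sort(), which sorts in place. natkey is repeated so the block runs on its own.

```python
import re


def natkey(text):
    # Same function as above, repeated so this block is self-contained.
    result = []
    for element in re.split(r"(\d+)", text):
        if element.isdecimal():
            result.append(int(element))
        else:
            result.extend(map(ord, element))
    return tuple(result)


names = ["page_10.pdf", "page_9.pdf", "page_1.pdf"]

plain = sorted(names)                # compares character by character
natural = sorted(names, key=natkey)  # compares numbers as numbers
print(plain)    # ['page_1.pdf', 'page_10.pdf', 'page_9.pdf']
print(natural)  # ['page_1.pdf', 'page_9.pdf', 'page_10.pdf']

# list.sort() accepts the same key and changes the list itself.
names.sort(key=natkey)
print(names)    # ['page_1.pdf', 'page_9.pdf', 'page_10.pdf']
```
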

Later in your code you do something like this:

import os


for root, dirs, files in os.walk("."):
    for file in sorted(files, key=natkey):
        # files are sorted in memory
        if file.endswith(".pdf"):
            print(os.path.join(root, file))

My output:

Output:
.\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\202.5792.43\help\ReferenceCard.pdf
.\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\202.5792.43\help\ReferenceCardForMac.pdf
.\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\202.6109.24\help\ReferenceCard.pdf
.\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\202.6109.24\help\ReferenceCardForMac.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\back.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\filesave.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\forward.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\hand.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\help.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\home.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\matplotlib.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\move.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\qt4_editor_options.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\subplots.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\zoom_to_rect.pdf

The leading . comes from the argument passed to os.walk. It's a relative path. You can also use absolute paths.
Within each directory, the files are sorted by natkey.
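
If the subdirectories should be visited in natural order too, the dirs list can be sorted in place, because os.walk (in its default top-down mode) uses that list to decide where to go next. The following is only a sketch: walk_pdfs and the temporary directory layout are invented for the demo, and natkey is repeated so the block runs on its own.

```python
import os
import re
import tempfile


def natkey(text):
    # Same function as above, repeated so this block is self-contained.
    result = []
    for element in re.split(r"(\d+)", text):
        if element.isdecimal():
            result.append(int(element))
        else:
            result.extend(map(ord, element))
    return tuple(result)


def walk_pdfs(top):
    """Return all .pdf paths below top, directories and files in natural order."""
    found = []
    for root, dirs, files in os.walk(top):
        # In-place sort: os.walk visits the subdirectories in this order.
        dirs.sort(key=natkey)
        for name in sorted(files, key=natkey):
            if name.endswith(".pdf"):
                found.append(os.path.join(root, name))
    return found


# Build a small throwaway tree to try it out.
with tempfile.TemporaryDirectory() as top:
    for sub in ("part_10", "part_2"):
        os.mkdir(os.path.join(top, sub))
        for name in ("page_10.pdf", "page_9.pdf"):
            open(os.path.join(top, sub, name), "w").close()
    demo = [os.path.relpath(path, top) for path in walk_pdfs(top)]

for path in demo:
    print(path)
```

With this layout, part_2 is listed before part_10, and inside each directory page_9.pdf comes before page_10.pdf.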
