A more intelligent .sort()?

Pedroski55 · Jul-07-2020, 09:38 AM

I made a little program to merge pdfs, works ok.

Apparently, to merge them, you must first split them into individual, 1-page pdfs. Also not a problem.

I split the first pdf into 48 pages. The names all look like this:

pdf_1_page_1.pdf, pdf_1_page_2.pdf, pdf_1_page_3.pdf and so on.

Then I read them in:

splitPDFs = os.listdir(pathToSplitPDF)

I don't know by which criterion/criteria os.listdir() reads files in, but pdf_1_page_1.pdf is not splitPDFs[0]

So I do splitPDFs.sort(), then I have pdf_1_page_1.pdf as splitPDFs[0].

However splitPDFs[1] is not pdf_1_page_2.pdf but pdf_1_page_10.pdf, then comes pdf_1_page_11.pdf, pdf_1_page_12.pdf and so on.

I made a workaround, so that I can merge the files in the correct numerical order.

I'm just wondering if there is a way to make sort() sort them in the order pdf_1_page_1.pdf, pdf_1_page_2.pdf, pdf_1_page_3.pdf?

DeaD_EyE · Jul-07-2020, 11:11 AM

The order of os.scandir, os.walk and os.listdir is arbitrary. I think it comes from the file system (for example the inode number of a file). But you're not the first one with this problem.

Python Code Glitch May Have Caused Errors In Over 100 Published Studies

The built-in function sorted and the method list.sort take a key argument.
The items are sorted by this key. If you just sort strings, lexicographical order is applied.
To sort numerically, the numbers must be converted to integers.
The key is a function which takes one element and returns something comparable (often an int).
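A minimal sketch of the difference, using a made-up list of page numbers stored as strings (the list `pages` is only an illustration, not the original poster's data):

```python
# Hypothetical sample data: page numbers stored as strings.
pages = ["10", "9", "2", "1", "11"]

# Without a key, strings are compared character by character.
lexicographic = sorted(pages)

# With key=int, each item is compared by its integer value instead.
numeric = sorted(pages, key=int)

print(lexicographic)  # ['1', '10', '11', '2', '9']
print(numeric)        # ['1', '2', '9', '10', '11']
```

The list itself still contains strings in both cases; the key only decides how the items are compared.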

Sorting just the strings:

['pdf_10_page1',
 'pdf_10_page2',
 'pdf_10_page3',
 'pdf_10_page4',
 'pdf_1_page1',
 'pdf_1_page2',
 'pdf_1_page3',
 'pdf_1_page4',
 'pdf_2_page1',
 'pdf_2_page2',
 'pdf_2_page3',
 'pdf_2_page4',
 'pdf_3_page1',
 'pdf_3_page2',
 'pdf_3_page3',
 'pdf_3_page4',
 'pdf_4_page1',
 'pdf_4_page2',
 'pdf_4_page3',
 'pdf_4_page4',
 'pdf_5_page1',
 'pdf_5_page2',
 'pdf_5_page3',
 'pdf_5_page4',
 'pdf_6_page1',
 'pdf_6_page2',
 'pdf_6_page3',
 'pdf_6_page4',
 'pdf_7_page1',
 'pdf_7_page2',
 'pdf_7_page3',
 'pdf_7_page4',
 'pdf_8_page1',
 'pdf_8_page2',
 'pdf_8_page3',
 'pdf_8_page4',
 'pdf_9_page1',
 'pdf_9_page2',
 'pdf_9_page3',
 'pdf_9_page4']

First you need to know the naming pattern of your files. Then you can use a regex to get the numbers out of the string.
Example:

import re


def sort_pdfs(pdf):
    match = re.search(r"pdf_(\d+)_page(\d+)", pdf)
    if match:
        return tuple(map(int, match.groups()))
    # if the pattern does not match, the name is sorted to the front
    return (0, 0)




pdfs = ['pdf_10_page1',
 'pdf_10_page2',
 'pdf_10_page3',
 'pdf_10_page4',
 'pdf_1_page1',
 'pdf_1_page2',
 'pdf_1_page3',
 'pdf_1_page4',
 'pdf_2_page1',
 'pdf_2_page2',
 'pdf_2_page3',
 'pdf_2_page4',
 'pdf_3_page1',
 'pdf_3_page2',
 'pdf_3_page3',
 'pdf_3_page4',
 'pdf_4_page1',
 'pdf_4_page2',
 'pdf_4_page3',
 'pdf_4_page4',
 'pdf_5_page1',
 'pdf_5_page2',
 'pdf_5_page3',
 'pdf_5_page4',
 'pdf_6_page1',
 'pdf_6_page2',
 'pdf_6_page3',
 'pdf_6_page4',
 'pdf_7_page1',
 'pdf_7_page2',
 'pdf_7_page3',
 'pdf_7_page4',
 'pdf_8_page1',
 'pdf_8_page2',
 'pdf_8_page3',
 'pdf_8_page4',
 'pdf_9_page1',
 'pdf_9_page2',
 'pdf_9_page3',
 'pdf_9_page4'
]

pdfs.sort(key=sort_pdfs)

Output:
['pdf_1_page1',
 'pdf_1_page2',
 'pdf_1_page3',
 'pdf_1_page4',
 'pdf_2_page1',
 'pdf_2_page2',
 'pdf_2_page3',
 'pdf_2_page4',
 'pdf_3_page1',
 'pdf_3_page2',
 'pdf_3_page3',
 'pdf_3_page4',
 'pdf_4_page1',
 'pdf_4_page2',
 'pdf_4_page3',
 'pdf_4_page4',
 'pdf_5_page1',
 'pdf_5_page2',
 'pdf_5_page3',
 'pdf_5_page4',
 'pdf_6_page1',
 'pdf_6_page2',
 'pdf_6_page3',
 'pdf_6_page4',
 'pdf_7_page1',
 'pdf_7_page2',
 'pdf_7_page3',
 'pdf_7_page4',
 'pdf_8_page1',
 'pdf_8_page2',
 'pdf_8_page3',
 'pdf_8_page4',
 'pdf_9_page1',
 'pdf_9_page2',
 'pdf_9_page3',
 'pdf_9_page4',
 'pdf_10_page1',
 'pdf_10_page2',
 'pdf_10_page3',
 'pdf_10_page4']

snippsat · (This post was last modified: Jul-07-2020, 05:48 PM by snippsat.)

This problem has a name: human sort or natural sort.
@DeaD_EyE's solution looks fine.
There are also libraries for this, for example natsort.
Testing natsort with the list from DeaD_EyE's code:

# pip install natsort
>>> from natsort import natsorted
>>> 
>>> natsorted(pdfs)  # pdfs is the list from DeaD_EyE's post
['pdf_1_page1',
 'pdf_1_page2',
 'pdf_1_page3',
 'pdf_1_page4',
 'pdf_2_page1',
 'pdf_2_page2',
 'pdf_2_page3',
 'pdf_2_page4',
 'pdf_3_page1',
 'pdf_3_page2',
 'pdf_3_page3',
 'pdf_3_page4',
 'pdf_4_page1',
 'pdf_4_page2',
 'pdf_4_page3',
 'pdf_4_page4',
 'pdf_5_page1',
 'pdf_5_page2',
 'pdf_5_page3',
 'pdf_5_page4',
 'pdf_6_page1',
 'pdf_6_page2',
 'pdf_6_page3',
 'pdf_6_page4',
 'pdf_7_page1',
 'pdf_7_page2',
 'pdf_7_page3',
 'pdf_7_page4',
 'pdf_8_page1',
 'pdf_8_page2',
 'pdf_8_page3',
 'pdf_8_page4',
 'pdf_9_page1',
 'pdf_9_page2',
 'pdf_9_page3',
 'pdf_9_page4',
 'pdf_10_page1',
 'pdf_10_page2',
 'pdf_10_page3',
 'pdf_10_page4']

Pedroski55 · Jul-07-2020, 10:49 PM

@DeaD_EyE
Wow, I can't pretend to understand it, but it works great! Thanks very much! I had a feeling there was a better way to do this!

I've tried to understand re, but it is very cryptic. I do use re to take words out of texts and leave gaps, but that's all!

My (unsorted) file names are like this: 'CE3_1_page_29.pdf', 'CE3_1_page_41.pdf', 'CE3_1_page_28.pdf', 'CE3_1_page_14.pdf'

I didn't know if I needed the whole name or not, so I tried:

match = re.search(r"CE3_(\d+)_page_(\d+)", pdf)

and

match = re.search(r"CE3_(\d+)_page_(\d+).pdf", pdf)

Both worked!

Now all I have to do is try and understand it!

@snippsat Thanks, I'll fetch natsort and try to use it! Thanks!

DeaD_EyE · Jul-08-2020, 07:52 AM

I haven't looked into the code of natsort. They could have implemented a key function like this:

import re

# for real projects, use this package instead
# https://pypi.org/project/natsort/

def natkey(text):
    result = []
    for element in re.split(r"(\d+)", text):
        if element.isdecimal():
            result.append(int(element))
        else:
            result.extend(map(ord, element))
    return tuple(result)

re.split separates the numbers from the rest of the text.
The parentheses around \d+ form a capturing group, so the numbers are kept in the result. Without them, the numbers would only act as separators and would be dropped.
Each character has a code point, which you get with ord().

A = ord("A")
a = ord("a")
print(A, hex(A), sep=", ")
print(a, hex(a), sep=", ")
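Coming back to re.split: a quick check of what the capturing group changes, using one of the file names from this thread. The variable names are only for illustration.

```python
import re

name = "CE3_1_page_29.pdf"

# With a capturing group, the matched digits are kept in the result.
with_group = re.split(r"(\d+)", name)

# Without the group, the digits only act as separators and are dropped.
without_group = re.split(r"\d+", name)

print(with_group)     # ['CE', '3', '_', '1', '_page_', '29', '.pdf']
print(without_group)  # ['CE', '_', '_page_', '.pdf']
```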

Plain strings are sorted in lexicographical order, character by character. A file name usually consists of more than one part, so the key function returns a tuple.
Comparing mixed types (for example str and int) at the same position of two tuples is not possible. The resulting tuple must therefore contain only int elements (or another single data type that is comparable). To convert a string into a tuple of code points:

greeting = "Greetings and salvation."
result = tuple(map(ord, greeting))
print(result)

Then you get a tuple with numbers back:

Output:
(71, 114, 101, 101, 116, 105, 110, 103, 115, 32, 97, 110, 100, 32, 115, 97, 108, 118, 97, 116, 105, 111, 110, 46)
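The remark about mixed types can be checked directly. This is a small sketch with made-up tuples (`mixed_a` and `mixed_b` are not from the code above):

```python
# Tuples are compared element by element, so the elements at the same
# position must be comparable with each other.
mixed_a = ("page", 2)
mixed_b = (10, "page")

try:
    mixed_a < mixed_b
    comparable = True
except TypeError:
    # str and int cannot be ordered against each other in Python 3
    comparable = False

print(comparable)  # False

# Tuples that contain only int compare without problems.
only_ints = (112, 2) < (112, 10)
print(only_ints)  # True
```

This is why natkey turns the text parts into code points instead of keeping them as str.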

Now the problem is that the numbers inside the str have to be converted to the data type int.
A string has many methods for such checks. For example, str.isdecimal checks whether the str consists only of decimal digits. There are many more methods.

In the example function natkey, a str that consists only of decimal digits is converted to an int with the built-in function int(). This int is appended to the list.

If the str is not decimal, the else-block is executed. The method list.extend() takes an iterable and extends the list with the elements from the iterable.
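A short sketch of the two list methods and of str.isdecimal, using the pieces "29" and ".pdf" from the file names in this thread:

```python
result = []

# append() adds its argument as one single element
result.append(int("29"))

# extend() adds every element of the iterable separately
result.extend(map(ord, ".pdf"))

print(result)  # [29, 46, 112, 100, 102]

# str.isdecimal() is True only if every character is a decimal digit
print("29".isdecimal())    # True
print(".pdf".isdecimal())  # False
print("".isdecimal())      # False (empty string)
```

re.split can return empty strings at the start or end of its result. They end up in the else-block, where extend() adds nothing.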

Applying the function to your example:

In [37]: for t in ('CE3_1_page_29.pdf', 'CE3_1_page_28.pdf'):
    ...:     print(natkey(t))
    ...:
(67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 29, 46, 112, 100, 102)
(67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 28, 46, 112, 100, 102)

The first result contains the number 29 and the second the number 28.
The rest is identical. Sorting these tuples now:

In [38]: tuple1 = natkey('CE3_1_page_29.pdf')
    ...: tuple2 = natkey('CE3_1_page_28.pdf')
    ...: sorted([tuple1, tuple2])
Out[38]:
[(67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 28, 46, 112, 100, 102),
 (67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 29, 46, 112, 100, 102)]

The 28 is smaller and comes first. You can reverse the order.

In [39]: tuple1 = natkey('CE3_1_page_29.pdf')
    ...: tuple2 = natkey('CE3_1_page_28.pdf')
    ...: sorted([tuple1, tuple2], reverse=True)
Out[39]:
[(67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 29, 46, 112, 100, 102),
 (67, 69, 3, 95, 1, 95, 112, 97, 103, 101, 95, 28, 46, 112, 100, 102)]

Finally you apply the function to your filenames:

result = sorted(['CE3_1_page_29.pdf', 'CE3_1_page_41.pdf', 'CE3_1_page_28.pdf', 'CE3_11_page_14.pdf'], key=natkey)
print(result)

So instead of letting sorted compare the strings directly, you pass your own key function, which returns these tuples (they can also be lists).

Output:
['CE3_1_page_28.pdf', 'CE3_1_page_29.pdf', 'CE3_1_page_41.pdf', 'CE3_11_page_14.pdf']

Later in your code you do something like this:

import os


for root, dirs, files in os.walk("."):
    for file in sorted(files, key=natkey):
        # files are sorted in memory
        if file.endswith(".pdf"):
            print(os.path.join(root, file))

My output:

Output:
.\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\202.5792.43\help\ReferenceCard.pdf
.\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\202.5792.43\help\ReferenceCardForMac.pdf
.\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\202.6109.24\help\ReferenceCard.pdf
.\AppData\Local\JetBrains\Toolbox\apps\PyCharm-P\ch-0\202.6109.24\help\ReferenceCardForMac.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\back.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\filesave.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\forward.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\hand.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\help.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\home.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\matplotlib.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\move.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\qt4_editor_options.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\subplots.pdf
.\AppData\Local\Programs\Python\Python38\Lib\site-packages\matplotlib\mpl-data\images\zoom_to_rect.pdf

The . comes from os.walk("."). It's a relative path. You can also use absolute paths.
The files within each directory are sorted by natkey.
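For the original question with os.listdir, the same key can be used. The following is only a sketch: it creates a temporary directory with a few empty files named like the ones in the first post, as a stand-in for pathToSplitPDF, and natkey is repeated so the example runs on its own.

```python
import os
import re
import tempfile


def natkey(text):
    # same key function as above, repeated to keep the sketch self-contained
    result = []
    for element in re.split(r"(\d+)", text):
        if element.isdecimal():
            result.append(int(element))
        else:
            result.extend(map(ord, element))
    return tuple(result)


# Temporary directory standing in for pathToSplitPDF (hypothetical content).
with tempfile.TemporaryDirectory() as path_to_split_pdf:
    for page in (10, 2, 1, 11):
        name = f"pdf_1_page_{page}.pdf"
        open(os.path.join(path_to_split_pdf, name), "w").close()

    # os.listdir() returns names in arbitrary order, so sort explicitly
    split_pdfs = sorted(os.listdir(path_to_split_pdf), key=natkey)

print(split_pdfs)
# ['pdf_1_page_1.pdf', 'pdf_1_page_2.pdf', 'pdf_1_page_10.pdf', 'pdf_1_page_11.pdf']
```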
