Python Forum
Sorting and Merging text-files [SOLVED] - Printable Version


Sorting and Merging text-files [SOLVED] - AlphaInc - Aug-19-2021

Hello everybody,

With a batch script I export multiple text files to a specific folder, where I want to merge them into one.
For that I used the type command in the Windows cmd, which worked fine. But after comparing the files I noticed that the order is wrong. I found out that CMD sorts them differently (file_1, file_10, file_11, file_2, file_3, file_33) from how I would do it (file_1, file_2, file_3, file_10, file_11, file_33).

Now I would like to use a Python script to merge them in the order I would sort them (file_1, file_2, file_3, file_10, file_11, file_33).

This is what I have so far, but I don't know how to go on:

#!/usr/bin/env python3

import os
import re

folder_path = "../Outputs/"

for root, dirs, files in os.walk(folder_path, topdown = False):
    for name in files:
        if name.endswith(".txt"):
            file_name = os.path.join(root, name)
Edit: All my files start with the string "file_" followed by a running number. The number of files differs from export to export, so I can't set the order up manually.
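The ordering difference described above can be reproduced in a few lines. This is only a sketch using the six example names from the post, not the real folder contents: plain string sorting compares character by character, while sorting on the number after the underscore gives the intended order.

```python
# Example names taken from the post; the real folder may contain others.
names = ["file_1", "file_10", "file_11", "file_2", "file_3", "file_33"]

# Plain string comparison goes character by character,
# so "file_10" sorts before "file_2".
lexicographic = sorted(names)

# Sorting on the integer after the underscore gives the intended order.
numeric = sorted(names, key=lambda name: int(name.split("_", maxsplit=1)[1]))

print(lexicographic)
print(numeric)
```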


RE: Sorting and Merging text-files - Gribouillis - Aug-19-2021

Why are you calling os.walk() if you want to merge the files only in one folder?


RE: Sorting and Merging text-files - DeaD_EyE - Aug-19-2021

#!/usr/bin/env python3

from pathlib import Path


def sort_by_int(path):
    # Path has the stem attribute, which is
    # the filename without the last extension
    # to sort the paths by integer, you
    # need to get the integer part of the str
    # and convert it to an integer
    # the _ is the character where you can split
    # maxsplit=1 splits only once,
    # so you get two elements back
    # if the _ is missing, split returns a single element
    # and the index [1] raises an IndexError
    return int(path.stem.split("_", maxsplit=1)[1])


# Use the high level Path object
outputs = Path.home() / "Outputs"
# print(outputs)
# Path: /home/username/Outputs

# use glob for easier search
# rglob is to search recursive
# glob and rglob replicates the shell-syntax
# the wildcard is one * and a ? stands for one character

search = "file_*.txt"
# sorted takes a key argument, which is used to define how it's sorted
# sort_by_int just returns an int and the sorted function
# is using this number to sort
sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
# the result is a list
# sorted consumes the iterable object and returns a list
# with the sorted elements

# now using this sorted list with Path objects
for path in sorted_outputs:
    # glob does not distinguish between files, directories and other
    # elements,
    # so you need to check whether path is a file
    if path.is_file():
        # if it's a file, then open it
        # the Path object has the method open
        # like the open function, it supports a context manager
        with path.open() as fd:
            # iterating this file line by line
            # where the line end is not stripped away
            for line in fd:
                # print the line, but tell print not to add an additional
                # line end, because the line has already a line end
                print(line, end="")
                # you can use the stdout to redirect the output
                # in your shell to a file for example or netcat
                # or gzip etc...
Using the program (I arbitrarily named it searchp.py):


Output:
[andre@andre-Fujitsu-i5 ~]$ python3 searchp.py | gzip > output.txt.gz
[andre@andre-Fujitsu-i5 ~]$ zcat output.txt.gz
666
123
000
I created the Outputs directory in my home directory and put 3 files in it, each containing a single line with a newline character at the end.


RE: Sorting and Merging text-files - deanhystad - Aug-19-2021

If you have control of the file names, the easiest solution is to generate file names that sort properly as plain strings: file_000, file_001, file_010, file_100.
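A minimal sketch of that idea, assuming the exporting step can be changed to produce the names. The helper `padded_name` and the width of 3 digits are made up for illustration; the width has to be large enough for the highest number that can occur.

```python
def padded_name(index, width=3):
    """Return a zero-padded name such as 'file_007.txt' (hypothetical scheme)."""
    return f"file_{index:0{width}d}.txt"


# With a fixed number of digits, plain string sorting matches numeric order.
names = [padded_name(i) for i in (1, 10, 100, 2, 33)]
print(sorted(names))
```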


RE: Sorting and Merging text-files - AlphaInc - Aug-19-2021

Thanks for your reply. When I start the script on my Windows machine, there is no output. Do I need to tweak your script?


RE: Sorting and Merging text-files - snippsat - Aug-19-2021

(Aug-19-2021, 05:02 PM)AlphaInc Wrote: So when I start the script on my windows machine there is no output. Do I need to do some tweaking of your script ?
On Windows, Path.home() / "Outputs" will be
C:\Users\<username>\Outputs
You can give the path to where you have the .txt files if you don't want to create this Outputs folder.
E.g.
outputs = Path(r'G:\div_code\answer')
I tested it and added code to save the lines, as the redirect shown by DeaD_EyE may not work for you, since e.g. gzip and zcat are not part of Windows.
from pathlib import Path

def sort_by_int(path):
    # Path has the stem attribute, which is
    # the filename without the last extension
    # to sort the paths by integer, you
    # need to get the integer part of the str
    # and convert it to an integer
    # the _ is the character where you can split
    # maxsplit=1 does only split once,
    # so you get two elements back
    # if the _ is missing, split returns a single element
    # and the index [1] raises an IndexError
    return int(path.stem.split("_", maxsplit=1)[1])

# Use the high level Path object
#outputs = Path.home() / "Outputs"
outputs = Path(r'G:\div_code\answer')
#print(outputs)
# Path: /home/username/Outputs

# use glob for easier search
# rglob is to search recursive
# glob and rglob replicates the shell-syntax
# the wildcard is one * and a ? stands for one character

search = "file_*.txt"
# sorted takes a key argument, which is used to define how it's sorted
# sort_by_int just returns an int and the sorted function
# is using this number to sort
sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
#print(sorted_outputs)
# the result is a list
# sorted consumes the iterable object and returns a list
# with the sorted elements

# now using this sorted list with Path objects
lines = []
for path in sorted_outputs:
    # glob does not distinguish between files, directories and other
    # elements,
    # so you need to check whether path is a file
    if path.is_file():
        print(path)
        # if it's a file, then open it
        # the Path object has the method open
        # like the open function, it supports a context manager
        with path.open() as fd:
            # iterating this file line by line
            # where the line end is not stripped away
            for line in fd:
                # print the line, but tell print not to add an additional
                # line end, because the line already has a line end
                print(line, end="")
                # store the line without its trailing line end;
                # leading whitespace is kept
                lines.append(line.rstrip("\n"))
                # you can use the stdout to redirect the output
                # in your shell to a file for example or netcat
                # or gzip etc...

with open('lines.txt', 'w') as f:
    f.write('\n'.join(lines))
Output:
G:\div_code\answer\file_1.txt
line1
G:\div_code\answer\file_3.txt
line2
G:\div_code\answer\file_10.txt
line3
G:\div_code\answer\file_33.txt
line4
lines.txt:
Output:
line1
line2
line3
line4
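As a variation on the script above, the merged text can also be written straight to the output file instead of being collected in a list first. This is only a sketch under some assumptions: the function name `merge_files`, the UTF-8 encoding and the output name used in the usage note are not from the thread, and the output file must not itself match the search pattern.

```python
from pathlib import Path


def sort_by_int(path):
    # Same key as in the script above: the integer after the first underscore.
    return int(path.stem.split("_", maxsplit=1)[1])


def merge_files(folder, destination, pattern="file_*.txt"):
    """Concatenate the matching files in numeric order into destination.

    Assumption: all files are UTF-8 text. Returns the list of merged paths.
    """
    folder = Path(folder)
    # Collect and sort the inputs before the output file is created.
    paths = sorted(
        (p for p in folder.glob(pattern) if p.is_file()),
        key=sort_by_int,
    )
    with Path(destination).open("w", encoding="utf-8") as out:
        for path in paths:
            text = path.read_text(encoding="utf-8")
            out.write(text)
            # Keep the files separated if one lacks a final line end.
            if text and not text.endswith("\n"):
                out.write("\n")
    return paths
```

It could be called as merge_files(r'G:\div_code\answer', 'merged.txt'), where merged.txt is just an example name.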



RE: Sorting and Merging text-files - AlphaInc - Aug-20-2021

Thanks for your help as well. I can see that the script processes all my files, but it does not create an output file.

Edit: My bad, I was just looking for the wrong name. I found it. Thank you.


RE: Sorting and Merging text-files - AlphaInc - Aug-20-2021

Sorry once again but I get an error:

Traceback (most recent call last):
  File "FileProcessing_11.py", line 30, in <module>
    sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
  File "FileProcessing_11.py", line 13, in sort_by_int
    return int(path.stem.split("_", maxsplit=1)[1])
IndexError: list index out of range
Could it be because the file names have spaces in them?


RE: Sorting and Merging text-files - snippsat - Aug-20-2021

(Aug-20-2021, 08:12 AM)AlphaInc Wrote: Could it be because the files got spaces in it?
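A couple of tips on how to troubleshoot this.
def sort_by_int(path):
    print(path)
    print(path.stem)
    # Path has the stem attribute, which is
    # the filename without the last extension
    return int(path.stem.split("_", maxsplit=1)[1])
By adding these prints you see what happens before the error.

Test: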
>>> f = Path(r'G:\div_code\answer\file_33.txt')
>>> f.stem
'file_33'
>>> f.stem.split('_', maxsplit=1)
['file', '33']
>>> f.stem.split('_', maxsplit=1)[1]
'33'
Reproducing your error:
>>> f = Path(r'G:\div_code\answer\file33.txt')
>>> f.stem
'file33'
>>> f.stem.split('_', maxsplit=1)
['file33']
>>> f.stem.split('_', maxsplit=1)[1]
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
IndexError: list index out of range
In your first post, all the files you showed (file_1, file_2, file_3, file_10, etc.) had a _, so it should work.
With those print() calls, or with repr() added (which makes everything visible, e.g. spaces), you will see every input file before the error.
print(repr(path))
print(repr(path.stem))
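If the real file names turn out to use a space or no separator at all, a more tolerant key function is one option. This is only a sketch, assuming the number of interest is the last run of digits in the stem; the names `number_or_none` and `split_by_number` are made up for illustration. Files without such a number are returned separately instead of raising an error, so they can be inspected.

```python
import re
from pathlib import Path

# Assumption: the number of interest is the last run of digits in the stem,
# optionally followed by trailing whitespace.
TRAILING_NUMBER = re.compile(r"(\d+)\s*$")


def number_or_none(path):
    """Return the trailing integer of the stem, or None if there is none."""
    match = TRAILING_NUMBER.search(Path(path).stem)
    return int(match.group(1)) if match else None


def split_by_number(paths):
    """Return (paths sorted by trailing number, paths without a number)."""
    numbered, skipped = [], []
    for path in paths:
        (skipped if number_or_none(path) is None else numbered).append(path)
    return sorted(numbered, key=number_or_none), skipped
```

The skipped list can then be printed with repr() as suggested above to see which names do not fit the expected pattern.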