Python Forum
Sorting and Merging text-files
#1
Hello everybody,

With a batch script I export multiple text files to a specific folder, where I want to merge them into one.
For that I used the type command in the Windows cmd, which worked fine. But after comparing the files I noticed that the order is wrong. I found out that CMD sorts them differently (file_1,file_10,file_11,file_2,file_3,file_33) from how I would do it (file_1,file_2,file_3,file_10,file_11,file_33).

Now I would like to use a Python script to merge them in the order I would sort them (file_1,file_2,file_3,file_10,file_11,file_33).

This is what I have so far but I don't know how to go on:

#!/usr/bin/env python3

import os
import re

folder_path = "../Outputs/"

for root, dirs, files in os.walk(folder_path, topdown = False):
    for name in files:
        if name.endswith(".txt"):
            file_name = os.path.join(root, name)
Edit: All my files start with string "file_" and an ongoing number. It differs how many files there will be after I exported them so I can't set it up manually.
Reply
#2
Why are you calling os.walk() if you want to merge the files only in one folder?
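A minimal sketch of the single-folder alternative, assuming all exported files sit directly in one directory. The function name list_txt_files is made up for illustration. os.scandir() lists only that folder and does not descend into subfolders the way os.walk() does:

```python
import os


def list_txt_files(folder_path):
    """Return the paths of the .txt files directly inside folder_path."""
    paths = []
    # os.scandir() yields one entry per item in this folder only
    with os.scandir(folder_path) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(".txt"):
                paths.append(entry.path)
    return paths
```

The entries come back in arbitrary order, so the list still has to be sorted before merging.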
Reply
#3
#!/usr/bin/env python3

from pathlib import Path


def sort_by_int(path):
    # Path has the stem attribute, which is
    # the filename without the last extension
    # to sort the paths by integer, you
    # need to get the integer part of the str
    # and convert it to an integer
    # the _ is the character where you can split
    # maxsplit=1 does only split once,
    # so you get two elements back
    # if the _ is missing, split returns only one element
    # and the index [1] raises an IndexError
    return int(path.stem.split("_", maxsplit=1)[1])


# Use the high level Path object
outputs = Path.home() / "Outputs"
# print(outputs)
# Path: /home/username/Outputs

# use glob for easier search
# rglob searches recursively
# glob and rglob replicate the shell syntax:
# * matches any number of characters and ? matches exactly one

search = "file_*.txt"
# sorted takes a key argument, which is used to define how it's sorted
# sort_by_int just returns an int and the sorted function
# is using this number to sort
sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
# the result is a list
# sorted consumes the iterable object and returns a list
# with the sorted elements

# now using this sorted list with Path objects
for path in sorted_outputs:
    # glob does not differentiate between files, directories or other
    # elements
    # so you need to check, if path is a file
    if path.is_file():
        # if it's a file, then open it
        # the Path object has the method open
        # like the open function, it supports a context manager
        with path.open() as fd:
            # iterating this file line by line
            # where the line end is not stripped away
            for line in fd:
                # print the line, but tell print not to add an additional
                # line end, because the line has already a line end
                print(line, end="")
                # you can use the stdout to redirect the output
                # in your shell to a file for example or netcat
                # or gzip etc...
Using the program (I named it randomly searchp.py):

Output:
$ python3 searchp.py | gzip > output.txt.gz
$ zcat output.txt.gz
666
123
000
I created the Outputs directory in my home directory and put 3 files in, where each file had only one line with a newline character at the end.
My code examples are always for Python >=3.6.0
Reply
#4
If you have control of the file names, the easiest solution is to generate file names that sort properly: file_000, file_001, file_010, file_100.
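A small sketch of what that could look like on the Python side. The helper padded_name and the width of 3 digits are assumptions for illustration; the export itself happens in a batch script, which would need its own padding logic:

```python
def padded_name(number, width=3):
    """Build a name such as file_007.txt; width=3 is an assumed digit count."""
    return f"file_{number:0{width}d}.txt"


names = [padded_name(n) for n in (1, 10, 11, 2, 3, 33)]
# plain string sorting now agrees with numeric order,
# as long as no number needs more than `width` digits
print(sorted(names))
```

With padded names even the type command in cmd would pick the files up in the expected order, provided the width is large enough for the highest number.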
Reply
#5
Thanks for your reply. When I start the script on my Windows machine there is no output. Do I need to do some tweaking of your script?
Reply
#6
(Aug-19-2021, 05:02 PM)AlphaInc Wrote: When I start the script on my Windows machine there is no output. Do I need to do some tweaking of your script?
On Windows the home path will be
C:\Users\<username>\Outputs
If you do not want to create this Outputs folder, you can give the path to wherever you have the .txt files.
E.g.
outputs = Path(r'G:\div_code\answer')
Below is a test where I also write the lines to a file, because the shell redirect shown by DeaD_EyE may not work for you: gzip and zcat, for example, are not part of Windows.
from pathlib import Path

def sort_by_int(path):
    # Path has the stem attribute, which is
    # the filename without the last extension
    # to sort the paths by integer, you
    # need to get the integer part of the str
    # and convert it to an integer
    # the _ is the character where you can split
    # maxsplit=1 does only split once,
    # so you get two elements back
    # if the _ is missing, split returns only one element
    # and the index [1] raises an IndexError
    return int(path.stem.split("_", maxsplit=1)[1])

# Use the high level Path object
#outputs = Path.home() / "Outputs"
outputs = Path(r'G:\div_code\answer')
#print(outputs)
# Path: G:\div_code\answer

# use glob for easier search
# rglob searches recursively
# glob and rglob replicate the shell syntax:
# * matches any number of characters and ? matches exactly one

search = "file_*.txt"
# sorted takes a key argument, which is used to define how it's sorted
# sort_by_int just returns an int and the sorted function
# is using this number to sort
sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
#print(sorted_outputs)
# the result is a list
# sorted consumes the iterable object and returns a list
# with the sorted elements

# now using this sorted list with Path objects
lines = []
for path in sorted_outputs:
    # glob does not differentiate between files, directories or other
    # elements
    # so you need to check, if path is a file
    if path.is_file():
        print(path)
        # if it's a file, then open it
        # the Path object has the method open
        # like the open function, it supports a context manager
        with path.open() as fd:
            # iterating this file line by line
            # where the line end is not stripped away
            for line in fd:
                # print the line, but tell print not to add an additional
                # line end, because the line has already a line end
                print(line, end="")
                # strip() removes the line end and any other
                # surrounding whitespace before the line is stored
                lines.append(line.strip())
                # you can use the stdout to redirect the output
                # in your shell to a file for example or netcat
                # or gzip etc...

with open('lines.txt', 'w') as f:
    f.write('\n'.join(lines))
Output:
G:\div_code\answer\file_1.txt
line1
G:\div_code\answer\file_3.txt
line2
G:\div_code\answer\file_10.txt
line3
G:\div_code\answer\file_33.txt
line4
lines.txt:
Output:
line1
line2
line3
line4
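If the files are large, or if blank lines and indentation should survive the merge, the contents can also be copied straight into the output file instead of being collected in a list first. A rough sketch, not tested against the original data: merge_files is a made-up helper name, and it assumes every input file ends with a line end so that the files do not run into each other:

```python
import shutil
from pathlib import Path


def merge_files(paths, target):
    """Append the content of each path, in the given order, to target."""
    with open(target, "w") as out:
        for path in paths:
            with Path(path).open() as fd:
                # copies in chunks, so nothing is stripped or reformatted
                shutil.copyfileobj(fd, out)
```

It could be called with the sorted list from the script, e.g. merge_files(sorted_outputs, "merged.txt"), where merged.txt is only a placeholder name. The target should not match the file_*.txt pattern inside the input folder, otherwise a later run would pick it up as an input.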
Reply
#7
Thanks also for your help. I can see that the script processes all my files, but it does not create an output file.

Edit: My bad, I was just looking for the wrong file name. I found it. Thank you.
Reply
#8
Sorry once again but I get an error:

Traceback (most recent call last):
  File "FileProcessing_11.py", line 30, in <module>
    sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
  File "FileProcessing_11.py", line 13, in sort_by_int
    return int(path.stem.split("_", maxsplit=1)[1])
IndexError: list index out of range
Could it be because the file names have spaces in them?
Reply
#10
(Aug-20-2021, 08:12 AM)AlphaInc Wrote: Could it be because the file names have spaces in them?
A couple of tips on how to troubleshoot this.
def sort_by_int(path):
    print(path)
    print(path.stem)
    # Path has the stem attribute, which is
By adding these print() calls you see what happens before the error.

Test.
>>> f = Path(r'G:\div_code\answer\file_33.txt')
>>> f.stem
'file_33'
>>> f.stem.split('_', maxsplit=1)
['file', '33']
>>> f.stem.split('_', maxsplit=1)[1]
'33'
Reproducing your error:
>>> f = Path(r'G:\div_code\answer\file33.txt')
>>> f.stem
'file33'
>>> f.stem.split('_', maxsplit=1)
['file33']
>>> f.stem.split('_', maxsplit=1)[1]
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
IndexError: list index out of range
In your first post, all the file names you showed (file_1, file_2, file_3, file_10, etc.) had a _, and with those it should work.
With those print() calls, or with repr() added (which makes every character visible, e.g. a space), you will see every file name that is passed in before the error.
print(repr(path))
print(repr(path.stem))
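Once the prints show which name breaks the key function, one option is to skip names that do not fit the pattern instead of letting the sort crash. A sketch under the assumption that valid files are named exactly file_<number>.txt; the names NAME_PATTERN, number_of and sorted_numbered are made up here:

```python
import re
from pathlib import Path

# the stem must be "file_" followed only by digits, e.g. file_33
NAME_PATTERN = re.compile(r"file_(\d+)")


def number_of(path):
    """Return the number in the stem, or None if the name does not fit."""
    match = NAME_PATTERN.fullmatch(Path(path).stem)
    return int(match.group(1)) if match else None


def sorted_numbered(paths):
    """Sort the fitting paths by number; names that do not fit are dropped."""
    fitting = [p for p in paths if number_of(p) is not None]
    return sorted(fitting, key=number_of)
```

Silently dropping files can hide a real naming problem, so it may be worth printing the skipped names at least once.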
Reply

