Python Forum
Sorting and Merging text-files [SOLVED]
#1
Hello everybody,

With a batch script I export multiple text files to a specific folder, where I want to merge them into one.
To do that I used the type command in the Windows cmd, which worked fine. But after comparing the files I noticed that the order is wrong. I found out that cmd sorts them differently (file_1,file_10,file_11,file_2,file_3,file_33) from how I would do it (file_1,file_2,file_3,file_10,file_11,file_33).

Now I would like to use a Python script to merge them in the order I would sort them (file_1,file_2,file_3,file_10,file_11,file_33).

This is what I have so far but I don't know how to go on:

#!/usr/bin/env python3

import os
import re

folder_path = "../Outputs/"

for root, dirs, files in os.walk(folder_path, topdown = False):
    for name in files:
        if name.endswith(".txt"):
            file_name = os.path.join(root, name)
Edit: All my file names start with the string "file_" followed by a running number. The number of files varies from export to export, so I can't list them manually.
#2
Why are you calling os.walk() if you only want to merge the files in one folder?
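To illustrate the difference: os.walk() descends into every subfolder, while a plain directory listing stays in the one folder. A minimal sketch, assuming the "../Outputs/" path from the first post; the helper name list_txt_files is made up for this example.

```python
import os


def list_txt_files(folder_path):
    """Return the paths of the .txt files directly inside folder_path.

    Unlike os.walk(), os.scandir() does not descend into subfolders.
    """
    with os.scandir(folder_path) as entries:
        return [
            entry.path
            for entry in entries
            if entry.is_file() and entry.name.endswith(".txt")
        ]


if __name__ == "__main__":
    # "../Outputs/" is the folder from the first post; adjust as needed.
    folder = "../Outputs/"
    if os.path.isdir(folder):
        for path in list_txt_files(folder):
            print(path)
```

The order returned by os.scandir() is arbitrary, so the list still has to be sorted before merging.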
#3
#!/usr/bin/env python3

from pathlib import Path


def sort_by_int(path):
    # Path has the stem attribute, which is
    # the filename without the last extension.
    # To sort the paths numerically, you
    # need to get the numeric part of the stem
    # and convert it to an integer.
    # The _ is the character to split on;
    # maxsplit=1 splits only once,
    # so you get two elements back.
    # If the _ is missing, split returns only one element
    # and the [1] index raises an IndexError.
    return int(path.stem.split("_", maxsplit=1)[1])


# Use the high level Path object
outputs = Path.home() / "Outputs"
# print(outputs)
# Path: /home/username/Outputs

# use glob for easier searching
# rglob searches recursively
# glob and rglob replicate the shell syntax:
# * matches any number of characters and ? matches exactly one character

search = "file_*.txt"
# sorted takes a key argument, which is used to define how it's sorted
# sort_by_int just returns an int and the sorted function
# is using this number to sort
sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
# the result is a list
# sorted consumes the iterable object and returns a list
# with the sorted elements

# now using this sorted list with Path objects
for path in sorted_outputs:
    # glob does not differentiate between files, directories or other
    # elements,
    # so you need to check whether path is a file
    if path.is_file():
        # if it's a file, then open it
        # the Path object has the method open which,
        # like the open function, supports a context manager
        with path.open() as fd:
            # iterate over this file line by line;
            # the line end is not stripped away
            for line in fd:
                # print the line, but tell print not to add an additional
                # line end, because the line already has a line end
                print(line, end="")
                # you can use stdout to redirect the output
                # in your shell, for example to a file, netcat
                # or gzip etc...
Relevant documentation:
Using the program (I arbitrarily named it searchp.py):

Output:
[andre@andre-Fujitsu-i5 ~]$ python3 searchp.py | gzip > output.txt.gz
[andre@andre-Fujitsu-i5 ~]$ zcat output.txt.gz
666
123
000
I created the Outputs directory in my home directory and put three files in it, each containing a single line with a newline character at the end.
#4
If you have control of the file names, the easiest solution is to generate file names that sort properly: file_000, file_001, file_010, file_100
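A small sketch of that idea, assuming the export step can be changed to produce the names. The helper padded_name and its width parameter are made up for this example; the width has to be large enough for the highest number you expect.

```python
def padded_name(index, width=3, prefix="file_", suffix=".txt"):
    """Build a file name whose number is zero-padded to a fixed width."""
    return f"{prefix}{index:0{width}d}{suffix}"


names = [padded_name(i) for i in (1, 10, 11, 2, 3, 33)]

# With a fixed width, a plain string sort agrees with the numeric order.
print(sorted(names))
# ['file_001.txt', 'file_002.txt', 'file_003.txt',
#  'file_010.txt', 'file_011.txt', 'file_033.txt']
```

With names like these, even the cmd type command with a wildcard should pick the files up in the intended order, since alphabetical and numeric order coincide.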
#5
(Aug-19-2021, 02:08 PM)DeaD_EyE Wrote:

Thanks for your reply. When I run the script on my Windows machine there is no output. Do I need to tweak your script?
#6
(Aug-19-2021, 05:02 PM)AlphaInc Wrote: So when I start the script on my windows machine there is no output. Do I need to do some tweaking of your script ?
On Windows, Path.home() / "Outputs" resolves to
C:\Users\<username>\Outputs
If you don't want to create this Outputs folder, you can point the script at the folder that holds your .txt files instead.
E.g.
outputs = Path(r'G:\div_code\answer')
I tested this and added code that saves the lines to a file, because the redirection shown by DeaD_EyE may not work for you: gzip and zcat, for example, are not part of Windows.
from pathlib import Path

def sort_by_int(path):
    # Path has the stem attribute, which is
    # the filename without the last extension.
    # To sort the paths numerically, you
    # need to get the numeric part of the stem
    # and convert it to an integer.
    # The _ is the character to split on;
    # maxsplit=1 splits only once,
    # so you get two elements back.
    # If the _ is missing, split returns only one element
    # and the [1] index raises an IndexError.
    return int(path.stem.split("_", maxsplit=1)[1])

# Use the high level Path object
#outputs = Path.home() / "Outputs"
outputs = Path(r'G:\div_code\answer')
#print(outputs)

# use glob for easier searching
# rglob searches recursively
# glob and rglob replicate the shell syntax:
# * matches any number of characters and ? matches exactly one character

search = "file_*.txt"
# sorted takes a key argument, which is used to define how it's sorted
# sort_by_int just returns an int and the sorted function
# is using this number to sort
sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
#print(sorted_outputs)
# the result is a list
# sorted consumes the iterable object and returns a list
# with the sorted elements

# now using this sorted list with Path objects
lines = []
for path in sorted_outputs:
    # glob does not differentiate between files, directories or other
    # elements,
    # so you need to check whether path is a file
    if path.is_file():
        print(path)
        # if it's a file, then open it
        # the Path object has the method open which,
        # like the open function, supports a context manager
        with path.open() as fd:
            # iterate over this file line by line;
            # the line end is not stripped away
            for line in fd:
                # print the line, but tell print not to add an additional
                # line end, because the line already has a line end
                print(line, end="")
                # store the line without its line end; rstrip("\n") keeps
                # leading and trailing spaces that strip() would remove
                lines.append(line.rstrip("\n"))

# write the collected lines to lines.txt, one per line
with open('lines.txt', 'w') as f:
    f.write('\n'.join(lines))
Output:
G:\div_code\answer\file_1.txt
line1
G:\div_code\answer\file_3.txt
line2
G:\div_code\answer\file_10.txt
line3
G:\div_code\answer\file_33.txt
line4
lines.txt:
Output:
line1
line2
line3
line4
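If the files are large, collecting every line in a list is not necessary. As an alternative sketch (not the code used above), each file can be copied straight into the output file. The names merge_files and merged.txt are made up for this example, and it assumes every file name has the form file_<number>.txt.

```python
import shutil
from pathlib import Path


def sort_by_int(path):
    # Same key as in the script above: the number after the first "_".
    return int(path.stem.split("_", maxsplit=1)[1])


def merge_files(folder, target, pattern="file_*.txt"):
    """Concatenate the matching files in numeric order into target.

    Returns the list of source paths in the order they were written.
    """
    folder = Path(folder)
    target = Path(target)
    # Build the sorted list first and skip the target itself,
    # in case its name happens to match the pattern.
    sources = sorted(
        (
            p
            for p in folder.glob(pattern)
            if p.is_file() and p.resolve() != target.resolve()
        ),
        key=sort_by_int,
    )
    # Binary mode copies the bytes unchanged (no newline translation).
    with target.open("wb") as out:
        for path in sources:
            with path.open("rb") as src:
                shutil.copyfileobj(src, out)
    return sources
```

It could be called as merge_files(r'G:\div_code\answer', 'merged.txt'). Note that it copies the files byte for byte, so if a file does not end with a newline, its last line runs into the first line of the next file.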
#7
(Aug-19-2021, 10:12 PM)snippsat Wrote:

Thanks for your help as well. I can see that the script processes all my files, but it does not create an output file.

Edit: My bad, I was looking for the wrong file name. I found it. Thank you.
#8
(Aug-20-2021, 05:48 AM)AlphaInc Wrote:

Sorry once again but I get an error:

Traceback (most recent call last):
  File "FileProcessing_11.py", line 30, in <module>
    sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
  File "FileProcessing_11.py", line 13, in sort_by_int
    return int(path.stem.split("_", maxsplit=1)[1])
IndexError: list index out of range
Could it be because the file names contain spaces?
#10
(Aug-20-2021, 08:12 AM)AlphaInc Wrote: Could it be because the files got spaces in it?
A couple of tips on how to troubleshoot this.
def sort_by_int(path):
    print(path)
    print(path.stem)
    # Path has the stem attribute, which is
By adding these print() calls you can see what is processed right before the error.

Test:
>>> f = Path(r'G:\div_code\answer\file_33.txt')
>>> f.stem
'file_33'
>>> f.stem.split('_', maxsplit=1)
['file', '33']
>>> f.stem.split('_', maxsplit=1)[1]
'33'
Reproducing your error:
>>> f = Path(r'G:\div_code\answer\file33.txt')
>>> f.stem
'file33'
>>> f.stem.split('_', maxsplit=1)
['file33']
>>> f.stem.split('_', maxsplit=1)[1]
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
IndexError: list index out of range
In your first post, all the file names you showed (file_1, file_2, file_3, file_10, etc.) contained a _, so it should work with those.
With those print() calls, or with repr() added (which makes everything visible, e.g. spaces), you will see every file name that is processed before the error.
print(repr(path))
print(repr(path.stem))
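Once you know which names are causing trouble, the sort can also be made tolerant of them. The following is only a sketch under the assumption that valid names look like file_<number>.txt; NAME_PATTERN and split_by_pattern are made-up names for this example. Names that do not fit are collected and reported instead of raising an IndexError.

```python
import re
from pathlib import Path

# Hypothetical pattern: "file_", then digits, and nothing else in the stem.
NAME_PATTERN = re.compile(r"file_(\d+)")


def split_by_pattern(paths):
    """Return (matching paths sorted by number, paths that do not fit)."""
    numbered, rejected = [], []
    for path in paths:
        match = NAME_PATTERN.fullmatch(path.stem)
        if match:
            numbered.append((int(match.group(1)), path))
        else:
            rejected.append(path)
    # Sort on the number only, so Path objects are never compared.
    numbered.sort(key=lambda pair: pair[0])
    return [path for _, path in numbered], rejected


good, bad = split_by_pattern(
    Path(name) for name in ["file_10.txt", "file33.txt", "file_2.txt", "file_ 3.txt"]
)
print([p.name for p in good])  # ['file_2.txt', 'file_10.txt']
print([p.name for p in bad])   # ['file33.txt', 'file_ 3.txt']
```

In the real script the argument would be outputs.glob(search) instead of the hard-coded names, and the rejected list tells you which files were skipped.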