Aug-19-2021, 01:03 PM
(This post was last modified: Jun-24-2022, 10:15 AM by AlphaInc.)
Hello everybody,
With a batch script I export multiple text files to a specific folder, where I want to merge them into one.
For that I used the type command in the Windows cmd, which worked fine. But after comparing the files I noticed that the order is wrong. I found out that CMD sorts them differently (file_1,file_10,file_11,file_2,file_3,file_33) from how I would sort them (file_1,file_2,file_3,file_10,file_11,file_33).
Now I would like to use a Python script to merge them in the order I would sort them (file_1,file_2,file_3,file_10,file_11,file_33).
This is what I have so far, but I don't know how to go on:
#!/usr/bin/env python3
import os
import re

folder_path = "../Outputs/"
for root, dirs, files in os.walk(folder_path, topdown=False):
    for name in files:
        if name.endswith(".txt"):
            file_name = os.path.join(root, name)

Edit: All my files start with the string "file_" followed by a running number. The number of files varies from export to export, so I can't list them manually.
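To make the ordering problem concrete, here is a minimal sketch using the six names from the post. The helper name trailing_number is made up for this illustration, and it assumes every name has exactly the form file_<digits>.

```python
import re

names = ["file_1", "file_10", "file_11", "file_2", "file_3", "file_33"]


def trailing_number(name):
    # Hypothetical helper: take the digits after "file_" and compare them as an int.
    # Assumes the name matches file_<digits>; anything else would fail here.
    match = re.fullmatch(r"file_(\d+)", name)
    return int(match.group(1))


# Plain string comparison goes character by character, so "10" sorts before "2"
print(sorted(names))
# With a numeric key the order follows the numbers instead
print(sorted(names, key=trailing_number))
```

The first call reproduces the order seen in cmd, the second one the order wanted for the merge.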
Why are you calling os.walk() if you want to merge the files only in one folder?
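For comparison, a small sketch of listing only the .txt files that sit directly in one folder, without descending into subfolders the way os.walk() does. The function name txt_files_in is an assumption for this example; the folder path is the one from the first post.

```python
from pathlib import Path


def txt_files_in(folder):
    # glob() without "**" only looks at the direct children of folder,
    # so files in subfolders are not returned
    return [p for p in Path(folder).glob("*.txt") if p.is_file()]


if __name__ == "__main__":
    for path in txt_files_in("../Outputs/"):
        print(path)
```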
#!/usr/bin/env python3
from pathlib import Path


def sort_by_int(path):
    # Path has the stem attribute, which is
    # the filename without the last extension
    # to sort the paths by integer, you
    # need to get the integer part of the str
    # and convert it to an integer
    # the _ is the character where you can split
    # maxsplit=1 splits only once,
    # so you get two elements back
    # if the _ is missing, split returns only one element
    # and the index [1] raises an IndexError
    return int(path.stem.split("_", maxsplit=1)[1])


# Use the high level Path object
outputs = Path.home() / "Outputs"
# print(outputs)
# Path: /home/username/Outputs

# use glob for easier search
# rglob is to search recursively
# glob and rglob replicate the shell syntax
# the wildcard * matches any number of characters and a ? stands for one character
search = "file_*.txt"

# sorted takes a key argument, which is used to define how it's sorted
# sort_by_int just returns an int and the sorted function
# is using this number to sort
sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
# the result is a list
# sorted consumes the iterable object and returns a list
# with the sorted elements

# now using this sorted list with Path objects
for path in sorted_outputs:
    # glob does not differentiate between files, directories or other
    # elements
    # so you need to check if path is a file
    if path.is_file():
        # if it's a file, then open it
        # the Path object has the method open
        # like the open function, it supports a context manager
        with path.open() as fd:
            # iterating this file line by line
            # where the line end is not stripped away
            for line in fd:
                # print the line, but tell print not to add an additional
                # line end, because the line has already a line end
                print(line, end="")

# you can use the stdout to redirect the output
# in your shell to a file for example or netcat
# or gzip etc...
Using the program (I randomly named it searchp.py):
Output:
[andre@andre-Fujitsu-i5 ~]$ python3 searchp.py | gzip > output.txt.gz
[andre@andre-Fujitsu-i5 ~]$ zcat output.txt.gz
666
123
000
I created the Outputs directory in my home directory and put 3 files in, where each file had only one line with a newline character at the end.
If you have control of the file names, the easiest solution is to generate file names that sort properly: file_000, file_001, file_010, file_100.
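A short sketch of that idea in Python, assuming three digits are enough for the highest file number. The helper padded_name is hypothetical, and the .txt extension is taken from the first post.

```python
def padded_name(index, width=3):
    # The format spec 0{width}d left-pads the number with zeros,
    # so every name has the same length and plain string sorting works
    return f"file_{index:0{width}d}.txt"


names = [padded_name(i) for i in (100, 10, 1, 0)]
print(sorted(names))
# ['file_000.txt', 'file_001.txt', 'file_010.txt', 'file_100.txt']
```

The width has to cover the largest number that can occur; with more than 999 files a width of 3 would no longer sort correctly.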
(Aug-19-2021, 02:08 PM)DeaD_EyE Wrote: [script and sample output, as posted above]
Thanks for your reply. When I start the script on my Windows machine there is no output. Do I need to tweak your script?
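One possible cause of "no output" is that glob() simply finds nothing, for example because the folder does not exist where the script looks for it. A minimal diagnostic sketch, with describe_search being a name invented for this example:

```python
from pathlib import Path


def describe_search(folder, pattern="file_*.txt"):
    # Report whether the folder exists and how many paths the pattern matches
    folder = Path(folder)
    if not folder.is_dir():
        return f"{folder} is not an existing directory"
    matches = list(folder.glob(pattern))
    return f"{folder} contains {len(matches)} path(s) matching {pattern}"


# Same location the script above uses: an Outputs folder in the home directory
print(describe_search(Path.home() / "Outputs"))
```

If the message reports a missing directory or zero matches, the for loop in the script has nothing to iterate over and prints nothing.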
Aug-19-2021, 10:12 PM
(This post was last modified: Aug-19-2021, 10:12 PM by snippsat.)
(Aug-19-2021, 05:02 PM)AlphaInc Wrote: So when I start the script on my windows machine there is no output. Do I need to do some tweaking of your script ?
On Windows the home path will be C:\Users\<username>\Outputs. If you don't want to create this Outputs folder, you can give the path to the folder where your .txt files are instead.
E.g.
outputs = Path(r'G:\div_code\answer')
Here is a test in which I added code to save the lines to a file, because redirecting as shown by DeaD_EyE may not work for you: gzip and zcat, for example, are not part of Windows.
from pathlib import Path


def sort_by_int(path):
    # Path has the stem attribute, which is
    # the filename without the last extension
    # to sort the paths by integer, you
    # need to get the integer part of the str
    # and convert it to an integer
    # the _ is the character where you can split
    # maxsplit=1 splits only once,
    # so you get two elements back
    # if the _ is missing, split returns only one element
    # and the index [1] raises an IndexError
    return int(path.stem.split("_", maxsplit=1)[1])


# Use the high level Path object
#outputs = Path.home() / "Outputs"
outputs = Path(r'G:\div_code\answer')
#print(outputs)
# Path: /home/username/Outputs

# use glob for easier search
# rglob is to search recursively
# glob and rglob replicate the shell syntax
# the wildcard * matches any number of characters and a ? stands for one character
search = "file_*.txt"

# sorted takes a key argument, which is used to define how it's sorted
# sort_by_int just returns an int and the sorted function
# is using this number to sort
sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
#print(sorted_outputs)
# the result is a list
# sorted consumes the iterable object and returns a list
# with the sorted elements

# now using this sorted list with Path objects
lines = []
for path in sorted_outputs:
    # glob does not differentiate between files, directories or other
    # elements
    # so you need to check if path is a file
    if path.is_file():
        print(path)
        # if it's a file, then open it
        # the Path object has the method open
        # like the open function, it supports a context manager
        with path.open() as fd:
            # iterating this file line by line
            # where the line end is not stripped away
            for line in fd:
                # print the line, but tell print not to add an additional
                # line end, because the line has already a line end
                print(line, end="")
                # remove only the line end, so leading whitespace is kept
                lines.append(line.rstrip("\n"))

# you can use the stdout to redirect the output
# in your shell to a file for example or netcat
# or gzip etc...
# here the collected lines are written to lines.txt
# in the current working directory
with open('lines.txt', 'w') as f:
    f.write('\n'.join(lines))

Output:
G:\div_code\answer\file_1.txt
line1
G:\div_code\answer\file_3.txt
line2
G:\div_code\answer\file_10.txt
line3
G:\div_code\answer\file_33.txt
line4

lines.txt:
line1
line2
line3
line4
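If the files are large, collecting every line in a list first may not be ideal. As an alternative sketch (not the method used in the script above), the sorted files can be copied one after another into the target with shutil.copyfileobj. The names number_of and merge_files are made up for this example, and it assumes all matching names have the form file_<number>.txt.

```python
import shutil
from pathlib import Path


def number_of(path):
    # Same idea as sort_by_int: the part after the first "_" as an int
    return int(path.stem.split("_", maxsplit=1)[1])


def merge_files(folder, target, pattern="file_*.txt"):
    # Collect the matching regular files and sort them numerically
    sources = sorted(
        (p for p in Path(folder).glob(pattern) if p.is_file()),
        key=number_of,
    )
    # Binary mode copies the bytes unchanged, including the line ends
    with open(target, "wb") as out:
        for path in sources:
            with path.open("rb") as src:
                shutil.copyfileobj(src, out)
    return sources
```

A call could look like merge_files(r'G:\div_code\answer', 'merged.txt'), where merged.txt is only an example name. Note that a source file without a final line end will have its last line joined with the first line of the next file.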
Aug-20-2021, 05:48 AM
(This post was last modified: Aug-20-2021, 05:48 AM by AlphaInc.)
(Aug-19-2021, 10:12 PM)snippsat Wrote: [reply and script, as posted above]
Thanks for your help as well. I can see that the script processes all my files, but it does not create an output file.
Edit: My bad, I was looking for the wrong file name. I found it. Thank you.
(Aug-20-2021, 05:48 AM)AlphaInc Wrote: [previous post, as above]
Sorry once again, but I get an error:
Traceback (most recent call last):
  File "FileProcessing_11.py", line 30, in <module>
    sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
  File "FileProcessing_11.py", line 13, in sort_by_int
    return int(path.stem.split("_", maxsplit=1)[1])
IndexError: list index out of range
Could it be because the file names contain spaces?
Aug-20-2021, 10:14 AM
(This post was last modified: Aug-20-2021, 10:14 AM by snippsat.)
(Aug-20-2021, 08:12 AM)AlphaInc Wrote: Could it be because the files got spaces in it?
A couple of tips on how to troubleshoot this.
def sort_by_int(path):
    print(path)
    print(path.stem)
    # Path has the stem attribute, which is
    # ... (rest of the function as before)
By adding these two print() calls you see what happens before the error.
Test.
>>> f = Path(r'G:\div_code\answer\file_33.txt')
>>> f.stem
'file_33'
>>> f.stem.split('_', maxsplit=1)
['file', '33']
>>> f.stem.split('_', maxsplit=1)[1]
'33'
Now reproduce your error.
>>> f = Path(r'G:\div_code\answer\file33.txt')
>>> f.stem
'file33'
>>> f.stem.split('_', maxsplit=1)
['file33']
>>> f.stem.split('_', maxsplit=1)[1]
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
IndexError: list index out of range
In your first post all the files you show (file_1, file_2, file_3, file_10, etc.) had a _, so with those names it should work.
With those print() calls, or with repr() added (which shows everything, e.g. spaces), you will see every input file before the error.
print(repr(path))
print(repr(path.stem))
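Building on these tips, here is a sketch of a stricter approach that does not stop at the first unexpected name but collects such names so they can be inspected. The regular expression and the name split_by_pattern are assumptions for this example; it treats only stems of the exact form file_<digits> as valid.

```python
import re
from pathlib import Path

# Assumed naming scheme: "file_" followed only by digits
NUMBERED = re.compile(r"file_(\d+)")


def split_by_pattern(paths):
    numbered, skipped = [], []
    for path in paths:
        path = Path(path)
        match = NUMBERED.fullmatch(path.stem)
        if match:
            numbered.append((int(match.group(1)), path))
        else:
            # Name does not follow the scheme: keep it aside instead of raising
            skipped.append(path)
    numbered.sort(key=lambda pair: pair[0])
    return [path for _, path in numbered], skipped


ordered, skipped = split_by_pattern(
    ["file_10.txt", "file33.txt", "file_2.txt", "file_ 7.txt"]
)
print(ordered)
print(skipped)
```

Printing skipped with repr(), as suggested above, then shows exactly which names need attention.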