Sorting and Merging text-files [SOLVED]
Python Forum (https://python-forum.io), Forum: General Coding Help (https://python-forum.io/forum-8.html), Thread: /thread-34672.html
Sorting and Merging text-files [SOLVED] - AlphaInc - Aug-19-2021

Hello everybody,

With a batch script I export multiple text files to a specific folder, where I want to merge them into one. For that I used the type command in Windows cmd, which worked fine. But after comparing the files I noticed that the order is wrong. I found out that CMD sorts them differently (file_1, file_10, file_11, file_2, file_3, file_33) from how I would sort them (file_1, file_2, file_3, file_10, file_11, file_33).

Now I would like to use a Python script to merge them in the order I want (file_1, file_2, file_3, file_10, file_11, file_33). This is what I have so far, but I don't know how to go on:

    #!/usr/bin/env python3
    import os
    import re

    folder_path = "../Outputs/"
    for root, dirs, files in os.walk(folder_path, topdown=False):
        for name in files:
            if name.endswith(".txt"):
                file_name = os.path.join(root, name)

Edit: All my files start with the string "file_" followed by a running number. The number of files differs from export to export, so I can't set the list up manually.

RE: Sorting and Merging text-files - Gribouillis - Aug-19-2021

Why are you calling os.walk() if you want to merge the files only in one folder?
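The order described in the first post is what you get when file names are compared as strings, character by character: "file_10" sorts before "file_2" because "1" comes before "2". A minimal sketch of the difference, using made-up names and a hypothetical helper called numeric_suffix (neither is from the thread):

```python
# Hypothetical file names that follow the "file_<number>" pattern from the first post.
names = ["file_1", "file_10", "file_11", "file_2", "file_3", "file_33"]


def numeric_suffix(name):
    """Return the integer after the first underscore (assumes 'file_<number>')."""
    return int(name.split("_", maxsplit=1)[1])


# Plain string sort: names are compared character by character.
as_text = sorted(names)

# Sort with a key: the integers 1, 2, 3, 10, 11, 33 are compared instead.
as_numbers = sorted(names, key=numeric_suffix)

print(as_text)
print(as_numbers)
```

The key function only changes what is compared; the list that comes back still contains the original names.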
RE: Sorting and Merging text-files - DeaD_EyE - Aug-19-2021

    #!/usr/bin/env python3
    from pathlib import Path


    def sort_by_int(path):
        # Path has the stem attribute, which is
        # the filename without the last extension.
        # To sort the paths by integer, you
        # need to get the integer part of the str
        # and convert it to an integer.
        # The _ is the character where you can split.
        # maxsplit=1 splits only once,
        # so you get two elements back.
        # If the _ is missing, split returns only one element
        # and the [1] index raises an IndexError.
        return int(path.stem.split("_", maxsplit=1)[1])


    # Use the high-level Path object
    outputs = Path.home() / "Outputs"
    # print(outputs)
    # Path: /home/username/Outputs

    # Use glob for easier searching;
    # rglob searches recursively.
    # glob and rglob replicate the shell syntax:
    # the wildcard is one * and a ? stands for one character
    search = "file_*.txt"

    # sorted takes a key argument, which defines how to sort.
    # sort_by_int just returns an int and the sorted function
    # uses this number to sort
    sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
    # The result is a list:
    # sorted consumes the iterable object and returns a list
    # with the sorted elements

    # Now use this sorted list of Path objects
    for path in sorted_outputs:
        # glob does not distinguish between files, directories
        # or other elements,
        # so you need to check whether path is a file
        if path.is_file():
            # If it's a file, then open it.
            # The Path object has the method open;
            # like the open function, it supports a context manager
            with path.open() as fd:
                # Iterate over this file line by line;
                # the line end is not stripped away
                for line in fd:
                    # Print the line, but tell print not to add an additional
                    # line end, because the line already has a line end
                    print(line, end="")

    # You can use stdout to redirect the output
    # in your shell, for example to a file, or to netcat,
    # or gzip etc.
Using the program (I named it randomly searchp.py): I created the Outputs directory in my home directory and put 3 files in, where each file had only one line with a newline character at the end.
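The setup described above (a folder with three one-line files) can be reproduced in a throwaway directory. The following is only a sketch under assumptions: the helper name merge_sorted, the temporary directory and the file contents are invented for illustration, and the text is returned as a string instead of being printed line by line.

```python
import tempfile
from pathlib import Path


def sort_by_int(path):
    # Same idea as in the script above: the number after the first "_" in the stem.
    return int(path.stem.split("_", maxsplit=1)[1])


def merge_sorted(folder, pattern="file_*.txt"):
    """Return the concatenated text of all matching files, in numeric order."""
    parts = []
    for path in sorted(Path(folder).glob(pattern), key=sort_by_int):
        if path.is_file():
            parts.append(path.read_text())
    return "".join(parts)


# Throwaway directory with three one-line files (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    for number in (10, 2, 1):
        (Path(tmp) / f"file_{number}.txt").write_text(f"line from file_{number}\n")
    merged = merge_sorted(tmp)

print(merged, end="")
```

The files are created in the order 10, 2, 1 on purpose, so the result shows that the order comes from the sort key and not from the order of creation.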
RE: Sorting and Merging text-files - deanhystad - Aug-19-2021

If you have control of the file names, the easiest solution is to generate file names that sort properly: file_000, file_001, file_010, file_100.

RE: Sorting and Merging text-files - AlphaInc - Aug-19-2021

(Aug-19-2021, 02:08 PM) DeaD_EyE Wrote: [script quoted in full; see DeaD_EyE's post above]

Thanks for your reply. When I start the script on my Windows machine there is no output. Do I need to tweak your script?

RE: Sorting and Merging text-files - snippsat - Aug-19-2021

(Aug-19-2021, 05:02 PM) AlphaInc Wrote: So when I start the script on my windows machine there is no output. Do I need to do some tweaking of your script ?

On Windows the home path will be C:\Users\<username>\Outputs. If you don't want to create this Outputs folder, you can point the script at the folder where you have the .txt files, e.g.

    outputs = Path(r'G:\div_code\answer')

Below is my test. I also added code to save the lines to a file, because redirecting the output as shown by DeaD_EyE may not work for you: tools such as gzip and zcat are not part of Windows.
    from pathlib import Path


    def sort_by_int(path):
        # Path has the stem attribute, which is
        # the filename without the last extension.
        # To sort the paths by integer, you
        # need to get the integer part of the str
        # and convert it to an integer.
        # The _ is the character where you can split.
        # maxsplit=1 splits only once,
        # so you get two elements back.
        # If the _ is missing, split returns only one element
        # and the [1] index raises an IndexError.
        return int(path.stem.split("_", maxsplit=1)[1])


    # Use the high-level Path object
    # outputs = Path.home() / "Outputs"
    outputs = Path(r'G:\div_code\answer')
    # print(outputs)

    # Use glob for easier searching;
    # rglob searches recursively.
    # glob and rglob replicate the shell syntax:
    # the wildcard is one * and a ? stands for one character
    search = "file_*.txt"

    # sorted takes a key argument, which defines how to sort.
    # sort_by_int just returns an int and the sorted function
    # uses this number to sort
    sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
    # print(sorted_outputs)
    # The result is a list with the sorted elements

    # Now use this sorted list of Path objects
    lines = []
    for path in sorted_outputs:
        # glob does not distinguish between files, directories
        # or other elements,
        # so you need to check whether path is a file
        if path.is_file():
            print(path)
            # The Path object has the method open;
            # like the open function, it supports a context manager
            with path.open() as fd:
                # Iterate over this file line by line;
                # the line end is not stripped away
                for line in fd:
                    # Print the line without an additional line end,
                    # because the line already has one
                    print(line, end="")
                    # Store the line without its line end;
                    # the join below adds the separators back
                    lines.append(line.rstrip("\n"))

    # Save all collected lines to one file
    with open('lines.txt', 'w') as f:
        f.write('\n'.join(lines))
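Collecting every line in a list is fine for small exports. A possible variation, which is not from the thread, copies each file straight into the output file as it is read, so nothing has to be held in memory. The function name merge_into and the output name merged.txt are placeholders, and shutil.copyfileobj stands in for the line-by-line loop.

```python
import shutil
import tempfile
from pathlib import Path


def sort_by_int(path):
    # The number after the first "_" in the stem (assumes names like file_12.txt).
    return int(path.stem.split("_", maxsplit=1)[1])


def merge_into(folder, target, pattern="file_*.txt"):
    """Copy every matching file into target, in numeric order; return the file count."""
    folder = Path(folder)
    target = Path(target)
    count = 0
    with target.open("w") as out:
        for path in sorted(folder.glob(pattern), key=sort_by_int):
            if path.is_file():
                with path.open() as src:
                    # Copies the content in chunks from src to out.
                    shutil.copyfileobj(src, out)
                count += 1
    return count


# Small demonstration in a throwaway directory (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    for number in (3, 1, 2):
        (Path(tmp) / f"file_{number}.txt").write_text(f"{number}\n")
    copied = merge_into(tmp, Path(tmp) / "merged.txt")
    result = (Path(tmp) / "merged.txt").read_text()

print(copied)
print(result, end="")
```

The output name is chosen so that it does not match the search pattern; otherwise the merged file could be picked up as an input on the next run.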
RE: Sorting and Merging text-files - AlphaInc - Aug-20-2021

(Aug-19-2021, 10:12 PM) snippsat Wrote: On Windows the home path will be C:\Users\<username>\Outputs. You can give path to where you have the .txt files, if not want to make this Outputs folder.

Thanks also for your help. I can see that the script processes all my files, but it does not create an output file.

Edit: My bad, I was looking for the wrong name. I found it. Thank you.

RE: Sorting and Merging text-files - AlphaInc - Aug-20-2021

Sorry once again, but I get an error:

    Traceback (most recent call last):
      File "FileProcessing_11.py", line 30, in <module>
        sorted_outputs = sorted(outputs.glob(search), key=sort_by_int)
      File "FileProcessing_11.py", line 13, in sort_by_int
        return int(path.stem.split("_", maxsplit=1)[1])
    IndexError: list index out of range

Could it be because the file names have spaces in them?

RE: Sorting and Merging text-files - AlphaInc - Aug-20-2021

(Aug-20-2021, 05:48 AM) AlphaInc Wrote: Thanks also for your help. I can see that the Script process all my files but it does not create an output-file

RE: Sorting and Merging text-files - snippsat - Aug-20-2021

(Aug-20-2021, 08:12 AM) AlphaInc Wrote: Could it be because the files got spaces in it?

A couple of tips on how to troubleshoot this.

    def sort_by_int(path):
        print(path)
        print(path.stem)
        # Path has the stem attribute, which is

By adding these prints you see what happens before the error. Test.
    >>> f = Path(r'G:\div_code\answer\file_33.txt')
    >>> f.stem
    'file_33'
    >>> f.stem.split('_', maxsplit=1)
    ['file', '33']
    >>> f.stem.split('_', maxsplit=1)[1]
    '33'

Reproducing your error:

    >>> f = Path(r'G:\div_code\answer\file33.txt')
    >>> f.stem
    'file33'
    >>> f.stem.split('_', maxsplit=1)
    ['file33']
    >>> f.stem.split('_', maxsplit=1)[1]
    Traceback (most recent call last):
      File "<interactive input>", line 1, in <module>
    IndexError: list index out of range

In your first post all the files you showed (file_1, file_2, file_3, file_10, etc.) had a _, so it should work. With those print() calls, or with repr() added (which shows everything, e.g. spaces), you will see every input file before the error.

    print(repr(path))
    print(repr(path.stem))
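Besides printing each name, the sort step can be made to tolerate names that do not fit the expected pattern. The sketch below is one possible approach, not something proposed in the thread: it assumes the stem is exactly "file_" followed by digits, and the names NAME_PATTERN and split_by_name are invented for illustration. Names that do not match are returned separately instead of raising an error.

```python
import re
from pathlib import Path

# Assumed naming scheme: "file_" followed only by digits, e.g. file_33.
NAME_PATTERN = re.compile(r"file_(\d+)")


def split_by_name(paths):
    """Return (matching paths sorted by number, paths that do not match)."""
    numbered = []
    skipped = []
    for path in paths:
        match = NAME_PATTERN.fullmatch(path.stem)
        if match:
            numbered.append((int(match.group(1)), path))
        else:
            skipped.append(path)
    # Sort on the extracted number only.
    numbered.sort(key=lambda pair: pair[0])
    return [path for _, path in numbered], skipped


# Made-up candidates: two valid names, one without "_", one with a space.
candidates = [
    Path("file_10.txt"),
    Path("file33.txt"),
    Path("file_2.txt"),
    Path("file_ 7.txt"),
]
ordered, skipped = split_by_name(candidates)

print([p.name for p in ordered])
print([repr(p.name) for p in skipped])
```

Printing the skipped names with repr(), as suggested above, makes stray spaces visible.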