Python Forum

Full Version: split txt file data on the first column value
I have a bunch of text files that look like this.
H0002   Version 3                                                                                                                       
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-99                                                                                                                       
H0005   State   WAA                                                                                                                     
H0999   Tene_no/Combined_rept_no    E79/38975 
H1000	Sae_Id	GAM_E	GAM_N                                                                                                                      
H1001   Tene_holder Magnetic Resources NL  
I want to separate the lines based on the value in the first column. The first column starts with H, followed by a number. If the number is less than 1000, I want to save the line to file1.txt; if it is greater than or equal to 1000, I want to save it to a different file, file2.txt.

File1.txt
H0002   Version 3                                                                                                                       
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-99                                                                                                                       
H0005   State   WAA                                                                                                                     
H0999   Tene_no/Combined_rept_no    E79/38975 
File2.txt
H1000	Sae_Id	GAM_E	GAM_N                                                                                                                      
H1001   Tene_holder Magnetic Resources NL  
My python code:

import warnings
from pathlib import Path
import time
import argparse
import pandas as pd

pd.set_option('display.max_rows', None)

warnings.filterwarnings('ignore')

parser = argparse.ArgumentParser(description='Process some integers.')

parser.add_argument('-path', help='define the directory to folder/file')
parser.add_argument('-path_save', help='define where to save the file')
parser.add_argument('--verbose', help='display processing information')

start = time.time()


def main(path_txt, path_save, verbose):

    if path_txt.is_file():
        txt_files = [Path(path_txt)]  # For One File
    else:
        txt_files = list(Path(path_txt).glob("*.txt"))

    for fn in txt_files:
        with open(fn) as f:
            text = f.read().strip()
            print(text)
        
if __name__ == '__main__':
    start = time.time()
    args = parser.parse_args()
    path = Path(args.path)
    path_save = Path(args.path_save)
    verbose = args.verbose
    main(path, path_save, verbose)  # Calling Main Function
    print('Processed time:', time.time() - start)  # Total Time  



Any help on how to get this task done?
I think something like this is easier done without pandas:
from pathlib import Path
import os
import time

def split_data():
    # Start in same dir as script
    os.chdir(os.path.abspath(os.path.dirname(__file__)))

    homepath = Path('.')
    infile = homepath / 'File0.txt'
    out1 = homepath / 'File1.txt'
    out2 = homepath / 'File2.txt'

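    with infile.open() as fp, out1.open('w') as fout1, out2.open('w') as fout2:
        for line in fp:
            linex = line.strip().split()
            if not linex:
                continue  # skip blank lines
            # Numbers below 1000 go to File1.txt, the rest to File2.txt
            if int(linex[0][1:]) < 1000:
                fout1.write(line)
            else:
                fout2.write(line)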


if __name__ == '__main__':
    split_data()
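Here is a similar example with some comments.
In this code, I don't use pathlib.Path.
It uses some newer language features (str.removeprefix, and a with statement whose context managers are wrapped in parentheses). It should run with Python 3.9; the parenthesized with form is only officially documented from Python 3.10 on.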



#!/usr/bin/env python3

from argparse import ArgumentParser


def filter_file(input_file, small_file, big_file):
    # open first file as input (read-only text utf8)
    # open the second file as write-text utf8
    # open the third file as write-text utf8

    with (
        open(input_file, "rt", encoding="utf8") as fd_read,
        open(small_file, "wt", encoding="utf8") as fd_small,
        open(big_file, "wt", encoding="utf8") as fd_big,
    ):
        # after leaving this block, all 3 files are closed correctly
        # iterating over an open file object yields lines
        # (the same holds for file objects opened in binary mode)
        for line in fd_read:
            # split the Hxxxx value
            # we need only one split
            # and we need the first element [0]

            h_value = line.split(maxsplit=1)[0]
            
            # alternative
            # h_value, _ = line.split(maxsplit=1)
            # 2 targets: h_value and _
            # where the _ is a throwaway name
            # (this raises ValueError if the line has no second field)

            # giving the user some output
            print("Processing", h_value)

            # the h_value is still a str
            # to compare it with integer
            # you have to convert the value to an int
            # in addition you have to remove the prefix H,
            # to be able to convert the value
            # str.removeprefix and str.removesuffix were added in Python 3.9
            # the function int allows whitespace on the left and right side
            # of the str, but not between the digits
            value = int(h_value.removeprefix("H"))

            # now the decision: is the int greater than or equal to 1000?
            if value >= 1000:
                # big int -> big file
                # just write the whole line you get from the for-loop
                # don't use value or h_value, which is only the Hxxxx str
                fd_big.write(line)
            else:
                # if smaller than 1000, then write the
                # line to the small file
                fd_small.write(line)

        # here the block is left, all files are now closed


def get_args():
    parser = ArgumentParser()
    parser.add_argument("input_file", help="Input file where the data comes from")
    parser.add_argument("small_file", help="Small output file")
    parser.add_argument("big_file", help="Big output file")
    return parser.parse_args()


if __name__ == "__main__":
    options = vars(get_args())
    # parser.parse_args returns a Namespace object, which holds the arguments
    # for filter_file
    # the builtin function `vars` takes all attributes of the object
    # and puts them into a dict

    # the two stars unpack the dict into keyword arguments,
    # where the key is the argument name and the value is the input from the command line
    filter_file(**options)
    
The object returned by parser.parse_args() is an argparse.Namespace. It's a simple object whose attributes are the arguments you've added to the parser. As a note to myself: types.SimpleNamespace is smaller. An empty argparse.Namespace is 48 bytes, and an empty types.SimpleNamespace is only 40 bytes.
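To make the Namespace and vars() part a bit more concrete, here is a small sketch you can run on its own. The file names passed to parse_args are just placeholders for illustration, and I am assuming the sizes quoted above were measured with sys.getsizeof; the numbers it prints can differ between Python versions and platforms.

```python
import sys
from argparse import ArgumentParser, Namespace
from types import SimpleNamespace

parser = ArgumentParser()
parser.add_argument("input_file")
parser.add_argument("small_file")
parser.add_argument("big_file")

# Passing a list to parse_args avoids reading sys.argv,
# which is handy for experiments (the names are placeholders).
args = parser.parse_args(["File0.txt", "File1.txt", "File2.txt"])

# vars() returns the attributes of the Namespace as a dict,
# which can then be unpacked with ** into keyword arguments.
options = vars(args)
print(options)

# Shallow object sizes; the exact values depend on the interpreter.
print(sys.getsizeof(Namespace()), sys.getsizeof(SimpleNamespace()))
```

Because the dict keys match the parameter names of filter_file, filter_file(**options) is equivalent to passing the three values as keyword arguments.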

The benefit of code that uses only Python's standard library (batteries included) is that it runs without extra dependencies.
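Since the original question mentions a whole folder of text files, here is a rough sketch of how the same idea could be applied to every *.txt file in a directory. The function names (split_by_header_number, split_folder) and the output naming scheme (<name>_file1.txt and <name>_file2.txt) are my own assumptions, not something from the posts above. It avoids str.removeprefix and the parenthesized with statement, so it should also run on Python versions older than 3.9.

```python
from pathlib import Path


def split_by_header_number(in_path, small_path, big_path, threshold=1000):
    """Write lines whose H-number is below threshold to small_path, the rest to big_path."""
    with open(in_path, encoding="utf8") as src, \
            open(small_path, "w", encoding="utf8") as small, \
            open(big_path, "w", encoding="utf8") as big:
        for line in src:
            fields = line.split(maxsplit=1)
            # Skip blank lines and lines that do not start with "H" plus digits.
            if not fields or fields[0][0] != "H" or not fields[0][1:].isdecimal():
                continue
            target = small if int(fields[0][1:]) < threshold else big
            # Write the whole original line, not just the Hxxxx part.
            target.write(line)


def split_folder(in_dir, out_dir):
    """Split every *.txt file in in_dir; results go to out_dir (hypothetical naming)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # sorted() collects the file list before any output file is created.
    for txt in sorted(Path(in_dir).glob("*.txt")):
        split_by_header_number(
            txt,
            out_dir / f"{txt.stem}_file1.txt",
            out_dir / f"{txt.stem}_file2.txt",
        )
```

It is probably best to use an output folder that is different from the input folder. Otherwise a second run would pick up the previously generated files as input.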