Python Forum
split txt file data on the first column value
#1
I have a bunch of text files that look like this.
H0002   Version 3                                                                                                                       
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-99                                                                                                                       
H0005   State   WAA                                                                                                                     
H0999   Tene_no/Combined_rept_no    E79/38975 
H1000	Sae_Id	GAM_E	GAM_N                                                                                                                      
H1001   Tene_holder Magnetic Resources NL  
I want to separate the text data based on the first column value. The first column starts with H, followed by a number. If the number is less than 1000, I want to save the line to File1.txt; if the number is greater than or equal to 1000, I want to save it to a different file, File2.txt.

File1.txt
H0002   Version 3                                                                                                                       
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-99                                                                                                                       
H0005   State   WAA                                                                                                                     
H0999   Tene_no/Combined_rept_no    E79/38975 
File2.txt
H1000	Sae_Id	GAM_E	GAM_N                                                                                                                      
H1001   Tene_holder Magnetic Resources NL  
My Python code:

import warnings
from pathlib import Path
import time
import argparse
import pandas as pd

pd.set_option('display.max_rows', None)

warnings.filterwarnings('ignore')

parser = argparse.ArgumentParser(description='Process some integers.')

parser.add_argument('-path', help='define the directory to folder/file')
parser.add_argument('-path_save', help='define where to save the file')
parser.add_argument('--verbose', help='display processing information')

start = time.time()


def main(path_txt, path_save, verbose):

    if path_txt.is_file():
        txt_files = [Path(path_txt)]  # For One File
    else:
        txt_files = list(Path(path_txt).glob("*.txt"))

    for fn in txt_files:
        with open(fn) as f:
            text = f.read().strip()
            print(text)
        
if __name__ == '__main__':
    start = time.time()
    args = parser.parse_args()
    path = Path(args.path)
    path_save = Path(args.path_save)
    verbose = args.verbose
    main(path, path_save, verbose)  # Calling Main Function
    print('Processed time:', time.time() - start)  # Total Time  



Any help on how to get this task done?
#2
I think something like this is easier done without pandas:
from pathlib import Path
import os

def split_data():
    # Start in same dir as script
    os.chdir(os.path.abspath(os.path.dirname(__file__)))

    homepath = Path('.')
    infile = homepath / 'File0.txt'
    out1 = homepath / 'File1.txt'
    out2 = homepath / 'File2.txt'

    with infile.open() as fp, out1.open('w') as fout1, out2.open('w') as fout2:
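        for line in fp:
            linex = line.strip().split()
            if not linex:
                continue  # skip blank lines
            # H-numbers below 1000 go to File1.txt, the rest to File2.txt
            if int(linex[0][1:]) < 1000:
                fout1.write(line)
            else:
                fout2.write(line)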


if __name__ == '__main__':
    split_data()
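
Since the question mentions a whole folder of text files, the same loop can be wrapped in a function and run once per file. This is only a minimal sketch: the function names split_file and split_folder and the output naming scheme (<stem>_file1.txt and <stem>_file2.txt) are my assumptions, not something from the original post.

```python
from pathlib import Path


def split_file(infile, out_dir, threshold=1000):
    """Split one file on the number after the leading 'H' of each line."""
    infile, out_dir = Path(infile), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # hypothetical naming scheme, pick whatever suits your data
    low = out_dir / f"{infile.stem}_file1.txt"   # H-numbers below threshold
    high = out_dir / f"{infile.stem}_file2.txt"  # H-numbers at or above threshold

    with infile.open() as fp, low.open("w") as f_low, high.open("w") as f_high:
        for line in fp:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            number = int(fields[0][1:])  # "H0999" -> 999
            (f_low if number < threshold else f_high).write(line)
    return low, high


def split_folder(in_dir, out_dir):
    """Run split_file on every *.txt file directly inside in_dir."""
    return [split_file(fn, out_dir) for fn in sorted(Path(in_dir).glob("*.txt"))]
```

Use an output folder that differs from the input folder, otherwise a second run would pick up the generated files as input.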
#3
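A similar example with some comments.
This code doesn't use pathlib.Path.
It uses some newer language features: str.removeprefix needs Python 3.9 or newer, and the parenthesized with statement is only officially documented from Python 3.10 on, although CPython 3.9 already accepts it.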



#!/usr/bin/env python3

from argparse import ArgumentParser


def filter_file(input_file, small_file, big_file):
    # open the first file as input (read-only text, utf8)
    # open the second file as write-text utf8
    # open the third file as write-text utf8

    with (
        open(input_file, "rt", encoding="utf8") as fd_read,
        open(small_file, "wt", encoding="utf8") as fd_small,
        open(big_file, "wt", encoding="utf8") as fd_big,
    ):
        # after leaving this block, all 3 files are closed correctly
        # iterating over an open file-object yields lines
        # the same applies to file-objects opened in binary mode
        for line in fd_read:
            # split the Hxxxx value
            # we need only one split
            # and we need the first element [0]

            h_value = line.split(maxsplit=1)[0]
            
            # alternative (only works if the line has a second column)
            # h_value, _ = line.split(maxsplit=1)
            # 2 targets: h_value and _
            # where _ is a throwaway name

            # giving the user some output
            print("Processing", h_value)

            # h_value is still a str
            # to compare it with an integer,
            # you have to convert the value to an int
            # in addition you have to remove the prefix H
            # to be able to convert the value
            # str.removeprefix and str.removesuffix were added in Python 3.9
            # the function int allows whitespace on the left and right side
            # of the str, but not between the digits
            value = int(h_value.removeprefix("H"))

            # now the decision whether the int is greater than or equal to 1000
            if value >= 1000:
                # big int -> big file
                # just write the whole line you get from the for-loop
                # don't use value or h_value, which is only the Hxxxx str
                fd_big.write(line)
            else:
                # if smaller than 1000, then write the
                # line to the small file
                fd_small.write(line)

        # here the block is left, all files are now closed


def get_args():
    parser = ArgumentParser()
    parser.add_argument("input_file", help="Input file where the data comes from")
    parser.add_argument("small_file", help="Small output file")
    parser.add_argument("big_file", help="Big output file")
    return parser.parse_args()


if __name__ == "__main__":
    options = vars(get_args())
    # parser.parse_args returns a Namespace object, which holds the arguments
    # for filter_file
    # the builtin function `vars` gets all attributes from the object
    # and puts them into a dict

    # the two stars unpack the dict,
    # where each key is an argument name and each value is the input from the command line
    filter_file(**options)
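
The parsing steps described in the comments can also be tried on their own. A small sketch, assuming Python 3.9 or newer; the helper name h_number is hypothetical and not part of the script above.

```python
def h_number(line):
    """Return the integer after the leading 'H' in the first column of line."""
    first_column = line.split(maxsplit=1)[0]     # "H0999   Tene_no ..." -> "H0999"
    return int(first_column.removeprefix("H"))   # "0999" -> 999


print(h_number("H0999   Tene_no/Combined_rept_no    E79/38975"))  # 999
print(h_number("H1000\tSae_Id\tGAM_E\tGAM_N"))                     # 1000

# int() tolerates surrounding whitespace, but not whitespace between digits
print(int("  42\n"))  # 42
```

Note that a completely blank line would make the [0] lookup fail with an IndexError, so skip such lines first if your files contain them.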
    
The object returned by parser.parse_args() is an argparse.Namespace. It's a simple object with the attributes you've added to the parser. A side note for myself: types.SimpleNamespace is smaller. An empty Namespace from argparse is 48 bytes and an empty SimpleNamespace from types is only 40 bytes.
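
To check this on your own interpreter, here is a short sketch using sys.getsizeof. The exact byte counts depend on the Python version and platform, so treat the 48 and 40 bytes quoted above as one measurement and not as fixed numbers. It also shows what vars and the ** unpacking do with a Namespace; the function show and the file names are made up for the illustration.

```python
import sys
from argparse import Namespace
from types import SimpleNamespace

# getsizeof reports only the object itself, not the objects it refers to
print("argparse.Namespace:   ", sys.getsizeof(Namespace()))
print("types.SimpleNamespace:", sys.getsizeof(SimpleNamespace()))

# vars() returns the attributes of a Namespace as a dict ...
options = Namespace(input_file="in.txt", small_file="small.txt", big_file="big.txt")
as_dict = vars(options)
print(as_dict)


# ... and ** passes each key/value pair as a keyword argument
def show(input_file, small_file, big_file):
    return f"{input_file} -> {small_file}, {big_file}"


print(show(**as_dict))
```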

The benefit of code that uses only Python's standard library (batteries included) is that it runs without extra dependencies.