Python Forum

Full Version: Pattern Split String
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
I have an interesting challenge have few patterned records and i need to split it intelligently into a dictionary

Eg of records in file are like below. Each Record Type can have different definition. And I need to read that record and map its place holders correctly. The size of record type will remain static or fixed
<Counter 5 digit><Record Type><Rec Defn Attribute 1><Rec Defn Attribute 2>..etc

00001000A20181220233445 NAMEOFACTOR10023.431-84-203,FLAT-2A;BlockC,COUNTRY
00002000A20181220233445 NAMEOFACTOR20023.431-84-203,FLAT-2A;BlockC,COUNTRY

First 5 is counter,
Next 4 is record type defintion
Next 14 is datetime i.e in YYYYMMDDHHMISS format
Next 14 is Name of Person
Next 7 floating point number
Reminder address.

This is sample like this based on definition fo a record type the pattern will be different. How can i have a nice structural Python program which i can define the structure or pattern of a record and split each record in correct definition elements
Thanks Buran .... I have tried this any better sol than this

Can there be any format options rather than hard coding positions for splicer
ARRAY = SPLIT(STRING, FORMAT_STRING)
ARRAY = Split("Textjsdkhjsfhd", "%s20%10f")

or is it better the way i have below

get_vz450dic_frm_str("293456250  sdfjk   dsfsdfds")

def get_pattern_slicer():
    # Create a Master mapping keys for all record Type
    # Maintain a Structure of positions and split in a definition configuration yaml
    # [Field Name, String Start Position, Characters to split from position]
    _master_rec_defn = {}
    _master_rec_defn['000A'] = (['REC_CNTR',  0, 5 , 'Format String', 'Data Type'],
                                ['REC_TYPE',  5, 4],
                                ['DATETIME',  9, 14],
                                ['NAME' ,    23, 14],
                                ['AMOUNT',   37, 8],
                                ['ADDRESS_LINE_1', 45, 75] )

    _master_rec_defn['000B'] = ( ['REC_CNTR', 0, 5],
                                 ['REC_TYPE', 5, 4],
                                 ['BILL_NO', 9, 14],
                                 ['BILL_TAX', 23, 8],
                                 ['BILL_DUE_DATE', 31, 14])

    # Sample Test Data
    samplelines = []
    line1 = "00001000A20181022342320Mr.ABC DEF VAL  234.34ADDRESS LINE1      Contains some address data                              "
    line2 = "00002000BISSNO-CF123    0023.3420180722"
    line3 = "00003000BISSNO-CF124   12327.3420180810"
    samplelines.append(line1)
    samplelines.append(line2)
    samplelines.append(line3)

    # Output in this Array
    line_array_dic_map = []

    # Reading lines and converting into Hash Map
    # Repalce this with file reading
    itr = 0
    while itr < len(samplelines):
        str = samplelines[itr]

        # Always Record Type definition for any record will be between position 5 and 9
        if str[5:9] in _master_rec_defn:
            field_defn = _master_rec_defn[str[5:9]]
            for fld_attr in field_defn:
                dicout = {}
                dicout[fld_attr[0]] = str[int(fld_attr[1]):int(fld_attr[1]+fld_attr[2])]
                line_array_dic_map.append(dicout)

        itr += 1

    print(line_array_dic_map)

# Run the Function
get_pattern_slicer()
yes, you can do mapping. however it's the same technique - slicing

def parse_lines(lines):
    # Create a Master mapping keys for all record Type
    # Maintain a Structure of positions and split in a definition configuration yaml
    # [Field Name, String Start Position, Characters to split from position]
    mapping = {'000A':{'REC_CNTR':(0, 5), 'REC_TYPE':(5, 4),
                       'DATETIME':(9, 14), 'NAME':(23, 14),
                       'AMOUNT':(37, 8), 'ADDRESS_LINE_1':(45, 75)},
               '000B':{'REC_CNTR':(0, 5), 'REC_TYPE':(5, 4),
                       'BILL_NO':(9, 14), 'BILL_TAX':(23, 8),
                       'BILL_DUE_DATE':(31, 14)}}
    records = []
    for line in lines:
        rec_type = line[5:9]
        record = {key:line[start:start+length] for key, (start, length) in mapping[rec_type].items()}
        records.append(record)
    return records
 
# Sample Test Data
sample_lines = ["00001000A20181022342320Mr.ABC DEF VAL  234.34ADDRESS LINE1",
                "00002000BISSNO-CF123    0023.3420180722",
                "00003000BISSNO-CF124   12327.3420180810"]
 
print(parse_lines(lines=sample_lines))
Output:
[{'REC_TYPE': '000A', 'AMOUNT': ' 234.34', 'ADDRESS_LINE_1': 'ADDRESS LINE1', 'NAME': 'Mr.ABC DEF VAL', 'REC_CNTR': '00001', 'DATETIME': '20181022342320'}, {'BILL_DUE_DATE': '20180722', 'REC_CNTR': '00002', 'BILL_NO': 'ISSNO-CF123 ', 'REC_TYPE': '000B', 'BILL_TAX': ' 0023.34'}, {'BILL_DUE_DATE': '20180810', 'REC_CNTR': '00003', 'BILL_NO': 'ISSNO-CF124 ', 'REC_TYPE': '000B', 'BILL_TAX': '12327.34'}]
you can use OrderedDict from collections or namedtupple for each field for more readability, you can go OOP, etc.
for two more, different approaches, one using struct and one using itertools, see this
https://stackoverflow.com/a/4915359/4046632
from collections import namedtuple
def parse(line):
    Field = namedtuple('Field', field_names=('name', 'start', 'length'))
    mapping = {'000A':(Field('REC_CNTR', 0, 5), Field('REC_TYPE', 5, 4),
                       Field('DATETIME', 9, 14), Field('NAME', 23, 14),
                       Field('AMOUNT', 37, 8), Field('ADDRESS_LINE_1', 45, 75)),
               '000B':(Field('REC_CNTR', 0, 5), Field('REC_TYPE', 5, 4),
                       Field('BILL_NO', 9, 14), Field('BILL_TAX', 23, 8),
                       Field('BILL_DUE_DATE', 31, 14))}
    rec_type = line[5:9]
    # rec_field_names = (fld.name for fld in mapping[rec_type])
    # Record = namedtuple("Record", field_names=rec_field_names)
    # return Record(**{fld.name:line[fld.start:fld.start+fld.length] for fld in mapping[rec_type]})
    return {fld.name:line[fld.start:fld.start+fld.length] for fld in mapping[rec_type]}


# Sample Test Data
sample_lines = ["00001000A20181022342320Mr.ABC DEF VAL  234.34ADDRESS LINE1",
                "00002000BISSNO-CF123    0023.3420180722",
                "00003000BISSNO-CF124   12327.3420180810"]
 
data = [parse(line) for line in sample_lines]
for rec in data:
    print(rec)
this will output
Output:
{'AMOUNT': ' 234.34', 'REC_CNTR': '00001', 'DATETIME': '20181022342320', 'ADDRESS_LINE_1': 'ADDRESS LINE1', 'NAME': 'Mr.ABC DEF VAL', 'REC_TYPE': '000A'} {'BILL_NO': 'ISSNO-CF123 ', 'BILL_DUE_DATE': '20180722', 'BILL_TAX': ' 0023.34', 'REC_CNTR': '00002', 'REC_TYPE': '000B'} {'BILL_NO': 'ISSNO-CF124 ', 'BILL_DUE_DATE': '20180810', 'BILL_TAX': '12327.34', 'REC_CNTR': '00003', 'REC_TYPE': '000B'}
if you uncomment lines #11-#13 and comment out line#14 you will go one step further and use namedtuple also for record, not only fields
Output:
Record(REC_CNTR='00001', REC_TYPE='000A', DATETIME='20181022342320', NAME='Mr.ABC DEF VAL', AMOUNT=' 234.34', ADDRESS_LINE_1='ADDRESS LINE1') Record(REC_CNTR='00002', REC_TYPE='000B', BILL_NO='ISSNO-CF123 ', BILL_TAX=' 0023.34', BILL_DUE_DATE='20180722') Record(REC_CNTR='00003', REC_TYPE='000B', BILL_NO='ISSNO-CF124 ', BILL_TAX='12327.34', BILL_DUE_DATE='20180810')
Perfect !!! Thanks a lot Buran

I was looking exactly that encode and decode format string as its done in Perl

Struct is the fastest in time processing so will take that route.