Pattern Split String

sripylearn · Aug-10-2018, 04:03 PM

I have an interesting challenge have few patterned records and i need to split it intelligently into a dictionary

Eg of records in file are like below. Each Record Type can have different definition. And I need to read that record and map its place holders correctly. The size of record type will remain static or fixed
<Counter 5 digit><Record Type><Rec Defn Attribute 1><Rec Defn Attribute 2>..etc

00001000A20181220233445 NAMEOFACTOR10023.431-84-203,FLAT-2A;BlockC,COUNTRY
00002000A20181220233445 NAMEOFACTOR20023.431-84-203,FLAT-2A;BlockC,COUNTRY

First 5 is counter,
Next 4 is record type defintion
Next 14 is datetime i.e in YYYYMMDDHHMISS format
Next 14 is Name of Person
Next 7 floating point number
Reminder address.

This is sample like this based on definition fo a record type the pattern will be different. How can i have a nice structural Python program which i can define the structure or pattern of a record and split each record in correct definition elements

**buran** · Aug-10-2018, 04:10 PM

use slicing
https://docs.python.org/3/tutorial/intro...ml#strings

https://www.digitalocean.com/community/t...n-python-3

sripylearn · Aug-10-2018, 05:19 PM

Thanks Buran .... I have tried this any better sol than this

Can there be any format options rather than hard coding positions for splicer
ARRAY = SPLIT(STRING, FORMAT_STRING)
ARRAY = Split("Textjsdkhjsfhd", "%s20%10f")

or is it better the way i have below

get_vz450dic_frm_str("293456250  sdfjk   dsfsdfds")

def get_pattern_slicer():
    # Create a Master mapping keys for all record Type
    # Maintain a Structure of positions and split in a definition configuration yaml
    # [Field Name, String Start Position, Characters to split from position]
    _master_rec_defn = {}
    _master_rec_defn['000A'] = (['REC_CNTR',  0, 5 , 'Format String', 'Data Type'],
                                ['REC_TYPE',  5, 4],
                                ['DATETIME',  9, 14],
                                ['NAME' ,    23, 14],
                                ['AMOUNT',   37, 8],
                                ['ADDRESS_LINE_1', 45, 75] )

    _master_rec_defn['000B'] = ( ['REC_CNTR', 0, 5],
                                 ['REC_TYPE', 5, 4],
                                 ['BILL_NO', 9, 14],
                                 ['BILL_TAX', 23, 8],
                                 ['BILL_DUE_DATE', 31, 14])

    # Sample Test Data
    samplelines = []
    line1 = "00001000A20181022342320Mr.ABC DEF VAL  234.34ADDRESS LINE1      Contains some address data                              "
    line2 = "00002000BISSNO-CF123    0023.3420180722"
    line3 = "00003000BISSNO-CF124   12327.3420180810"
    samplelines.append(line1)
    samplelines.append(line2)
    samplelines.append(line3)

    # Output in this Array
    line_array_dic_map = []

    # Reading lines and converting into Hash Map
    # Repalce this with file reading
    itr = 0
    while itr < len(samplelines):
        str = samplelines[itr]

        # Always Record Type definition for any record will be between position 5 and 9
        if str[5:9] in _master_rec_defn:
            field_defn = _master_rec_defn[str[5:9]]
            for fld_attr in field_defn:
                dicout = {}
                dicout[fld_attr[0]] = str[int(fld_attr[1]):int(fld_attr[1]+fld_attr[2])]
                line_array_dic_map.append(dicout)

        itr += 1

    print(line_array_dic_map)

# Run the Function
get_pattern_slicer()

**buran** · (This post was last modified: Aug-10-2018, 07:01 PM by buran.)

yes, you can do mapping. however it's the same technique - slicing

def parse_lines(lines):
    # Create a Master mapping keys for all record Type
    # Maintain a Structure of positions and split in a definition configuration yaml
    # [Field Name, String Start Position, Characters to split from position]
    mapping = {'000A':{'REC_CNTR':(0, 5), 'REC_TYPE':(5, 4),
                       'DATETIME':(9, 14), 'NAME':(23, 14),
                       'AMOUNT':(37, 8), 'ADDRESS_LINE_1':(45, 75)},
               '000B':{'REC_CNTR':(0, 5), 'REC_TYPE':(5, 4),
                       'BILL_NO':(9, 14), 'BILL_TAX':(23, 8),
                       'BILL_DUE_DATE':(31, 14)}}
    records = []
    for line in lines:
        rec_type = line[5:9]
        record = {key:line[start:start+length] for key, (start, length) in mapping[rec_type].items()}
        records.append(record)
    return records
 
# Sample Test Data
sample_lines = ["00001000A20181022342320Mr.ABC DEF VAL  234.34ADDRESS LINE1",
                "00002000BISSNO-CF123    0023.3420180722",
                "00003000BISSNO-CF124   12327.3420180810"]
 
print(parse_lines(lines=sample_lines))

Output:
[{'REC_TYPE': '000A', 'AMOUNT': '  234.34', 'ADDRESS_LINE_1': 'ADDRESS LINE1', 'NAME': 'Mr.ABC DEF VAL', 'REC_CNTR': '00001', 'DATETIME': '20181022342320'}, {'BILL_DUE_DATE': '20180722', 'REC_CNTR': '00002', 'BILL_NO': 'ISSNO-CF123   ', 'REC_TYPE': '000B', 'BILL_TAX': ' 0023.34'}, {'BILL_DUE_DATE': '20180810', 'REC_CNTR': '00003', 'BILL_NO': 'ISSNO-CF124   ', 'REC_TYPE': '000B', 'BILL_TAX': '12327.34'}]

you can use OrderedDict from collections or namedtupple for each field for more readability, you can go OOP, etc.

**buran** · (This post was last modified: Aug-10-2018, 07:39 PM by buran.)

for two more, different approaches, one using struct and one using itertools, see this
https://stackoverflow.com/a/4915359/4046632

**buran** · Aug-10-2018, 08:07 PM

from collections import namedtuple
def parse(line):
    Field = namedtuple('Field', field_names=('name', 'start', 'length'))
    mapping = {'000A':(Field('REC_CNTR', 0, 5), Field('REC_TYPE', 5, 4),
                       Field('DATETIME', 9, 14), Field('NAME', 23, 14),
                       Field('AMOUNT', 37, 8), Field('ADDRESS_LINE_1', 45, 75)),
               '000B':(Field('REC_CNTR', 0, 5), Field('REC_TYPE', 5, 4),
                       Field('BILL_NO', 9, 14), Field('BILL_TAX', 23, 8),
                       Field('BILL_DUE_DATE', 31, 14))}
    rec_type = line[5:9]
    # rec_field_names = (fld.name for fld in mapping[rec_type])
    # Record = namedtuple("Record", field_names=rec_field_names)
    # return Record(**{fld.name:line[fld.start:fld.start+fld.length] for fld in mapping[rec_type]})
    return {fld.name:line[fld.start:fld.start+fld.length] for fld in mapping[rec_type]}


# Sample Test Data
sample_lines = ["00001000A20181022342320Mr.ABC DEF VAL  234.34ADDRESS LINE1",
                "00002000BISSNO-CF123    0023.3420180722",
                "00003000BISSNO-CF124   12327.3420180810"]
 
data = [parse(line) for line in sample_lines]
for rec in data:
    print(rec)

this will output

Output:{'AMOUNT': '  234.34', 'REC_CNTR': '00001', 'DATETIME': '20181022342320', 'ADDRESS_LINE_1': 'ADDRESS LINE1', 'NAME': 'Mr.ABC DEF VAL', 'REC_TYPE': '000A'}
{'BILL_NO': 'ISSNO-CF123   ', 'BILL_DUE_DATE': '20180722', 'BILL_TAX': ' 0023.34', 'REC_CNTR': '00002', 'REC_TYPE': '000B'}
{'BILL_NO': 'ISSNO-CF124   ', 'BILL_DUE_DATE': '20180810', 'BILL_TAX': '12327.34', 'REC_CNTR': '00003', 'REC_TYPE': '000B'}

if you uncomment lines #11-#13 and comment out line#14 you will go one step further and use namedtuple also for record, not only fields

Output:Record(REC_CNTR='00001', REC_TYPE='000A', DATETIME='20181022342320', NAME='Mr.ABC DEF VAL', AMOUNT='  234.34', ADDRESS_LINE_1='ADDRESS LINE1')
Record(REC_CNTR='00002', REC_TYPE='000B', BILL_NO='ISSNO-CF123   ', BILL_TAX=' 0023.34', BILL_DUE_DATE='20180722')
Record(REC_CNTR='00003', REC_TYPE='000B', BILL_NO='ISSNO-CF124   ', BILL_TAX='12327.34', BILL_DUE_DATE='20180810')

sripylearn · Aug-12-2018, 04:20 AM

Perfect !!! Thanks a lot Buran

I was looking exactly that encode and decode format string as its done in Perl

Struct is the fastest in time processing so will take that route.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Unable to understand the function string.split()	Hudjefa	8	2,265	Sep-16-2024, 04:25 AM Last Post: Pedroski55
	doing string split with 2 or more split characters	Skaperen	22	6,088	Aug-13-2023, 01:57 AM Last Post: Skaperen
	How to split a String from Text Input into 40 char chunks?	lastyle	7	2,551	Aug-01-2023, 09:36 AM Last Post: Pedroski55
	[split] Parse Nested JSON String in Python	mmm07	4	2,696	Mar-28-2023, 06:07 PM Last Post: snippsat
	Split string using variable found in a list	japo85	2	2,125	Jul-11-2022, 08:52 AM Last Post: japo85
	Split string	knob	2	2,559	Nov-19-2021, 10:27 AM Last Post: ghoul
	Split string between two different delimiters, with exceptions	DreamingInsanity	2	2,917	Aug-24-2020, 08:23 AM Last Post: DreamingInsanity
	split string	enigma619	1	2,652	May-20-2020, 02:47 PM Last Post: perfringo
	Split string with multiple delimiters and keep the string in "groups"	DreamingInsanity	4	13,114	May-12-2020, 09:31 AM Last Post: DeaD_EyE
	Split a long string into other strings with no delimiters/characters	krewlaz	4	3,815	Nov-15-2019, 02:48 PM Last Post: ichabod801

Pattern Split String

User Panel Messages

Announcements