Python Forum
Pattern Split String
#1
I have an interesting challenge: I have a few fixed-width, patterned records and I need to split each one intelligently into a dictionary.

Examples of the records in the file are shown below. Each record type can have a different definition, and I need to read each record and map its placeholders correctly. The size of each record type is fixed.
<Counter 5 digit><Record Type><Rec Defn Attribute 1><Rec Defn Attribute 2>..etc

00001000A20181220233445 NAMEOFACTOR10023.431-84-203,FLAT-2A;BlockC,COUNTRY
00002000A20181220233445 NAMEOFACTOR20023.431-84-203,FLAT-2A;BlockC,COUNTRY

First 5 characters are the counter
Next 4 are the record type definition
Next 14 are the datetime, in YYYYMMDDHHMISS format
Next 14 are the name of the person
Next 7 are a floating-point number
The remainder is the address

This is just a sample; the pattern differs depending on the definition of each record type. How can I write a nicely structured Python program in which I can define the structure or pattern of a record and split each record into its correctly defined elements?
Reply
#2
use slicing
https://docs.python.org/3/tutorial/intro...ml#strings

https://www.digitalocean.com/community/t...n-python-3
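
As a minimal sketch of what slicing looks like for a fixed-width record: the sample line below is hypothetical, built to follow the first three field widths from post #1 (5-character counter, 4-character record type, 14-character datetime), and the variable names are illustrative only.

```python
# Hypothetical sample record: counter (5), record type (4), datetime (14), then the rest.
line = "00001000A20181220233445REST OF RECORD"

counter = line[0:5]      # characters 0..4
rec_type = line[5:9]     # characters 5..8
timestamp = line[9:23]   # characters 9..22
remainder = line[23:]    # everything from character 23 onward

print(counter, rec_type, timestamp, remainder)
```

A slice `line[start:end]` excludes the character at `end`, so a field that starts at `start` and is `length` characters wide is `line[start:start + length]`.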
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#3
Thanks, Buran. I have tried this; is there any better solution?

Are there any format options, rather than hard-coding positions for the slicer? Something like:
ARRAY = SPLIT(STRING, FORMAT_STRING)
ARRAY = Split("Textjsdkhjsfhd", "%s20%10f")

Or is the approach I have below better?

get_vz450dic_frm_str("293456250  sdfjk   dsfsdfds")

def get_pattern_slicer():
    # Create a Master mapping keys for all record Type
    # Maintain a Structure of positions and split in a definition configuration yaml
    # [Field Name, String Start Position, Characters to split from position]
    _master_rec_defn = {}
    _master_rec_defn['000A'] = (['REC_CNTR',  0, 5 , 'Format String', 'Data Type'],
                                ['REC_TYPE',  5, 4],
                                ['DATETIME',  9, 14],
                                ['NAME' ,    23, 14],
                                ['AMOUNT',   37, 8],
                                ['ADDRESS_LINE_1', 45, 75] )

    _master_rec_defn['000B'] = ( ['REC_CNTR', 0, 5],
                                 ['REC_TYPE', 5, 4],
                                 ['BILL_NO', 9, 14],
                                 ['BILL_TAX', 23, 8],
                                 ['BILL_DUE_DATE', 31, 14])

    # Sample Test Data
    samplelines = []
    line1 = "00001000A20181022342320Mr.ABC DEF VAL  234.34ADDRESS LINE1      Contains some address data                              "
    line2 = "00002000BISSNO-CF123    0023.3420180722"
    line3 = "00003000BISSNO-CF124   12327.3420180810"
    samplelines.append(line1)
    samplelines.append(line2)
    samplelines.append(line3)

    # Output in this Array
    line_array_dic_map = []

    # Read the lines and convert each one into a dictionary
    # Replace this with file reading
    for line in samplelines:
        # The record type of any record is always at positions 5 to 9
        rec_type = line[5:9]
        if rec_type in _master_rec_defn:
            # Build one dictionary per record, with one key per field
            dicout = {}
            for fld_attr in _master_rec_defn[rec_type]:
                name, start, length = fld_attr[:3]
                dicout[name] = line[start:start + length]
            line_array_dic_map.append(dicout)


    print(line_array_dic_map)

# Run the Function
get_pattern_slicer()
Reply
#4
yes, you can do mapping. however it's the same technique - slicing

def parse_lines(lines):
    # Master mapping of field definitions, keyed by record type.
    # This structure could be maintained in a YAML configuration file.
    # {record type: {field name: (start position, number of characters)}}
    mapping = {'000A':{'REC_CNTR':(0, 5), 'REC_TYPE':(5, 4),
                       'DATETIME':(9, 14), 'NAME':(23, 14),
                       'AMOUNT':(37, 8), 'ADDRESS_LINE_1':(45, 75)},
               '000B':{'REC_CNTR':(0, 5), 'REC_TYPE':(5, 4),
                       'BILL_NO':(9, 14), 'BILL_TAX':(23, 8),
                       'BILL_DUE_DATE':(31, 14)}}
    records = []
    for line in lines:
        rec_type = line[5:9]
        record = {key:line[start:start+length] for key, (start, length) in mapping[rec_type].items()}
        records.append(record)
    return records
 
# Sample Test Data
sample_lines = ["00001000A20181022342320Mr.ABC DEF VAL  234.34ADDRESS LINE1",
                "00002000BISSNO-CF123    0023.3420180722",
                "00003000BISSNO-CF124   12327.3420180810"]
 
print(parse_lines(lines=sample_lines))
Output:
[{'REC_TYPE': '000A', 'AMOUNT': '  234.34', 'ADDRESS_LINE_1': 'ADDRESS LINE1', 'NAME': 'Mr.ABC DEF VAL', 'REC_CNTR': '00001', 'DATETIME': '20181022342320'}, {'BILL_DUE_DATE': '20180722', 'REC_CNTR': '00002', 'BILL_NO': 'ISSNO-CF123   ', 'REC_TYPE': '000B', 'BILL_TAX': ' 0023.34'}, {'BILL_DUE_DATE': '20180810', 'REC_CNTR': '00003', 'BILL_NO': 'ISSNO-CF124   ', 'REC_TYPE': '000B', 'BILL_TAX': '12327.34'}]
You can use OrderedDict from collections, or a namedtuple for each field, for more readability; you can also go OOP, etc.
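
One possible way to "go OOP" is sketched below, assuming you also want each field converted to a Python type. `FieldSpec`, `LAYOUT_000B` and `parse_typed` are hypothetical names, not part of the code above; the offsets are the ones from the '000B' mapping, and the due date is assumed to be in YYYYMMDD form, as in the sample lines.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass(frozen=True)
class FieldSpec:
    """Describes one fixed-width field (hypothetical helper class)."""
    name: str
    start: int
    length: int
    convert: Callable[[str], object]  # turns the raw slice into a Python value


def to_date(text):
    # Assumption: the due date is YYYYMMDD, possibly padded with spaces.
    return datetime.strptime(text.strip(), "%Y%m%d").date()


# Layout for record type '000B', reusing the offsets from the mapping above.
LAYOUT_000B = (
    FieldSpec("REC_CNTR", 0, 5, int),
    FieldSpec("REC_TYPE", 5, 4, str.strip),
    FieldSpec("BILL_NO", 9, 14, str.strip),
    FieldSpec("BILL_TAX", 23, 8, float),
    FieldSpec("BILL_DUE_DATE", 31, 14, to_date),
)


def parse_typed(line, layout):
    # Slice each field out of the line and apply its converter.
    return {f.name: f.convert(line[f.start:f.start + f.length]) for f in layout}


record = parse_typed("00002000BISSNO-CF123    0023.3420180722", LAYOUT_000B)
print(record)
```

Keeping the converter next to the offsets means a new record type only needs a new layout tuple; the parsing function itself stays unchanged.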
Reply
#5
for two more, different approaches, one using struct and one using itertools, see this
https://stackoverflow.com/a/4915359/4046632
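
For orientation, here is a rough sketch of how the struct approach might look for the '000B' record type from this thread. It is not the code from the linked answer; `NAMES_000B`, `STRUCT_000B` and `parse_with_struct` are hypothetical names, and the sketch assumes the data is plain ASCII, since struct works on bytes rather than str.

```python
import struct

# Field names and widths taken from the '000B' mapping earlier in the thread.
NAMES_000B = ("REC_CNTR", "REC_TYPE", "BILL_NO", "BILL_TAX", "BILL_DUE_DATE")
STRUCT_000B = struct.Struct("5s4s14s8s14s")  # one 'Ns' item per field, N = width


def parse_with_struct(line):
    # unpack() needs exactly STRUCT_000B.size bytes, so pad short lines
    # with spaces and cut off anything beyond the declared widths.
    raw = line.encode("ascii").ljust(STRUCT_000B.size)[:STRUCT_000B.size]
    values = STRUCT_000B.unpack(raw)
    return {name: value.decode("ascii") for name, value in zip(NAMES_000B, values)}


rec = parse_with_struct("00002000BISSNO-CF123    0023.3420180722")
print(rec)
```

The format string plays the role of the FORMAT_STRING asked about in post #3: the layout is a single string of widths rather than a table of start positions.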
Reply
#6
from collections import namedtuple
def parse(line):
    Field = namedtuple('Field', field_names=('name', 'start', 'length'))
    mapping = {'000A':(Field('REC_CNTR', 0, 5), Field('REC_TYPE', 5, 4),
                       Field('DATETIME', 9, 14), Field('NAME', 23, 14),
                       Field('AMOUNT', 37, 8), Field('ADDRESS_LINE_1', 45, 75)),
               '000B':(Field('REC_CNTR', 0, 5), Field('REC_TYPE', 5, 4),
                       Field('BILL_NO', 9, 14), Field('BILL_TAX', 23, 8),
                       Field('BILL_DUE_DATE', 31, 14))}
    rec_type = line[5:9]
    # rec_field_names = (fld.name for fld in mapping[rec_type])
    # Record = namedtuple("Record", field_names=rec_field_names)
    # return Record(**{fld.name:line[fld.start:fld.start+fld.length] for fld in mapping[rec_type]})
    return {fld.name:line[fld.start:fld.start+fld.length] for fld in mapping[rec_type]}


# Sample Test Data
sample_lines = ["00001000A20181022342320Mr.ABC DEF VAL  234.34ADDRESS LINE1",
                "00002000BISSNO-CF123    0023.3420180722",
                "00003000BISSNO-CF124   12327.3420180810"]
 
data = [parse(line) for line in sample_lines]
for rec in data:
    print(rec)
this will output
Output:
{'AMOUNT': '  234.34', 'REC_CNTR': '00001', 'DATETIME': '20181022342320', 'ADDRESS_LINE_1': 'ADDRESS LINE1', 'NAME': 'Mr.ABC DEF VAL', 'REC_TYPE': '000A'}
{'BILL_NO': 'ISSNO-CF123   ', 'BILL_DUE_DATE': '20180722', 'BILL_TAX': ' 0023.34', 'REC_CNTR': '00002', 'REC_TYPE': '000B'}
{'BILL_NO': 'ISSNO-CF124   ', 'BILL_DUE_DATE': '20180810', 'BILL_TAX': '12327.34', 'REC_CNTR': '00003', 'REC_TYPE': '000B'}
If you uncomment the three commented-out lines in parse() and comment out the final return statement, you go one step further and use a namedtuple for the record as well, not only for the fields.
Output:
Record(REC_CNTR='00001', REC_TYPE='000A', DATETIME='20181022342320', NAME='Mr.ABC DEF VAL', AMOUNT='  234.34', ADDRESS_LINE_1='ADDRESS LINE1')
Record(REC_CNTR='00002', REC_TYPE='000B', BILL_NO='ISSNO-CF123   ', BILL_TAX=' 0023.34', BILL_DUE_DATE='20180722')
Record(REC_CNTR='00003', REC_TYPE='000B', BILL_NO='ISSNO-CF124   ', BILL_TAX='12327.34', BILL_DUE_DATE='20180810')
Reply
#7
Perfect! Thanks a lot, Buran.

I was looking for exactly that: encoding and decoding with a format string, as it's done in Perl.

struct is the fastest in processing time, so I will take that route.
Reply

