Posts: 8
Threads: 3
Joined: Aug 2018
I have an interesting challenge have few patterned records and i need to split it intelligently into a dictionary
Eg of records in file are like below. Each Record Type can have different definition. And I need to read that record and map its place holders correctly. The size of record type will remain static or fixed
<Counter 5 digit><Record Type><Rec Defn Attribute 1><Rec Defn Attribute 2>..etc
00001000A20181220233445 NAMEOFACTOR10023.431-84-203,FLAT-2A;BlockC,COUNTRY
00002000A20181220233445 NAMEOFACTOR20023.431-84-203,FLAT-2A;BlockC,COUNTRY
First 5 is counter,
Next 4 is record type defintion
Next 14 is datetime i.e in YYYYMMDDHHMISS format
Next 14 is Name of Person
Next 7 floating point number
Reminder address.
This is sample like this based on definition fo a record type the pattern will be different. How can i have a nice structural Python program which i can define the structure or pattern of a record and split each record in correct definition elements
Posts: 8,154
Threads: 160
Joined: Sep 2016
Posts: 8
Threads: 3
Joined: Aug 2018
Thanks Buran .... I have tried this any better sol than this
Can there be any format options rather than hard coding positions for splicer
ARRAY = SPLIT(STRING, FORMAT_STRING)
ARRAY = Split("Textjsdkhjsfhd", "%s20%10f")
or is it better the way i have below
get_vz450dic_frm_str("293456250 sdfjk dsfsdfds")
def get_pattern_slicer():
# Create a Master mapping keys for all record Type
# Maintain a Structure of positions and split in a definition configuration yaml
# [Field Name, String Start Position, Characters to split from position]
_master_rec_defn = {}
_master_rec_defn['000A'] = (['REC_CNTR', 0, 5 , 'Format String', 'Data Type'],
['REC_TYPE', 5, 4],
['DATETIME', 9, 14],
['NAME' , 23, 14],
['AMOUNT', 37, 8],
['ADDRESS_LINE_1', 45, 75] )
_master_rec_defn['000B'] = ( ['REC_CNTR', 0, 5],
['REC_TYPE', 5, 4],
['BILL_NO', 9, 14],
['BILL_TAX', 23, 8],
['BILL_DUE_DATE', 31, 14])
# Sample Test Data
samplelines = []
line1 = "00001000A20181022342320Mr.ABC DEF VAL 234.34ADDRESS LINE1 Contains some address data "
line2 = "00002000BISSNO-CF123 0023.3420180722"
line3 = "00003000BISSNO-CF124 12327.3420180810"
samplelines.append(line1)
samplelines.append(line2)
samplelines.append(line3)
# Output in this Array
line_array_dic_map = []
# Reading lines and converting into Hash Map
# Repalce this with file reading
itr = 0
while itr < len(samplelines):
str = samplelines[itr]
# Always Record Type definition for any record will be between position 5 and 9
if str[5:9] in _master_rec_defn:
field_defn = _master_rec_defn[str[5:9]]
for fld_attr in field_defn:
dicout = {}
dicout[fld_attr[0]] = str[int(fld_attr[1]):int(fld_attr[1]+fld_attr[2])]
line_array_dic_map.append(dicout)
itr += 1
print(line_array_dic_map)
# Run the Function
get_pattern_slicer()
Posts: 8,154
Threads: 160
Joined: Sep 2016
Aug-10-2018, 07:01 PM
(This post was last modified: Aug-10-2018, 07:01 PM by buran.)
yes, you can do mapping. however it's the same technique - slicing
def parse_lines(lines):
# Create a Master mapping keys for all record Type
# Maintain a Structure of positions and split in a definition configuration yaml
# [Field Name, String Start Position, Characters to split from position]
mapping = {'000A':{'REC_CNTR':(0, 5), 'REC_TYPE':(5, 4),
'DATETIME':(9, 14), 'NAME':(23, 14),
'AMOUNT':(37, 8), 'ADDRESS_LINE_1':(45, 75)},
'000B':{'REC_CNTR':(0, 5), 'REC_TYPE':(5, 4),
'BILL_NO':(9, 14), 'BILL_TAX':(23, 8),
'BILL_DUE_DATE':(31, 14)}}
records = []
for line in lines:
rec_type = line[5:9]
record = {key:line[start:start+length] for key, (start, length) in mapping[rec_type].items()}
records.append(record)
return records
# Sample Test Data
sample_lines = ["00001000A20181022342320Mr.ABC DEF VAL 234.34ADDRESS LINE1",
"00002000BISSNO-CF123 0023.3420180722",
"00003000BISSNO-CF124 12327.3420180810"]
print(parse_lines(lines=sample_lines)) Output: [{'REC_TYPE': '000A', 'AMOUNT': ' 234.34', 'ADDRESS_LINE_1': 'ADDRESS LINE1', 'NAME': 'Mr.ABC DEF VAL', 'REC_CNTR': '00001', 'DATETIME': '20181022342320'}, {'BILL_DUE_DATE': '20180722', 'REC_CNTR': '00002', 'BILL_NO': 'ISSNO-CF123 ', 'REC_TYPE': '000B', 'BILL_TAX': ' 0023.34'}, {'BILL_DUE_DATE': '20180810', 'REC_CNTR': '00003', 'BILL_NO': 'ISSNO-CF124 ', 'REC_TYPE': '000B', 'BILL_TAX': '12327.34'}]
you can use OrderedDict from collections or namedtupple for each field for more readability, you can go OOP, etc.
Posts: 8,154
Threads: 160
Joined: Sep 2016
Aug-10-2018, 07:38 PM
(This post was last modified: Aug-10-2018, 07:39 PM by buran.)
for two more, different approaches, one using struct and one using itertools, see this
https://stackoverflow.com/a/4915359/4046632
Posts: 8,154
Threads: 160
Joined: Sep 2016
from collections import namedtuple
def parse(line):
Field = namedtuple('Field', field_names=('name', 'start', 'length'))
mapping = {'000A':(Field('REC_CNTR', 0, 5), Field('REC_TYPE', 5, 4),
Field('DATETIME', 9, 14), Field('NAME', 23, 14),
Field('AMOUNT', 37, 8), Field('ADDRESS_LINE_1', 45, 75)),
'000B':(Field('REC_CNTR', 0, 5), Field('REC_TYPE', 5, 4),
Field('BILL_NO', 9, 14), Field('BILL_TAX', 23, 8),
Field('BILL_DUE_DATE', 31, 14))}
rec_type = line[5:9]
# rec_field_names = (fld.name for fld in mapping[rec_type])
# Record = namedtuple("Record", field_names=rec_field_names)
# return Record(**{fld.name:line[fld.start:fld.start+fld.length] for fld in mapping[rec_type]})
return {fld.name:line[fld.start:fld.start+fld.length] for fld in mapping[rec_type]}
# Sample Test Data
sample_lines = ["00001000A20181022342320Mr.ABC DEF VAL 234.34ADDRESS LINE1",
"00002000BISSNO-CF123 0023.3420180722",
"00003000BISSNO-CF124 12327.3420180810"]
data = [parse(line) for line in sample_lines]
for rec in data:
print(rec) this will output
Output: {'AMOUNT': ' 234.34', 'REC_CNTR': '00001', 'DATETIME': '20181022342320', 'ADDRESS_LINE_1': 'ADDRESS LINE1', 'NAME': 'Mr.ABC DEF VAL', 'REC_TYPE': '000A'}
{'BILL_NO': 'ISSNO-CF123 ', 'BILL_DUE_DATE': '20180722', 'BILL_TAX': ' 0023.34', 'REC_CNTR': '00002', 'REC_TYPE': '000B'}
{'BILL_NO': 'ISSNO-CF124 ', 'BILL_DUE_DATE': '20180810', 'BILL_TAX': '12327.34', 'REC_CNTR': '00003', 'REC_TYPE': '000B'}
if you uncomment lines #11-#13 and comment out line#14 you will go one step further and use namedtuple also for record, not only fields
Output: Record(REC_CNTR='00001', REC_TYPE='000A', DATETIME='20181022342320', NAME='Mr.ABC DEF VAL', AMOUNT=' 234.34', ADDRESS_LINE_1='ADDRESS LINE1')
Record(REC_CNTR='00002', REC_TYPE='000B', BILL_NO='ISSNO-CF123 ', BILL_TAX=' 0023.34', BILL_DUE_DATE='20180722')
Record(REC_CNTR='00003', REC_TYPE='000B', BILL_NO='ISSNO-CF124 ', BILL_TAX='12327.34', BILL_DUE_DATE='20180810')
Posts: 8
Threads: 3
Joined: Aug 2018
Perfect !!! Thanks a lot Buran
I was looking exactly that encode and decode format string as its done in Perl
Struct is the fastest in time processing so will take that route.
|