Correct data structure for this problem

Wigi · (This post was last modified: Oct-06-2020, 09:18 PM by Wigi.)

Hi,

I need to parse the lines of the source file in a simple loop, splitting each line on the asterisk.
The first part determines segments (there's an opening tag and a closing tag).
Within each segment, the first part of the line has an ID to denote a different block of lines that belong together (the same transaction).

The prefix in the output file is simply built from some parts of the split line text. The asterisk splits up the line and we need to grab the 3rd or 4th item, something like this. That is not the difficult part.

The difficult part is identifying the different segments, then the blocks within a segment. I have it all working in VBA, but now I am looking at performance and speed :-) that's why I'm here with Python. The only thing I was unsure of is how to store the information on line numbers such that I can use it in a second pass to form the output text file.

**buran** · (This post was last modified: Oct-06-2020, 09:26 PM by buran.)

Python offers different collections such as list, dict, tuple, namedtuple, etc.
The point is that, because we don't know the specification of the file (and your very general explanation does not help much), we can hardly help with anything more concrete.
Can you at least tell us where the data comes from or point to a specification?

Wigi · (This post was last modified: Oct-06-2020, 09:55 PM by Wigi.)

Understood. I will provide more details by showing an anonymized version of the source file.

- A segment starts with ISA* and ends with IEA*. The excerpt below therefore contains 2 segments. The real file can contain between 1 and 100 segments.
- The CLP lines mark a payment within a segment. So segment 1 has 6 payments, segment 2 has 3 payments. The real file can contain between 1 and 2,000 payments per segment.
- The output text file is equal to the source file, but with a prefix in front of the existing lines within each segment. The prefix is determined by the contents of the segment: we need to look for certain codes and grab some of the characters on those lines, to be used as a prefix on other lines.

My approach in VBA was to have a 2D array. Columns (dynamic) for the segments, and within each segment, I have rows for the line numbers where certain ID codes are found. 10 fields are fixed (the info in yellow in the screenshot), some are variable based on how many CLP payments we encounter within the segment.

So in Python I need an object to hold a dynamic number of "columns" (segments), each with a dynamic number of "rows" (payments). Is this a list of lists, maybe? Do I need arrays?

ISA*00* *00* *ZZ*000000006
GS*HP*00000000021*1982611190*20200908*0544*1
TS3*1982611190*11*20201231*6*262918~
TS2*12805.44*12805.44****512.17*****4*11*11*
CLP*2000000001I73748370*22*-102890*-14262.35
CAS*CO*45*-87219.65~
QTY*CA*-15~
CLP*2000000002I73908052*1*106041*14262.35*14
CAS*CO*45*90370.65~
CAS*PR*1*1408~
AMT*AU*106041~
QTY*CA*15~
CLP*2000000003I1798623*1*73390*11909.61*1408
CAS*CO*45*60072.39~
AMT*AU*73390~
QTY*CA*11~
CLP*2000000004I73906905*4*182611*0**MA*22004
CAS*CO*16*182611*45~
NM1*QC*1*MOUSE*MICKEY****MI*6H00A18PX83~
N1*PE*MY MEDICAL CENT*XX*1200641514~
N3*PO BOX 125000-7400~
N4*PHILADELPHIA*PA*191957400~
REF*PQ*1200641514~
REF*TJ*112241326~
LX*111712~
TS3*1285641514*11*20171231*1*63523.78~
TS2*22945.58*15114.47***1521.94*****5*1**5**
CLP*2000000009I73719738*4*63523.78*0*1316*MA
CAS*CO*29*62207.78~
CAS*PR*1*1316~
NM1*QC*1*MOUSE*MINNIE****MI*2001A83HA09~
LX*112012~
TS3*1285641514*11*20201231*160*10759657.82~
TS2*1992134.24*1361024.14**86068.29*139133.
CLP*2000000005I73812607*1*55402.84*3072.54*
CAS*CO*74*52330.3~
NM1*QC*1*DUCK*DONALD****MI*4W000N2EP80~
GE*1*13754195~
IEA*1*005440685~
ISA*00* *00* *ZZ*0000000
GS*HP*00000000021*1982611190*20200908*0544
TS3*1982611190*11*20201231*6*262918~
TS2*12805.44*12805.44****512.17*****4*11*1
CAS*CO*45*60072.39~
AMT*AU*73390~
QTY*CA*11~
CLP*2000000004I73906905*4*192611*0**MA*220
CAS*CO*16*182611*45~
NM1*QC*1*MOUSE*MICKEY****MI*6H00A18PX83~
N1*PE*MY MEDICAL CENT*XX*1200641514~
N3*PO BOX 125000-7400~
N4*PHILADELPHIA*PA*191957400~
REF*PQ*1200641514~
REF*TJ*112241326~
LX*111712~
TS3*1285641514*11*20171231*1*63523.78~
TS2*22945.58*15114.47***1521.94*****5*1**5***2.2027*1209.02**102.89~
CLP*2000000009I73719738*4*63523.78*0*1316*MA*22002700746204NYA*11*1**286*2.2027*1~
CAS*CO*29*62207.78~
CAS*PR*1*1316~
NM1*QC*1*MOUSE*MINNIE****MI*2001A83HA09~
LX*112012~
TS3*1285641514*11*20201231*160*10759657.82~
TS2*1992134.24*1361024.14**86068.29*139133.04*310147.09**2468.34*41514.58*3*134*647*647***1.2445*107586.27**9963.87~
CLP*2000000005I73812607*1*55402.84*3072.54**MA*22000000176404NYA*11*1**470*1.9684*.952~
CAS*CO*74*52330.3~
NM1*QC*1*DUCK*DONALD****MI*4W000N2EP80~
GE*1*13754195~
IEA*1*005440685~
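
One way to hold a dynamic number of segments, each with a dynamic number of payments, is a list with one dict per segment, where each dict stores line numbers. The following is only a sketch under assumptions: the function name `index_segments`, the keys `start`, `end` and `payments`, and the shortened sample lines are made up for illustration and are not taken from the real file.

```python
def index_segments(lines):
    """First pass: record where each ISA...IEA segment and each CLP payment starts."""
    segments = []          # one dict per ISA...IEA segment (the "columns")
    current = None         # the segment that is currently open, if any
    for lineno, line in enumerate(lines):
        tag = line.split('*', 1)[0]
        if tag == 'ISA':
            current = {'start': lineno, 'end': None, 'payments': []}
        elif current is None:
            continue       # ignore anything outside a segment
        elif tag == 'CLP':
            current['payments'].append(lineno)   # the "rows"
        elif tag == 'IEA':
            current['end'] = lineno
            segments.append(current)
            current = None
    return segments


# Hypothetical, shortened lines that only mimic the layout of the excerpt.
sample = [
    'ISA*00*ZZ',
    'CLP*A*22',
    'CAS*CO*45',
    'CLP*B*1',
    'IEA*1*1',
    'ISA*00*ZZ',
    'CLP*C*4',
    'IEA*1*2',
]
index = index_segments(sample)
for number, seg in enumerate(index, start=1):
    print(number, seg['start'], seg['end'], seg['payments'])
```

A second pass could then walk `index` and use the stored line numbers to decide which prefix belongs in front of which lines. Neither list needs a size declared up front, so no array type is required for this.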

**buran** · Oct-09-2020, 11:09 AM

This is the third time I have started writing this (due to our problems with the site) and I lost 2 long drafts, so now I am pissed off and this time my post will be as short as possible.
This is X12 EDI format, a HIPAA 835 file to be precise. I don't know why you were reluctant to say so from the start or when I asked.
I looked for specifications online, but it's hard to obtain one for free. There are different companion guides available, but they are not exhaustive and, at the same time, company specific. I found this one most useful: https://passporthealthplan.com/wp-conten...-guide.pdf
It's still outdated, e.g. the CLP segment they show has only 6 elements, while you have more elements in your CLP segments.
I am sure you know all this, but I say it for the benefit of the others.
I also found a sample file here: https://www.emedny.org/HIPAA/5010/5010_s...index.aspx and downloaded the 835 Sample (Professional Claims Only- With Payment) file and saved it as sample835.txt
Now I will work with it.

Output:
ISA*00*          *00*          *ZZ*EMEDNYBAT      *ZZ*ETIN           *100101*1000*^*00501*006000600*0*T*:~GS*HP*EMEDNYBAT*ETIN*20100101*1050*6000600*X*005010X221A1~ST*835*1740~BPR*I*45.75*C*ACH*CCP*01*111*DA*33*1234567890**01*111*DA*22*20100101~TRN*1*10100000000*1000000000~REF*EV*ETIN~DTM*405*20100101~N1*PR*NYSDOH~N3*OFFICE OF HEALTH INSURANCE PROGRAMS*CORNING TOWER, EMPIRE STATE PLAZA~N4*ALBANY*NY*122370080~PER*BL*PROVIDER SERVICES*TE*8003439000*UR*www.emedny.org~N1*PE*MAJOR MEDICAL PROVIDER*XX*9999999995~REF*TJ*000000000~LX*1~CLP*PATIENT ACCOUNT NUMBER*1*34.25*34.25**MC*1000210000000030*11~NM1*QC*1*SUBMITTED LAST*SUBMITTED FIRST****MI*LL99999L~NM1*74*1*CORRECTED LAST*CORRECTED FIRST~REF*EA*PATIENT ACCOUNT NUMBER~DTM*232*20100101~DTM*233*20100101~AMT*AU*34.25~SVC*HC:V2020:RB*6*6**1~DTM*472*20100101~AMT*B6*6~SVC*HC:V2700:RB*2.75*2.75**1~DTM*472*20100101~AMT*B6*2.75~SVC*HC:V2103:RB*5.5*5.5**1~DTM*472*20100101~AMT*B6*5.5~SVC*HC:S0580*20*20**2~DTM*472*20100101~AMT*B6*20~CLP*PATIENT ACCOUNT NUMBER*2*34*0**MC*1000220000000020*11~NM1*QC*1*SUBMITTED LAST*SUBMITTED FIRST****MI*LL88888L~NM1*74*1*CORRECTED LAST*CORRECTED FIRST~REF*EA*PATIENT ACCOUNT NUMBER~DTM*232*20100101~DTM*233*20100101~SVC*HC:V2020*12*0**0~DTM*472*20100101~CAS*CO*29*12~SVC*HC:V2103*22*0**0~DTM*472*20100101~CAS*CO*29*22~CLP*PATIENT ACCOUNT NUMBER*2*34.25*11.5**MC*1000230000000020*11~NM1*QC*1*SUBMITTED LAST*SUBMITTED FIRST****MI*LL77777L~NM1*74*1*CORRECTED LAST*CORRECTED FIRST~REF*EA*PATIENT ACCOUNT NUMBER~DTM*232*20100101~DTM*233*20100101~AMT*AU*11.5~SVC*HC:V2020:RB*6*6**1~DTM*472*20100101~AMT*B6*6~SVC*HC:V2103:RB*5.5*5.5**1~DTM*472*20130917~AMT*B6*5.5~SVC*HC:V2700:RB*2.75*0**0~DTM*472*20100101~CAS*CO*251*2.75~LQ*HE*N206~SVC*HC:S0580*20*0**0~DTM*472*20100101~CAS*CO*251*20~LQ*HE*N206~SE*65*1740~GE*1*6000600~IEA*1*006000600~

My point is that you will have a deeply nested structure: File -> Interchange(s) -> Functional group(s) -> Transaction set(s) -> Loop(s) (I may be wrong about some of these, but anyway), and at each nesting level you can use either a built-in container such as list, dict, tuple or namedtuple, or write your own class.
Which one you choose depends on you: what you plan to do, whether you want to validate the data, whether you plan to expand it, and so on.
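
As a rough illustration of the "write your own class" option, the nesting could be modelled with dataclasses. This is a sketch only: the class names, the attribute names and the `build` function are assumptions made for the example, and it expects well-formed input in which every CLP sits inside ISA, GS and ST envelopes (as in the eMedNY sample); it does no validation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:                      # one CLP segment plus the segments that follow it
    clp: List[str]
    segments: List[List[str]] = field(default_factory=list)


@dataclass
class TransactionSet:             # ST ... SE
    st: List[str]
    claims: List[Claim] = field(default_factory=list)


@dataclass
class FunctionalGroup:            # GS ... GE
    gs: List[str]
    transaction_sets: List[TransactionSet] = field(default_factory=list)


@dataclass
class Interchange:                # ISA ... IEA
    isa: List[str]
    groups: List[FunctionalGroup] = field(default_factory=list)


def build(raw, segment_sep='~', element_sep='*'):
    """Group segments into Interchange -> FunctionalGroup -> TransactionSet -> Claim."""
    interchanges = []
    interchange = group = tset = claim = None
    for segment in raw.split(segment_sep):
        elements = segment.strip().split(element_sep)
        tag = elements[0]
        if not tag:
            continue                       # skip the empty piece after the last separator
        if tag == 'ISA':
            interchange = Interchange(elements)
            interchanges.append(interchange)
        elif tag == 'GS':
            group = FunctionalGroup(elements)
            interchange.groups.append(group)
        elif tag == 'ST':
            tset = TransactionSet(elements)
            group.transaction_sets.append(tset)
            claim = None
        elif tag == 'CLP':
            claim = Claim(elements)
            tset.claims.append(claim)
        elif tag in ('SE', 'GE', 'IEA'):
            claim = None                   # a closing envelope ends the current claim
        elif claim is not None:
            claim.segments.append(elements)
    return interchanges


# Hypothetical, heavily shortened input just to exercise the structure.
raw = ('ISA*00*X~GS*HP*A~ST*835*1~CLP*P1*1*34.25*34.25~NM1*QC*1~'
       'CLP*P2*2*34*0~CAS*CO*29*12~SE*5*1~GE*1*1~IEA*1*1~')
result = build(raw)
for c in result[0].groups[0].transaction_sets[0].claims:
    print(c.clp[1], len(c.segments))
```

Compared with plain lists and dicts, classes like these give each level a name and a place to add validation later, at the cost of more code up front.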

To start, a very basic example:

import pprint
line_sep = '~'
element_sep = '*'
with open(r'.\835\sample835.txt') as f:
    x12 = f.read()

x12 = x12.split(line_sep)
message = []
for segment in x12:
    if segment.startswith('ISA'):
        isa = {} # create empty dict
        isa['ISA'] = segment.split(element_sep)
        isa['payments'] = []
    elif segment.startswith('CLP'):
        payment = segment.split(element_sep)
        isa['payments'].append(payment)
    elif segment.startswith('IEA'):
        message.append(isa)
pprint.pprint(message)

Output:
[{'ISA': ['ISA',
          '00',
          '          ',
          '00',
          '          ',
          'ZZ',
          'EMEDNYBAT      ',
          'ZZ',
          'ETIN           ',
          '100101',
          '1000',
          '^',
          '00501',
          '006000600',
          '0',
          'T',
          ':'],
  'payments': [['CLP',
                'PATIENT ACCOUNT NUMBER',
                '1',
                '34.25',
                '34.25',
                '',
                'MC',
                '1000210000000030',
                '11'],
               ['CLP',
                'PATIENT ACCOUNT NUMBER',
                '2',
                '34',
                '0',
                '',
                'MC',
                '1000220000000020',
                '11'],
               ['CLP',
                'PATIENT ACCOUNT NUMBER',
                '2',
                '34.25',
                '11.5',
                '',
                'MC',
                '1000230000000020',
                '11']]}]

As you can see, the result is a list (to allow multiple interchange blocks) in which each interchange is a dict; the value for the key "payments" is again a list, holding multiple lists, and so on.
I work with just ISA and CLP segments, but I guess you will need to work on other segments/loops too.
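
The separators do not have to be hard-coded. Assuming the file starts with the usual fixed-width X12 interchange header (106 characters including the terminator, which is the layout of the ISA segment in the sample file above), they can be read from fixed positions in it. The function name `read_delimiters` is made up for this sketch, and the header is rebuilt from its fields here only so that the padding is explicit.

```python
def read_delimiters(raw):
    """Return (element_sep, component_sep, segment_sep) taken from the ISA header."""
    if not raw.startswith('ISA') or len(raw) < 106:
        raise ValueError('not a fixed-width ISA header')
    # Character 4 separates elements; the last two characters of the
    # 106-character header are the component separator and the segment terminator.
    return raw[3], raw[104], raw[105]


# The ISA fields of the sample file, padded to their fixed widths.
fields = ['ISA', '00', ' ' * 10, '00', ' ' * 10, 'ZZ', 'EMEDNYBAT'.ljust(15),
          'ZZ', 'ETIN'.ljust(15), '100101', '1000', '^', '00501',
          '006000600', '0', 'T', ':']
header = '*'.join(fields) + '~'

element_sep, component_sep, line_sep = read_delimiters(header + 'GS*HP*EMEDNYBAT')
print(repr(element_sep), repr(component_sep), repr(line_sep))
```

If the real files always use `*` and `~`, hard-coding them as in the example above is simpler and perfectly adequate.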

From here you can expand, e.g. replace the list for each segment with a namedtuple:

from collections import namedtuple
import pprint

ISA = namedtuple('ISA', ['identifier', 'authorization_information_qualifier', 'authorization_information',
                         'security_information_qualifier', 'security_information', 'interchange_id_qualifier_isa5',
                         'interchange_sender_id', 'interchange_id_qualifier_isa7', 'interchange_receiver_id',
                         'interchange_date', 'interchange_time', 'interchange_control_standards',
                         'interchange_control_version_number', 'interchange_control_number',
                         'acknowledgement_requested', 'usage_indicator', 'component_element_separator'],
                 defaults=(None,) * 16 + ('>',))

CLP = namedtuple('CLP', ['identifier', 'patient_control_number', 'claim_status_code', 'total_claim_charge_amount',
                         'claim_payment_amount', 'claim_filing_indicator_code_', 'payer_claim_control_number', 'clp07', 'clp08'])


line_sep = '~'
element_sep = '*'
with open(r'.\835\sample835.txt') as f:
    x12 = f.read()

x12 = x12.split(line_sep)

message = []
for segment in x12:
    if segment.startswith('ISA'):
        isa = {} # create empty dict
        isa['ISA'] = ISA(*segment.split(element_sep))
        isa['payments'] = []
    elif segment.startswith('CLP'):
        payment = CLP(*segment.split(element_sep))
        isa['payments'].append(payment)
    elif segment.startswith('IEA'):
        message.append(isa)

pprint.pprint(message)
print('\n')
for isa in message:
    for payment in isa['payments']:
        print(f'Claim payment amount: {payment.claim_payment_amount}')

Output:
[{'ISA': ISA(identifier='ISA', authorization_information_qualifier='00', authorization_information='          ', security_information_qualifier='00', security_information='          ', interchange_id_qualifier_isa5='ZZ', interchange_sender_id='EMEDNYBAT      ', interchange_id_qualifier_isa7='ZZ', interchange_receiver_id='ETIN           ', interchange_date='100101', interchange_time='1000', interchange_control_standards='^', interchange_control_version_number='00501', interchange_control_number='006000600', acknowledgement_requested='0', usage_indicator='T', component_element_separator=':'),
  'payments': [CLP(identifier='CLP', patient_control_number='PATIENT ACCOUNT NUMBER', claim_status_code='1', total_claim_charge_amount='34.25', claim_payment_amount='34.25', claim_filing_indicator_code_='', payer_claim_control_number='MC', clp07='1000210000000030', clp08='11'),
               CLP(identifier='CLP', patient_control_number='PATIENT ACCOUNT NUMBER', claim_status_code='2', total_claim_charge_amount='34', claim_payment_amount='0', claim_filing_indicator_code_='', payer_claim_control_number='MC', clp07='1000220000000020', clp08='11'),
               CLP(identifier='CLP', patient_control_number='PATIENT ACCOUNT NUMBER', claim_status_code='2', total_claim_charge_amount='34.25', claim_payment_amount='11.5', claim_filing_indicator_code_='', payer_claim_control_number='MC', clp07='1000230000000020', clp08='11')]}]


Claim payment amount: 34.25
Claim payment amount: 0
Claim payment amount: 11.5
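
One caveat: `CLP(*segment.split(element_sep))` only works when a segment has exactly as many elements as the namedtuple has fields, and the CLP lines in the excerpt posted earlier carry more elements than the sample file does. A possible workaround, sketched here with a hypothetical helper `make_record` and a trimmed-down five-field `CLP`, is to cut or pad the element list before building the tuple:

```python
from collections import namedtuple

# Trimmed-down CLP with only the first five fields, for illustration.
CLP = namedtuple('CLP', ['identifier', 'patient_control_number', 'claim_status_code',
                         'total_claim_charge_amount', 'claim_payment_amount'])


def make_record(record_type, elements):
    """Build a namedtuple from a segment, dropping extra elements and padding with None."""
    width = len(record_type._fields)
    padded = list(elements[:width]) + [None] * (width - len(elements))
    return record_type(*padded)


# A longer CLP line, as in the posted excerpt: the extra elements are ignored.
long_clp = make_record(CLP, 'CLP*2000000009I73719738*4*63523.78*0*1316*MA'.split('*'))
# A shorter one: the missing fields become None.
short_clp = make_record(CLP, ['CLP', 'A', '1'])

print(long_clp.claim_payment_amount)
print(short_clp)
```

Silently dropping elements is a design choice that suits quick exploration; for production use it may be better to list every field or to raise an error on unexpected lengths.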

In addition to the above, which is my own code, I found this: https://hiplab.mc.vanderbilt.edu/git/lab/parse-edi
It's not great in terms of Python code quality, ease of installation, etc. I tried to run it but was not very successful with the sample file. Anyway, it may be useful and give you some additional hints if you decide to look at it further.

That's it for now. I apologise if I happened to use incorrect terminology here and there.
