Python Forum

Full Version: Correct data structure for this problem
Hi all!

I am new to the forum and rather new to Python scripting, but not new to programming in general. I am looking for advice on a data structure for the problem below. As the attached picture shows, I have a structure that looks like an array in Excel VBA. The number of rows and columns needs to be dynamic. Each column always starts with 10 fixed entries (shown in yellow), followed by a variable part in which each entry consists of a number and a string.

I coded this in Excel VBA as a proof of concept, but it is painfully slow. My question: how do I set this up with Python objects? Is it feasible with a list, an array, a dictionary, or nested objects? There are at most 100 columns and at most 4,000 rows per column.

The source of the data is a text file. Each line contributes some information that is stored in the object. After that, I need to loop over the object to write the output to a new text file.

Thanks a lot!

[Image: 01.png]
I would recommend a pandas DataFrame. It is close to a spreadsheet and comes with tons of capabilities.
Here is a pretty good tutorial, and here is a YouTube talk from PyCon 2015 if you prefer that format.
Hi,

Thanks, Jef. So would you confirm that recommendation given the following?
- the source file is a text file (45,000 lines)
- the target file is a text file (also 45,000 lines), but each line is made longer with a prefix or suffix
- while processing the file, I need to determine that prefix or suffix
- the data can have multiple segments (chunks of text), which I store as columns
- as rows, I store both a fixed part of 10 items per segment and a variable part, where each entry is a number and a text

Looping over those "cells" will then allow me to know what prefix or suffix I need.
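For comparison with the DataFrame suggestion, here is a minimal plain-Python sketch of the structure described above: one object per column (segment) holding 10 fixed entries plus a growing list of (number, text) pairs. The names `Column`, `fixed`, `variable`, and `add`, as well as the sample values, are made up for illustration and are not from the original VBA code.

```python
from dataclasses import dataclass, field

FIXED_SIZE = 10  # each column starts with 10 fixed entries (from the post)


@dataclass
class Column:
    """One 'column' (segment): 10 fixed entries plus (number, text) pairs."""
    fixed: list = field(default_factory=lambda: [None] * FIXED_SIZE)
    variable: list = field(default_factory=list)  # list of (int, str) tuples

    def add(self, number, text):
        """Append one entry to the variable part."""
        self.variable.append((number, text))


columns = []  # grows dynamically, one Column per segment

first = Column()
first.fixed[0] = "ST"    # hypothetical fixed entry
first.add(17, "LX")      # hypothetical (line number, ID) pairs
first.add(20, "CLP")
columns.append(first)

# Looping over the "cells" of every column
for index, column in enumerate(columns):
    for number, text in column.variable:
        print(index, number, text)
```

Lists grow on demand, so neither the number of columns nor the number of rows per column has to be known in advance.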

If this is best approached with a pandas dataframe, I will look into it.
Thanks a lot! I will keep you posted.
I am just wondering: if the input is a text file, why is a screenshot of Excel provided?
Hi,

The source and target are both text files. The reason for the Excel screenshot is that I coded my solution in Excel VBA first, as this is what I work with on a daily basis.
But the code runs for 8 minutes before completing, which is why I am turning to Python to speed it up.
I wanted to show you the structure of my array so that I would be pointed to the best object or structure in Python for this problem.
I understood that a pandas DataFrame would be best.
I would say that the Excel screenshot gives next to no information about the source text file, starting with the separator and continuing with the overall line structure.
perfringo,

The source text file has no separator other than a line break. It is 45,000 lines of text, one below the other.
There are different segments in the data, marked by identifiers. I need to identify the segments. Those will be my "columns".
Within the segments, there are also many blocks of lines that belong together. I need to track these as well.
The Excel file is unrelated to the data source; it is only a dump of my array. My array contains the information needed to generate the output file, together with the lines of the source text file.
I don't understand exactly what your source looks like, but pandas read_csv() does not require commas; it actually has a lot of options. See the docs: https://pandas.pydata.org/pandas-docs/st...d_csv.html
Thanks, I will investigate the pandas DataFrame and see where I end up.

These are the first few lines of the source. The first 2 or 3 characters of every line need to be extracted. Some of these IDs define segments, and within a segment, certain IDs mark other regions. I need to grab these because part of these lines will become a prefix for other lines.


ISA*00* *00* *ZZ*00000000609 *ZZ*1982611190 *200908*0544*{*00501*005440284*1*P*:~
GS*HP*00000000021*1982611190*20200908*0544*13754194*X*005010X221A1~
ST*835*000001075~
BPR*I*12202.7*C*ACH*CCP*01*081517693*DA*00000152302017024*1351840597**01*021000322*DA*2036371219*20200908~
TRN*1*EFT3378768*1351840597~
DTM*405*20200904~
N1*PR*NATIONAL GOVERNMENT SERVICES #13001~
N3*6325 SECURITY BOULEVARD~
N4*BALTIMORE*MD*21207~
PER*CX*NATIONAL GOVERNMENT SERVICES, INC.*TE*8888554356~
PER*BL*EDI HELP DESK*EM*[email protected]*TE*8772734334*EX*5024232356~
N1*PE*MY MEDICAL CENT*XX*1900611190~
N3*PO BOX 15000-7400~
N4*PHILADELPHIA*PA*191957400~
REF*PQ*1982611190~
REF*TJ*112241326~
LX*112012~
TS3*1982611190*11*20201231*6*262918~
TS2*12805.44*12805.44****512.17*****4*11*11***.198~
CLP*2000000001I73748370*22*-102890*-14262.35**MA*22000100197904NYA*11*8**885*-1~
CAS*CO*45*-87219.65~
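As a rough sketch of the processing described above the sample (take the leading 2 or 3 characters as the ID, remember the most recent segment and block lines, and use them as a prefix for the following lines), something like this could work. Which IDs open a segment or a block is an assumption here (`SEGMENT_IDS`, `BLOCK_IDS`), and the `|` joiner and the function names are invented for illustration.

```python
import re

# Leading run of 2 or 3 uppercase letters/digits, e.g. "ST", "N1", "TS3"
ID_PATTERN = re.compile(r"[A-Z0-9]{2,3}")

# Assumption: these sets are placeholders; the real rules are not in the thread.
SEGMENT_IDS = {"ST"}
BLOCK_IDS = {"LX", "CLP"}


def line_id(line):
    """Return the ID at the start of a line, or '' if there is none."""
    match = ID_PATTERN.match(line)
    return match.group(0) if match else ""


def prefix_lines(lines):
    """Yield each line prefixed with the current segment and block lines."""
    segment = block = ""
    for line in lines:
        ident = line_id(line)
        if ident in SEGMENT_IDS:
            # A new segment starts; any previous block no longer applies
            segment, block = line.rstrip("~\n"), ""
        elif ident in BLOCK_IDS:
            block = line.rstrip("~\n")
        yield f"{segment}|{block}|{line}"


sample = [
    "ST*835*000001075~",
    "TRN*1*EFT3378768*1351840597~",
    "LX*112012~",
    "TS3*1982611190*11*20201231*6*262918~",
]
result = list(prefix_lines(sample))
for out in result:
    print(out)
```

Because `prefix_lines` is a generator, it can also be fed an open file object directly, so the source is read once, line by line, without building a large intermediate array.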
Some thoughts:
Looking at the sample data, it looks like the separator is an asterisk.
I don't think pandas will be much help here. You will need to parse the lines yourself.
If I understand correctly, the first field of a line tells you what the rest of the fields on that line mean.
You could probably use a namedtuple or write your own class.
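A minimal sketch of the namedtuple idea, assuming `*` separates the fields and `~` ends each line, as the sample suggests. The names `Segment`, `parse_line`, `CityLine`, `LAYOUTS`, and `typed`, and the field names given to the N4 line, are invented for illustration.

```python
from collections import namedtuple

# Generic record: the ID plus the remaining fields of the line
Segment = namedtuple("Segment", ["ident", "elements"])


def parse_line(line):
    """Split one line on '*' after dropping the trailing '~'."""
    fields = line.strip().rstrip("~").split("*")
    return Segment(ident=fields[0], elements=fields[1:])


# Assumption: field names for N4 are guessed from the sample line only.
CityLine = namedtuple("CityLine", ["city", "state", "postal_code"])
LAYOUTS = {"N4": CityLine}


def typed(segment):
    """Return a named record if a layout is known, else the generic one."""
    layout = LAYOUTS.get(segment.ident)
    return layout(*segment.elements) if layout else segment


seg = parse_line("N4*BALTIMORE*MD*21207~")
print(seg.ident)         # N4
print(typed(seg).city)   # BALTIMORE
```

A layout only fits lines with exactly that number of fields, so IDs whose field count varies would need the generic `Segment` or a small class of their own.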

By the way, depending on where the data comes from, there is a chance that a library already exists to help parse it.