Python Forum
Correct data structure for this problem
#1
Hi all !

New to the forum and rather new to Python scripting, but not new to programming in general. I am looking for advice on which data structure to use for the data described below. Looking at the attached picture, I have a structure that looks like an array in Excel VBA. The number of rows and columns needs to be dynamic. Each column always starts with 10 fixed entries (shown in yellow), followed by a variable part in which each entry consists of a number and a string.

I coded this in Excel VBA as a proof of concept, but it's painfully slow. The question: how do I set this up with Python objects? Is this feasible with a list? An array? A dictionary? Or nested objects? There are at most 100 columns, and at most 4000 rows per column.

The source of the data is a text file. Each line will have some information stored in the object. After that, I need to loop over the object to output to a new text file.

Thanks a lot !

[Image: 01.png]
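
One way to picture the structure described in this post is a list of small objects, one per column, each holding the 10 fixed entries plus a growing list of (number, string) pairs. This is only a minimal sketch under that reading of the post; the names `Column`, `fixed`, `variable` and `add_entry` are made up for illustration, and the values stored below are placeholders.

```python
from dataclasses import dataclass, field

FIXED_COUNT = 10  # each column starts with 10 fixed entries, per the post


@dataclass
class Column:
    """One spreadsheet-style column: 10 fixed entries plus a variable tail."""
    # The fixed part is pre-sized; entries are filled in as they become known.
    fixed: list = field(default_factory=lambda: [None] * FIXED_COUNT)
    # The variable part grows freely; each entry is a (number, text) pair.
    variable: list = field(default_factory=list)

    def add_entry(self, number, text):
        self.variable.append((number, text))


# A plain list of columns is dynamic in both directions.
columns = []

col = Column()
col.fixed[0] = "some fixed value"      # placeholder value, not from the thread
col.add_entry(17, "some text")         # placeholder (number, string) pair
columns.append(col)

# Looping over the "cells" of the variable part of every column.
for index, column in enumerate(columns):
    for number, text in column.variable:
        print(index, number, text)
```

With at most 100 columns and 4000 rows per column, plain lists like these are small by Python standards, so the choice can be driven by readability rather than speed.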
Reply
#2
I would recommend a pandas DataFrame. It is close to a spreadsheet and has tons of capabilities.
Here is a pretty good tutorial, and here is a YouTube video from PyCon 2015 if you prefer that format.
Reply
#3
Hi,

Thanks Jef. So you would still recommend that, given the following:
- the source file is a text file (with 45,000 lines)
- the target file is a text file (also with 45,000 lines), but each line is made longer by a prefix or suffix
- while processing the file, I need to determine that prefix or suffix
- the data can have multiple segments (chunks of text), which I store as columns
- as rows, I store a fixed part of 10 items per segment, as well as a variable part where each entry is a number and a text

Looping over those "cells" will then allow me to know what prefix or suffix I need.

If this is best approached with a pandas dataframe, I will look into it.
Thanks a lot ! I will keep you posted.
Reply
#4
I am just wondering: if the input is a text file, why is a screenshot of Excel provided?
Reply
#5
Hi,

The source and target are both text files. The reason for the Excel screenshot is that I coded my solution in Excel VBA first, as this is what I use on a daily basis.
But the code runs for 8 minutes before completing. That's why I am turning to Python to speed it up.
I wanted to show you the structure of my array, so that you could suggest the best object / structure in Python for this problem.
I understood that a pandas DataFrame would be best.
Reply
#6
I would say that the Excel screenshot gives next to no information about the source text file, starting with the separator and continuing with the overall line structure.
Reply
#7
perfringo,

There is no separator in the source text file other than the line break. It's 45,000 lines of text, one below the other.
There are different segments in the data, marked by identifiers. I need to identify the segments. Those will be my "columns".
Within the segments, there are also many blocks of lines that belong together. I need to track these as well.
The Excel file is unrelated to the data source; it's only a dump of my array. Together with the lines of the source text file, my array contains the information needed to generate the output file.
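
The segment / block tracking described here could be sketched as a nested list: a list of segments, each segment a list of blocks, each block a list of lines. Which identifiers open a segment or a block is not stated in the thread, so the two marker tuples below are pure assumptions, and `group_lines` is a hypothetical helper name.

```python
# Assumed identifiers; the real ones depend on the data and are not given here.
SEGMENT_MARKERS = ("ST",)   # assumption: lines starting with this open a segment
BLOCK_MARKERS = ("CLP",)    # assumption: lines starting with this open a block


def group_lines(lines):
    """Group lines into segments, and lines within a segment into blocks.

    Returns a list of segments; each segment is a list of blocks;
    each block is a list of the original lines.
    """
    segments = []
    for line in lines:
        if line.startswith(SEGMENT_MARKERS) or not segments:
            # A segment marker (or the very first line) opens a new segment
            # that starts with one empty block.
            segments.append([[]])
        elif line.startswith(BLOCK_MARKERS):
            # A block marker opens a new block inside the current segment.
            segments[-1].append([])
        segments[-1][-1].append(line)
    return segments


# Tiny made-up input just to show the shape of the result.
demo = ["AAA", "ST one", "x", "CLP a", "y", "CLP b", "ST two", "CLP c"]
grouped = group_lines(demo)
print(len(grouped))      # number of segments
print(grouped[1])        # blocks of the second segment
```

Note that `str.startswith` with a short marker would also match longer identifiers that begin with the same letters, so a real version would compare the full identifier instead.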
Reply
#8
I don't understand exactly what your source looks like, but pandas read_csv() does not require commas as the separator; it actually has a lot of options. See the docs: https://pandas.pydata.org/pandas-docs/st...d_csv.html
Reply
#9
Thanks, I will investigate Pandas dataframe and see where I end up.

These are the first few lines of the source. The first 2 or 3 characters on every line need to be extracted. Some of these IDs define segments, and within a segment, certain IDs mark other regions. I need to grab these, because parts of these lines will become a prefix for other lines.


ISA*00* *00* *ZZ*00000000609 *ZZ*1982611190 *200908*0544*{*00501*005440284*1*P*:~
GS*HP*00000000021*1982611190*20200908*0544*13754194*X*005010X221A1~
ST*835*000001075~
BPR*I*12202.7*C*ACH*CCP*01*081517693*DA*00000152302017024*1351840597**01*021000322*DA*2036371219*20200908~
TRN*1*EFT3378768*1351840597~
DTM*405*20200904~
N1*PR*NATIONAL GOVERNMENT SERVICES #13001~
N3*6325 SECURITY BOULEVARD~
N4*BALTIMORE*MD*21207~
PER*CX*NATIONAL GOVERNMENT SERVICES, INC.*TE*8888554356~
PER*BL*EDI HELP DESK*EM*[email protected]*TE*8772734334*EX*5024232356~
N1*PE*MY MEDICAL CENT*XX*1900611190~
N3*PO BOX 15000-7400~
N4*PHILADELPHIA*PA*191957400~
REF*PQ*1982611190~
REF*TJ*112241326~
LX*112012~
TS3*1982611190*11*20201231*6*262918~
TS2*12805.44*12805.44****512.17*****4*11*11***.198~
CLP*2000000001I73748370*22*-102890*-14262.35**MA*22000100197904NYA*11*8**885*-1~
CAS*CO*45*-87219.65~
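
A single pass over lines like the ones above might look roughly like the sketch below: take the ID in front of the first asterisk, remember one field from the most recent marker line, and put it in front of the lines that follow. The rule used here (third field of the latest `ST` line, joined with `|`) is only an assumption to make the example concrete; the real prefix rules are not given in this thread, and `add_prefixes` is a made-up name.

```python
def add_prefixes(lines, marker_id="ST", field_index=2, joiner="|"):
    """Yield each line, prefixed with a field taken from the latest marker line.

    marker_id, field_index and joiner are illustrative assumptions,
    not rules taken from the thread.
    """
    prefix = ""
    for line in lines:
        text = line.rstrip("\n")
        # Drop the trailing "~" and split on "*" to get the individual fields.
        fields = text.rstrip("~").split("*")
        if fields[0] == marker_id and len(fields) > field_index:
            prefix = fields[field_index]
        # Lines seen before the first marker are passed through unchanged.
        yield f"{prefix}{joiner}{text}" if prefix else text


sample_lines = [
    "GS*HP*00000000021*1982611190*20200908*0544*13754194*X*005010X221A1~",
    "ST*835*000001075~",
    "TRN*1*EFT3378768*1351840597~",
    "DTM*405*20200904~",
]
result = list(add_prefixes(sample_lines))
for out_line in result:
    print(out_line)
```

Because the function only needs an iterable of lines, an open file object can be passed in directly, and the results can be written to the target file line by line without holding all 45,000 lines in memory.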
Reply
#10
Some thoughts:
Looking at the sample data, it looks like the separator is an asterisk.
I don't think pandas will be much help here. You will need to parse the lines yourself.
If I understand correctly, the first element on a line tells you what the rest of the elements on that line mean.
You could probably use a namedtuple or write your own class.

By the way, depending on where the data comes from, there is a chance that a library already exists to help parse it.
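
To illustrate the namedtuple idea, here is a rough sketch that picks a record type based on the first element of the line. The field names for `N4` and `DTM` are guesses made for this example, not taken from any specification quoted in this thread, and `RECORD_TYPES`, `Raw` and `parse_line` are hypothetical names.

```python
from collections import namedtuple

# Illustrative layouts only; the field names are assumptions.
N4 = namedtuple("N4", "city state postal_code")
DTM = namedtuple("DTM", "qualifier date")
# Fallback for IDs without a known layout.
Raw = namedtuple("Raw", "line_id values")

RECORD_TYPES = {"N4": N4, "DTM": DTM}


def parse_line(line):
    """Split one line on "*" and wrap the values in a matching namedtuple."""
    parts = line.strip().rstrip("~").split("*")
    line_id, values = parts[0], parts[1:]
    record_type = RECORD_TYPES.get(line_id)
    # Fall back to Raw when the ID is unknown or the field count differs.
    if record_type is None or len(values) != len(record_type._fields):
        return Raw(line_id, values)
    return record_type(*values)


address = parse_line("N4*BALTIMORE*MD*21207~")
print(address.city, address.state)

unknown = parse_line("LX*112012~")
print(unknown.line_id, unknown.values)
```

The fallback keeps the parser from failing on lines whose layout has not been described yet, so record types can be added one at a time.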
Reply

