looking for sweeter code to compare parts of a list

Skaperen · (This post was last modified: Jun-18-2017, 05:21 AM by Skaperen.)

i have a big (nearly a million lines) file that is being read in, one line at a time. each line is .split() and there are about 3 dozen tokens for each line. several tokens are checked to select which lines are to be used. the check is an equality check for a few different tokens. the tokens being checked are not contiguous, such as 4, 5, 9, 10, and 12 are checked. i am doing these checks with a big long if statement with lots of ands. i am wondering if there is any sweeter way to code this kind of thing. a program i am working on today is doing a lot of this kind of thing, from lots of cloud data i have.

    datadict = {}
    for line in sys.stdin:
        tokens = line.rstrip().split()
        if tokens[4] == 'foo' and\
           tokens[5] == 'bar' and\
           tokens[9] == 'xyzzy' and\
           tokens[10] == 'yzzyx' and\
           tokens[12] == 'Skaperen':
            processed += 1
            datadict[tokens[0]] = (tokens[1], tokens[2], tokens[3])
        elif tokens[4] == 'bar' and\
             tokens[5] == 'foo' and\
             tokens[9] == 'yzzyx' and\
             tokens[11] == 'xyzzy' and\
             tokens[12] == 'Skapare':
            processed += 1
            datadict[tokens[0]] = (tokens[2], tokens[1], tokens[3])
        else:
            skipped += 1

comparing long slices is not practical since the data to be checked is on either side of data that can vary.
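
One possible way to compare non-contiguous positions without slices is `operator.itemgetter`, which pulls several indexes at once so the result can be compared to a tuple in a single expression. This is only a minimal sketch using the example values from the code above; the names `pick_a`, `WANT_A`, `pick_b`, `WANT_B`, and `select` are made up for illustration.

```python
from operator import itemgetter

# Hypothetical names; the indexes and values come from the example above.
pick_a = itemgetter(4, 5, 9, 10, 12)
WANT_A = ('foo', 'bar', 'xyzzy', 'yzzyx', 'Skaperen')
pick_b = itemgetter(4, 5, 9, 11, 12)
WANT_B = ('bar', 'foo', 'yzzyx', 'xyzzy', 'Skapare')


def select(tokens):
    """Return the value tuple for a matching line, or None if no rule matches."""
    # itemgetter with several indexes returns a tuple of those items.
    if pick_a(tokens) == WANT_A:
        return (tokens[1], tokens[2], tokens[3])
    if pick_b(tokens) == WANT_B:
        return (tokens[2], tokens[1], tokens[3])
    return None
```

As with the if/elif version, a line with fewer than 13 tokens would raise `IndexError` here.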

**buran** · Jun-18-2017, 06:08 AM

in line 7, you check item with index=10 and on line 14 - the one with index 11, is that really so?

Skaperen · Jun-18-2017, 07:46 AM

(Jun-18-2017, 06:08 AM)buran Wrote: in line 7, you check item with index=10 and on line 14 - the one with index 11, is that really so?

the posted code is an example, not a real case. but it is much like a real case in that some selection rules do involve testing different tokens. the test of tokens[10] vs. the test of tokens[11], while not a real case, was intended to show that test cases can vary like that.

**buran** · Jun-18-2017, 07:49 AM

(Jun-18-2017, 07:46 AM)Skaperen Wrote: but it is much like a real case in that some selection rules do involve testing different tokens.

yeah, that was my question...

**buran** · (This post was last modified: Jun-18-2017, 08:21 AM by buran.)

maybe something like this

    datadict = {}
    options = [({4:'foo', 5:'bar', 9:'xyzzy', 10:'yzzyx', 14:'Skaperen'}, (1, 2, 3)),
               ({4:'bar', 5:'foo', 9:'yzzyx', 11:'xyzzy', 14:'Skaperen'}, (2, 1, 3))]
    for line in sys.stdin:
        tokens = line.rstrip().split()
        for opt, get_tokens in options:
            if all(tokens[k] == value for k, value in opt.items()):
                processed += 1
                datadict[tokens[0]] = tuple((tokens[i] for i in get_tokens))
                break
            else:
                skipped += 1

Note I assume the initial values of processed and skipped are set before the snippet you provided. Not sure if you really need processed (you can always check the len of datadict). Maybe if you want to compare processed + skipped to the total number of records to process?
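
A small illustration of the counter question, using made-up data: `processed` counts matching lines, while `len(datadict)` counts distinct keys, so the two agree only if `tokens[0]` never repeats. Whether keys repeat in the real data is not stated in this thread, so treat this as a sketch of the difference rather than a description of the actual file.

```python
# Made-up sample: three matching lines, two of which share the same key.
matching_lines = [
    ['key1', 'a', 'b', 'c'],
    ['key2', 'd', 'e', 'f'],
    ['key1', 'g', 'h', 'i'],  # same key as the first line, so it overwrites it
]

datadict = {}
processed = 0
for tokens in matching_lines:
    processed += 1
    datadict[tokens[0]] = (tokens[1], tokens[2], tokens[3])

# processed counts lines; len(datadict) counts distinct keys.
same_count = processed == len(datadict)
```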

Skaperen · Jun-19-2017, 02:54 AM

i was thinking of making a function for each type of test, then applying them in the order of most likely first. that way the details of each type of test are kept away from the loop going through all the lines. turns out performance is not very bad. 600000 records only takes a few seconds, and it would still be fine if it took a few minutes (it will ultimately be run at most once an hour).
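
The one-function-per-test idea could look roughly like the following. This is a sketch under assumptions, not the real code: the rule bodies reuse the example values from the first post, the names `rule_a`, `rule_b`, `RULES`, and `first_match` are invented here, and which rule is "most likely" is simply assumed.

```python
def rule_a(tokens):
    """Hypothetical rule: return the values to store, or None if no match."""
    if (tokens[4] == 'foo' and tokens[5] == 'bar' and tokens[9] == 'xyzzy'
            and tokens[10] == 'yzzyx' and tokens[12] == 'Skaperen'):
        return (tokens[1], tokens[2], tokens[3])
    return None


def rule_b(tokens):
    """Hypothetical rule that tests tokens[11] instead of tokens[10]."""
    if (tokens[4] == 'bar' and tokens[5] == 'foo' and tokens[9] == 'yzzyx'
            and tokens[11] == 'xyzzy' and tokens[12] == 'Skapare'):
        return (tokens[2], tokens[1], tokens[3])
    return None


# Ordered with the rule assumed to match most often first.
RULES = (rule_a, rule_b)


def first_match(tokens, rules=RULES):
    """Try each rule in order and return the first non-None result."""
    for rule in rules:
        result = rule(tokens)
        if result is not None:
            return result
    return None
```

The main loop then only needs to call `first_match(tokens)` and check for `None`, which keeps the details of each test out of the loop.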

**buran** · Jun-19-2017, 07:08 AM

(Jun-19-2017, 02:54 AM)Skaperen Wrote: that way the details of each type of test are kept away from the loop going through all the lines.

Not sure I understand that...
It exits the loop after the first match, so not all tests are performed.
I also was thinking that if you put the check in a separate function, it can make the code more clear and contained, although maybe a bit longer

By the way - I see I have a mistake in the above code - the else is part of the if statement, but it should be part of the for loop (as was my intention). In the above code the count of skipped is wrong.

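    import sys

    datadict = {}
    processed = skipped = 0  # counters, initialized here so the snippet runs on its own
    # each option: ({token index: expected value}, indexes of the tokens to store)
    options = [({4:'foo', 5:'bar', 9:'xyzzy', 10:'yzzyx', 14:'Skaperen'}, (1, 2, 3)),
               ({4:'bar', 5:'foo', 9:'yzzyx', 11:'xyzzy', 14:'Skaperen'}, (2, 1, 3))]
    for line in sys.stdin:
        tokens = line.rstrip().split()
        for opt, get_tokens in options:
            if all(tokens[k] == value for k, value in opt.items()):
                processed += 1
                datadict[tokens[0]] = tuple(tokens[i] for i in get_tokens)
                break
        else:
            # for/else: runs only when the inner loop finished without a break
            skipped += 1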
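
Putting the check in a separate function, as mentioned above, might look something like this. It is only a sketch: `OPTIONS` copies the table from the snippet above, while `match_line` and `load` are hypothetical names introduced here, and `load` takes any iterable of lines (for example `sys.stdin`) so that it can be tried on a small list.

```python
# Same rule table as above: ({token index: expected value}, indexes to store).
OPTIONS = [
    ({4: 'foo', 5: 'bar', 9: 'xyzzy', 10: 'yzzyx', 14: 'Skaperen'}, (1, 2, 3)),
    ({4: 'bar', 5: 'foo', 9: 'yzzyx', 11: 'xyzzy', 14: 'Skaperen'}, (2, 1, 3)),
]


def match_line(tokens, options=OPTIONS):
    """Return the tuple to store for the first matching option, else None."""
    for wanted, get_tokens in options:
        if all(tokens[k] == value for k, value in wanted.items()):
            return tuple(tokens[i] for i in get_tokens)
    return None


def load(lines, options=OPTIONS):
    """Build the dict and the two counters from an iterable of lines."""
    datadict = {}
    processed = skipped = 0
    for line in lines:
        tokens = line.split()  # split() with no argument ignores the trailing newline
        result = match_line(tokens, options)
        if result is None:
            skipped += 1
        else:
            processed += 1
            datadict[tokens[0]] = result
    return datadict, processed, skipped
```

Returning from the function on the first match replaces the break/else pair, so the skipped count cannot end up attached to the wrong statement. Lines with too few tokens would still raise `IndexError`, as in the snippets above.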

Skaperen · Jun-20-2017, 12:16 AM

those counts were just trivial details i tossed in to help show a complete structure. the first ugly real code (more proof of concept) version did not do counts.
