Python Forum

Full Version: looking for sweeter code to compare parts of a list
i have a big (nearly a million lines) file that is being read in, one line at a time.  each line is .split() and there are about 3 dozen tokens for each line.  several tokens are checked to select which lines are to be used.  the check is an equality check for a few different tokens.  the tokens being checked are not contiguous, such as 4, 5, 9, 10, and 12 are checked.  i am doing these checks with a big long if statement with lots of ands.  i am wondering if there is any sweeter way to code this kind of thing. a program i am working on today is doing a lot of this kind of thing, from lots of cloud data i have.

    datadict = {}
    for line in sys.stdin:
       tokens = line.rstrip().split()
       if tokens[4] == 'foo' and\
          tokens[5] == 'bar' and\
          tokens[9] == 'xyzzy' and\
          tokens[10] == 'yzzyx' and\
          tokens[12] == 'Skaperen':
           processed += 1
           datadict[tokens[0]] = (tokens[1],tokens[2],tokens[3])
       elif tokens[4] == 'bar' and\
            tokens[5] == 'foo' and\
            tokens[9] == 'yzzyx' and\
            tokens[11] == 'xyzzy' and\
            tokens[12] == 'Skapare':
           processed += 1
           datadict[tokens[0]] = (tokens[2],tokens[1],tokens[3])
       else:
          skipped += 1
comparing long slices is not practical since the data to be checked is on either side of data that can vary.
In line 7 you check the item with index=10, and on line 14 the one with index 11. Is that really so?
(Jun-18-2017, 06:08 AM)buran Wrote: In line 7 you check the item with index=10, and on line 14 the one with index 11. Is that really so?
the posted code is an example, not a real case.  but it is much like a real case in that some selection rules do involve testing different tokens.  the test of tokens[10] vs. the test of tokens[11], while not a real case, was intended to show that test cases can vary like that.
(Jun-18-2017, 07:46 AM)Skaperen Wrote: but it is much like a real case in that some selection rules do involve testing different tokens.
yeah, that was my question...
maybe something like this

datadict = {}
options = [({4:'foo', 5:'bar', 9:'xyzzy', 10:'yzzyx', 14:'Skaperen'}, (1, 2, 3)),
           ({4:'bar', 5:'foo', 9:'yzzyx', 11:'xyzzy', 14:'Skaperen'}, (2, 1, 3))]
for line in sys.stdin:
    tokens = line.rstrip().split()
    for opt, get_tokens in options:
        if all(tokens[k] == value for k, value in opt.items()):
            processed += 1
            datadict[tokens[0]] = tuple(tokens[i] for i in get_tokens)
            break
        else:
            skipped += 1
Note: I assume the initial values of processed and skipped are set before the snippet you provided. I'm not sure you really need processed (you can always check the len of datadict), unless you want to compare processed + skipped against the total number of records to process.
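A small caveat on the len(datadict) idea, sketched with made-up rows (the keys and values below are invented for illustration): the two numbers agree only if every tokens[0] key is unique, because a repeated key overwrites the earlier entry.

```python
# Hypothetical rows: (key, value) pairs standing in for tokens[0]
# and the tuple that would be stored for it.
rows = [("k1", "a"), ("k2", "b"), ("k1", "c")]

datadict = {}
processed = 0
for key, value in rows:
    processed += 1
    datadict[key] = value  # a repeated key replaces the earlier entry

# processed counts every matching line; len(datadict) counts distinct keys
print(processed, len(datadict))
```

So if duplicate keys are possible in the real data, a separate counter is still the safer way to count matched lines.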
i was thinking of making a function for each type of test, then applying them in order, most likely first. that way the details of each type of test are kept away from the loop going through all the lines. it turns out performance is not bad: 600000 records take only a few seconds, and even a few minutes would be acceptable (it will ultimately be run at most once an hour).
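a rough sketch of that idea, reusing the two example rules from the first post. the names rule_a, rule_b, RULES and select are made up for illustration, and the sketch assumes every line has at least 13 tokens:

```python
def rule_a(tokens):
    """Return the fields to store if the line matches the first pattern, else None."""
    if (tokens[4] == 'foo' and tokens[5] == 'bar' and tokens[9] == 'xyzzy'
            and tokens[10] == 'yzzyx' and tokens[12] == 'Skaperen'):
        return (tokens[1], tokens[2], tokens[3])
    return None


def rule_b(tokens):
    """Return the fields to store if the line matches the second pattern, else None."""
    if (tokens[4] == 'bar' and tokens[5] == 'foo' and tokens[9] == 'yzzyx'
            and tokens[11] == 'xyzzy' and tokens[12] == 'Skapare'):
        return (tokens[2], tokens[1], tokens[3])
    return None


# Order matters: put the rule expected to match most often first.
RULES = (rule_a, rule_b)


def select(tokens):
    """Apply the rules in order and return the first non-None result."""
    for rule in RULES:
        result = rule(tokens)
        if result is not None:
            return result
    return None


# Made-up sample lines, 13 tokens each.
line_a = "id1 a b c foo bar x x x xyzzy yzzyx x Skaperen"
line_b = "id2 a b c bar foo x x x yzzyx x xyzzy Skapare"
line_c = "id3 a b c nope nope x x x x x x x"

for line in (line_a, line_b, line_c):
    print(line.split()[0], select(line.split()))
```

the main loop then only has to call select(tokens) and count a None result as skipped.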
(Jun-19-2017, 02:54 AM)Skaperen Wrote: that way the details of each type of test are kept away from the loop going through all the lines.
Not sure I understand that...
It exits the loop after the first match, so not all tests are performed.
I was also thinking that if you put the check in a separate function, it can make the code clearer and more self-contained, although maybe a bit longer.
By the way, I see I made a mistake in the code I posted earlier: the else is attached to the if statement, but it should belong to the for loop (as was my intention). As posted, the count of skipped is wrong.

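import sys

datadict = {}
processed = skipped = 0  # counters assumed to start at zero
# each option: ({token index: required value}, indexes of the tokens to keep)
options = [({4:'foo', 5:'bar', 9:'xyzzy', 10:'yzzyx', 14:'Skaperen'}, (1, 2, 3)),
           ({4:'bar', 5:'foo', 9:'yzzyx', 11:'xyzzy', 14:'Skaperen'}, (2, 1, 3))]
for line in sys.stdin:
    tokens = line.rstrip().split()
    for opt, get_tokens in options:
        if all(tokens[k] == value for k, value in opt.items()):
            processed += 1
            datadict[tokens[0]] = tuple(tokens[i] for i in get_tokens)
            break
    else:
        # for/else: runs only when the inner loop finished without a break,
        # i.e. no option matched this line
        skipped += 1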
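The separate-function variant mentioned earlier could look roughly like this. It is only a sketch: pick_fields and load are hypothetical names, the length guard for short lines is my own addition, and the sample lines are made up. Returning from the function on the first match also avoids having to get the for/else placement right.

```python
# Same rule table as in the code above:
# ({token index: required value}, indexes of the tokens to keep).
OPTIONS = [
    ({4: 'foo', 5: 'bar', 9: 'xyzzy', 10: 'yzzyx', 14: 'Skaperen'}, (1, 2, 3)),
    ({4: 'bar', 5: 'foo', 9: 'yzzyx', 11: 'xyzzy', 14: 'Skaperen'}, (2, 1, 3)),
]


def pick_fields(tokens, options=OPTIONS):
    """Return the kept tokens for the first matching option, or None."""
    for wanted, keep in options:
        # Guard (an assumption, not in the posted code): lines too short to
        # hold every checked index simply do not match.
        if len(tokens) > max(wanted) and all(
                tokens[k] == v for k, v in wanted.items()):
            return tuple(tokens[i] for i in keep)
    return None


def load(lines):
    """Build the dict and both counters from an iterable of text lines."""
    datadict = {}
    processed = skipped = 0
    for line in lines:
        tokens = line.split()
        fields = pick_fields(tokens)
        if fields is None:
            skipped += 1
        else:
            processed += 1
            datadict[tokens[0]] = fields
    return datadict, processed, skipped


# Made-up sample input; the first two lines have 15 tokens each.
sample = [
    "id1 a b c foo bar x x x xyzzy yzzyx x x x Skaperen",
    "id2 a b c bar foo x x x yzzyx x xyzzy x x Skaperen",
    "id3 too short",
]
datadict, processed, skipped = load(sample)
print(datadict, processed, skipped)
```

For the real data, load(sys.stdin) (after import sys) should work the same way, since a file object is iterated line by line.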
those counts were just trivial details i tossed in to help show a complete structure. the first ugly real code (more proof of concept) version did not do counts.