Posts: 4,647
Threads: 1,494
Joined: Sep 2016
Jun-18-2017, 05:21 AM
(This post was last modified: Jun-18-2017, 05:21 AM by Skaperen.)
i have a big file (nearly a million lines) that is read in one line at a time. each line is .split() into about 3 dozen tokens. several tokens are checked for equality to select which lines are to be used. the tokens being checked are not contiguous; for example, 4, 5, 9, 10, and 12 are checked. i am doing these checks with one long if statement with lots of ands. i am wondering if there is any sweeter way to code this kind of thing. a program i am working on today does a lot of this, on lots of cloud data i have.
datadict = {}
for line in sys.stdin:
    tokens = line.rstrip().split()
    if tokens[4] == 'foo' and\
       tokens[5] == 'bar' and\
       tokens[9] == 'xyzzy' and\
       tokens[10] == 'yzzyx' and\
       tokens[12] == 'Skaperen':
        processed += 1
        datadict[tokens[0]] = (tokens[1],tokens[2],tokens[3])
    elif tokens[4] == 'bar' and\
         tokens[5] == 'foo' and\
         tokens[9] == 'yzzyx' and\
         tokens[11] == 'xyzzy' and\
         tokens[12] == 'Skapare':
        processed += 1
        datadict[tokens[0]] = (tokens[2],tokens[1],tokens[3])
    else:
        skipped += 1

comparing long slices is not practical since the data to be checked is on either side of data that can vary.
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Posts: 8,160
Threads: 160
Joined: Sep 2016
in line 7 you check the item with index 10, and on line 14 the one with index 11. is that really so?
Posts: 4,647
Threads: 1,494
Joined: Sep 2016
(Jun-18-2017, 06:08 AM)buran Wrote: in line 7 you check the item with index 10, and on line 14 the one with index 11. is that really so?
the posted code is an example, not a real case. but it is much like a real case in that some selection rules do involve testing different tokens. the test of tokens[10] vs. the test of tokens[11], while not a real case, was intended to show that the tests can vary like that.
Posts: 8,160
Threads: 160
Joined: Sep 2016
(Jun-18-2017, 07:46 AM)Skaperen Wrote: but it is much like a real case in that some selection rules do involve testing different tokens.
yeah, that was my question...
Posts: 8,160
Threads: 160
Joined: Sep 2016
Jun-18-2017, 08:17 AM
(This post was last modified: Jun-18-2017, 08:21 AM by buran.)
maybe something like this
datadict = {}
# each option: ({token index: required value}, indexes of the tokens to store)
options = [({4:'foo', 5:'bar', 9:'xyzzy', 10:'yzzyx', 12:'Skaperen'}, (1, 2, 3)),
           ({4:'bar', 5:'foo', 9:'yzzyx', 11:'xyzzy', 12:'Skapare'}, (2, 1, 3))]
for line in sys.stdin:
    tokens = line.rstrip().split()
    for opt, get_tokens in options:
        if all(tokens[k] == value for k, value in opt.items()):
            processed += 1
            datadict[tokens[0]] = tuple(tokens[i] for i in get_tokens)
            break
        else:
            skipped += 1

Note: I assume the initial values of processed and skipped are set before the snippet you provided. Not sure you really need processed (you can always check the len of datadict), unless you want to compare processed + skipped to the total number of records to process.
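One caveat on using the len of datadict in place of processed: the two only agree if the key (tokens[0]) never repeats, because a repeated key overwrites the earlier entry. A minimal sketch with made-up keys and values (the `selected` list is hypothetical and stands in for lines that passed the check):

```python
# Hypothetical (key, fields) pairs standing in for lines that passed the check.
selected = [
    ('k1', ('a', 'b', 'c')),
    ('k2', ('d', 'e', 'f')),
    ('k1', ('g', 'h', 'i')),  # same key as the first entry
]

datadict = {}
processed = 0
for key, fields in selected:
    processed += 1
    datadict[key] = fields  # a repeated key replaces the earlier value

print(processed, len(datadict))  # prints: 3 2
```

So a separate counter may still be worth keeping if duplicate keys are possible in the real data.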
Posts: 4,647
Threads: 1,494
Joined: Sep 2016
i was thinking of making a function for each type of test, then applying them in order with the most likely first. that way the details of each type of test are kept away from the loop going through all the lines. turns out performance is not bad at all. 600000 records only take a few seconds, and it would still be fine if it took a few minutes (it will ultimately be run at most once an hour).
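A rough sketch of how that one-function-per-test idea might look, using the example conditions from the first post. The names `rule_a`, `rule_b`, `RULES`, `select` and `load`, the sample lines, and the choice to skip lines with fewer than 13 tokens are all assumptions made for illustration, not part of the real program:

```python
def rule_a(tokens):
    """Return the fields to store if the line matches the first rule, else None."""
    if (tokens[4] == 'foo' and tokens[5] == 'bar' and tokens[9] == 'xyzzy'
            and tokens[10] == 'yzzyx' and tokens[12] == 'Skaperen'):
        return (tokens[1], tokens[2], tokens[3])
    return None


def rule_b(tokens):
    """Return the fields to store if the line matches the second rule, else None."""
    if (tokens[4] == 'bar' and tokens[5] == 'foo' and tokens[9] == 'yzzyx'
            and tokens[11] == 'xyzzy' and tokens[12] == 'Skapare'):
        return (tokens[2], tokens[1], tokens[3])
    return None


# Rules are tried in this order, so put the most likely one first.
RULES = (rule_a, rule_b)


def select(tokens):
    """Apply the rules in order and return the first non-None result."""
    for rule in RULES:
        result = rule(tokens)
        if result is not None:
            return result
    return None


def load(lines):
    """Build the dict from an iterable of lines (e.g. sys.stdin)."""
    datadict = {}
    for line in lines:
        tokens = line.split()
        if len(tokens) < 13:  # assumption: lines that are too short are skipped
            continue
        result = select(tokens)
        if result is not None:
            datadict[tokens[0]] = result
    return datadict


sample = [
    'k1 a b c foo bar x x x xyzzy yzzyx x Skaperen',
    'k2 a b c bar foo x x x yzzyx x xyzzy Skapare',
    'k3 a b c nope bar x x x xyzzy yzzyx x Skaperen',
]
print(load(sample))
```

The main loop then only knows about `select`, and adding or reordering tests means editing `RULES` rather than the loop.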
Posts: 8,160
Threads: 160
Joined: Sep 2016
(Jun-19-2017, 02:54 AM)Skaperen Wrote: that way the details of each type of test are kept away from the loop going through all the lines.
Not sure I understand that...
It exits the loop after the first match, so not all tests are performed.
I was also thinking that if you put the check in a separate function, it can make the code clearer and more contained, although maybe a bit longer.
By the way, I see I have a mistake in the code from my previous post: the else is attached to the if statement, but it should belong to the for loop (as was my intention). In that code the count of skipped is wrong.
datadict = {}
# each option: ({token index: required value}, indexes of the tokens to store)
options = [({4:'foo', 5:'bar', 9:'xyzzy', 10:'yzzyx', 12:'Skaperen'}, (1, 2, 3)),
           ({4:'bar', 5:'foo', 9:'yzzyx', 11:'xyzzy', 12:'Skapare'}, (2, 1, 3))]
for line in sys.stdin:
    tokens = line.rstrip().split()
    for opt, get_tokens in options:
        if all(tokens[k] == value for k, value in opt.items()):
            processed += 1
            datadict[tokens[0]] = tuple(tokens[i] for i in get_tokens)
            break
    else:
        # for/else: runs only when no option matched (the loop ended without break)
        skipped += 1
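For anyone who wants to try this without piping a file in, here is a self-contained sketch of the same table-driven idea wrapped in a function. The names `matches` and `select_lines`, the bounds check for short lines, and the sample data are my own assumptions; the conditions are the ones from the original question:

```python
# ({token index: required value}, indexes of the tokens to store)
OPTIONS = [
    ({4: 'foo', 5: 'bar', 9: 'xyzzy', 10: 'yzzyx', 12: 'Skaperen'}, (1, 2, 3)),
    ({4: 'bar', 5: 'foo', 9: 'yzzyx', 11: 'xyzzy', 12: 'Skapare'}, (2, 1, 3)),
]


def matches(tokens, opt):
    """True if every index in opt exists in tokens and holds the required value."""
    return all(k < len(tokens) and tokens[k] == v for k, v in opt.items())


def select_lines(lines, options=OPTIONS):
    """Return (datadict, processed, skipped) for an iterable of lines."""
    datadict = {}
    processed = skipped = 0
    for line in lines:
        tokens = line.split()
        for opt, get_tokens in options:
            if matches(tokens, opt):
                processed += 1
                datadict[tokens[0]] = tuple(tokens[i] for i in get_tokens)
                break
        else:
            # Reached only when the inner loop finished without a break,
            # so each unmatched line is counted exactly once.
            skipped += 1
    return datadict, processed, skipped


sample = [
    'k1 a b c foo bar x x x xyzzy yzzyx x Skaperen',
    'k2 a b c bar foo x x x yzzyx x xyzzy Skapare',
    'k3 a b c nope bar x x x xyzzy yzzyx x Skaperen',
    'too short',
]
datadict, processed, skipped = select_lines(sample)
print(datadict, processed, skipped)
```

With real input, `select_lines(sys.stdin)` should work the same way after `import sys`. The bounds check in `matches` is there so a short or empty line is counted as skipped instead of raising IndexError.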
Posts: 4,647
Threads: 1,494
Joined: Sep 2016
those counts were just trivial details i tossed in to help show a complete structure. the first ugly real code (more proof of concept) version did not do counts.