How can I make this as computationally efficient as possible?
#1
For work, I am cycling through ~24 billion combinations and checking 3 things on each combination. Hence, I need my currently working code to be made as computationally efficient as possible so it can run as quickly as possible. P.S. I decided to run it just to see how slow it is. It has done roughly 180x3 million combinations in ~20 minutes (I didn't record the start time). Honestly, this is MUCH faster than I expected... but it's still too slow. The code will run on Windows 10 in cmd with Python 3, if that matters at all.

However, as an example, I have made it 243x3 here.
Basically, the script cycles through each combination in sa_data[][], then prints the array positions and which line matches. It checks whether the same word appears 5x in one row. Think of it as a connect-5 game on a 3-row, 5-column table (https://en.wikipedia.org/wiki/Connect_Four).

I can't be bothered changing the names of all my variables, so if there is anything you don't understand, please ask. The general way I name my variables is type_name, e.g. ia_lines = int array of lines. s would be string, t would be temp. Note: I call my lists "arrays" (probably bad practice in Python).

Also, if any clarification is required on what I'm doing in the code, I can add a comment to help explain it.

The main ones which need to be efficient are getNum() and checkWin().

Code:
ia_lines = [[0,0,0,0,0], [-1,-1,-1,-1,-1], [1,1,1,1,1]] #[0] checks current line in array, [1] checks previous line in array, [2] checks next line in array.

sa_data = []

for x in range(5):
	sa_data.append([])
	
def getNum(x, y, i):
	#Wraps to the start of the column when the offset runs off the end; a -1 offset already wraps to the last element via Python's negative indexing.
	if y + ia_lines[i][x] == len(sa_data[x]):
		return 0
	else:
		return y + ia_lines[i][x]

def checkWin():
	#fout = open('Log.txt', 'w+')
	#read comment below about print() and fout.write()
	for a in range(len(sa_data[0])):
		for b in range(len(sa_data[1])):
			for c in range(len(sa_data[2])):
				for d in range(len(sa_data[3])):
					for e in range(len(sa_data[4])):
						for i in range(len(ia_lines)):
							if sa_data[0][getNum(0, a, i)] == sa_data[1][getNum(1, b, i)] == sa_data[2][getNum(2, c, i)] == sa_data[3][getNum(3, d, i)] == sa_data[4][getNum(4, e, i)]:
								print(str(a) + "," + str(b) + "," + str(c) + "," + str(d) + "," + str(e) + " on line " + str(i + 1) + " with 5 " + sa_data[0][getNum(0, a, i)])
								#fout.write(str(a) + "," + str(b) + "," + str(c) + "," + str(d) + "," + str(e) + " on line " + str(i + 1) + " with 5 " + sa_data[0][getNum(0, a, i)] + '\n')
								#I will replace print() with fout.write() after the script finishes running (I don't want to make changes just in case it affects anything).
	#fout.close()

with open('Text.txt', 'r') as fin:
	ti_reelNum = 0
	for s_Line in fin:
		s_Line = s_Line.strip().split('"')
		if len(s_Line) < 2:
			continue
		elif s_Line[1][0] == 'R':
			ti_reelNum += 1
		else:
			sa_data[ti_reelNum - 1].append(s_Line[1])

checkWin()
Text.txt:
<tag1>
    <t1 name = "R 1">
      <tag3 name="korea"/>
      <tag4 name="china"/>
      <tag5 name="japan"/>
    </t1>
    <t2 name = "R 2">
      <tag3 name="japan"/>
      <tag4 name="china"/>
      <tag5 name="korea"/>
    </t2>
    <t3 name = "R 3">
      <tag3 name="china"/>
      <tag4 name="korea"/>
      <tag5 name="japan"/>
    </t3>
    <t4 name = "R 4">
      <tag3 name="china"/>
      <tag4 name="japan"/>
      <tag5 name="korea"/>
    </t4>
    <t5 name = "R 5">
      <tag3 name="korea"/>
      <tag4 name="japan"/>
      <tag5 name="china"/>
    </t5>
</tag1>
Thank you in advance.
#2
You can pass sa_data and ia_lines as parameters to getNum() and checkWin().
Python checks the local variables first, and only if the name is not found there does it check the global scope. It's good to be local.
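For example, a minimal sketch of what that change might look like (untested, names kept the same); the two lists are passed in, so every lookup inside the hot loops is a local lookup:

def getNum(x, y, i, sa_data, ia_lines):
    # Same wrap-around logic as before, but sa_data and ia_lines are now locals.
    if y + ia_lines[i][x] == len(sa_data[x]):
        return 0
    return y + ia_lines[i][x]

def checkWin(sa_data, ia_lines):
    for a in range(len(sa_data[0])):
        for b in range(len(sa_data[1])):
            for c in range(len(sa_data[2])):
                for d in range(len(sa_data[3])):
                    for e in range(len(sa_data[4])):
                        for i in range(len(ia_lines)):
                            if (sa_data[0][getNum(0, a, i, sa_data, ia_lines)]
                                    == sa_data[1][getNum(1, b, i, sa_data, ia_lines)]
                                    == sa_data[2][getNum(2, c, i, sa_data, ia_lines)]
                                    == sa_data[3][getNum(3, d, i, sa_data, ia_lines)]
                                    == sa_data[4][getNum(4, e, i, sa_data, ia_lines)]):
                                print(a, b, c, d, e, "on line", i + 1,
                                      "with 5", sa_data[0][getNum(0, a, i, sa_data, ia_lines)])

checkWin(sa_data, ia_lines)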

You can use lxml to parse the document and make the code more readable. I don't know how efficient it will be.

from lxml import etree

with open('Text.txt', 'rb') as fin:
    tree = etree.parse(fin)

ti_reelNum = 0
for tag in tree.iterfind(".//*[@name]"):
    print(tag.attrib['name'])  # just for test
    # your code here
I think scipy could handle a data structure like sa_data in a much more efficient way, but I have never used it, so I can't help here.
These nested for loops...
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#3
(Apr-15-2018, 07:27 AM)wavic Wrote: You can pass sa_data and ia_lines as parameters to getNum() and checkWin(). Python checks the local variables first, and only if the name is not found there does it check the global scope.
Thanks. I will do that.

I'm not too fussed about readability. I just need to process billions of checks AQAP (as quickly as possible).
#4
I think you can speed things up by generating the individual matching lines first. The following code prints a record for each word appearing in all five columns. Each record contains five lists of all the indexes where the word appears in the corresponding column of sa_data.

from collections import defaultdict, namedtuple

def word_idx_dict(seq):
    di = defaultdict(list)
    for i, k in enumerate(seq):
        di[k].append(i)
    return di

def line_matches(sa_data):
    dics = [word_idx_dict(seq) for seq in sa_data]
    # get words appearing in all the columns
    s = set(dics[0])
    for d in dics[1:]:
        s &= set(d)
    words = sorted(s)
    Record = namedtuple('Record', 'word rownums')
    for w in words:
        yield Record(word=w, rownums=[d[w] for d in dics])

for rec in line_matches(sa_data):
    print(rec)
I think this sequence of records is fast to generate and it should be a better starting point than the raw sa_data array.
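As a rough follow-up sketch (untested): for the straight line (ia_lines[0], where every offset is 0), the rownums lists already give you every winning combination as a cartesian product, so you only ever touch the matching ones:

from itertools import product

for rec in line_matches(sa_data):
    # Every choice of one row index per column is a win with rec.word on the straight line.
    for combo in product(*rec.rownums):
        print(combo, "on line 1 with 5", rec.word)

For the other two lines (the -1 and +1 offsets) you would shift each matching index back by the line's offset, modulo the column length, to recover the combination that lands on it.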
#5
Try it. I can't tell whether it will be faster or not. I just don't have the data, and I can't be bothered to generate some right now. lxml is a C library and it's fast. I am using it with BeautifulSoup for quickly parsing web pages. Text.txt looks like XML data.

I have no time to look at the other code right now. I am too tired. Bad sleep.
What do you want to accomplish?
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#6
(Apr-15-2018, 10:05 AM)Gribouillis Wrote: I think this sequence of records is fast to generate and it should be a better starting point than the raw sa_data array
Yes, I watched Ned's video today, and it had enumerations and dictionaries in it. Thanks.

(Apr-15-2018, 10:08 AM)wavic Wrote: I am using it with BeautifulSoup for quickly parsing web pages.
The parsing is not an issue. It takes a second and is done only once.

I have optimised the code a bit, but now I am writing a C++ version. If the C++ version is much faster, I will stick with it. Otherwise, I will come back and further refine my Python (I may even do so just for fun when I have time).

By the way, I read some Stack Overflow answers yesterday about local, global and class variables. Could someone please try to explain how a class variable could be THAT much more efficient than a local/global variable?
#7
Uhh... so after some optimisation, I took my 27-hour runtime down to 5 minutes... for 25 billion cycles...
I still haven't used dictionaries or multithreading...
And to think that C++ could be 10x or more faster than this...
#8
Try Nuitka first.
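If you have it installed, the usual way to run it is something like python -m nuitka your_script.py (your_script.py being a placeholder), which compiles the script to a native binary; the exact options are worth checking in the Nuitka docs.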
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#9
(Apr-17-2018, 11:15 AM)IAMK Wrote: Uhh... so after some optimisation, I took my 27-hour runtime down to 5 minutes... for 25 billion cycles...
I still haven't used dictionaries or multithreading...
And to think that C++ could be 10x or more faster than this...

So is that good, or are we still making it faster? What does the current code look like?