Python Forum

Hi all,

In these days I wrote a script. It works very well but a piece of his code is very slow because of a custom function. you can see it below:

def split_rules(rules, *index):
    rule = ""
    if len(index) == 1:
        rules_splitted = rules.split("\n")
        first_rule_line = rules_splitted[0]
        index_0 = index[0]
        rule_name = first_rule_line.split()[index_0]
        for line in rules_splitted:
            list = line.split()
            if rule_name == list[index_0]:
                rule = rule + line + "\n"
            else:
                rule_name = line.split()[index_0]
                rule = rule + "\n" + line + "\n"
        rules = rule.rstrip("\n")   
    elif len(index) == 2:
        rules_splitted = rules.split("\n")
        first_rule_line = rules_splitted[0]
        first_rule_line_splitted = first_rule_line.split()
        index_0 = index[0]
        index_1 = index[1]
        rule_name_1 = first_rule_line_splitted[index_0]
        rule_name_2 = first_rule_line_splitted[index_1]
        for line in rules_splitted:
            list = line.split()
            if rule_name_1 == list[index_0] and rule_name_2 == list[index_1]:
                rule = rule + line + "\n"
            else:
                rule_name_1 = list[index_0]
                rule_name_2 = list[index_1]
                rule = rule + "\n" + line + "\n"
        rules = rule.rstrip("\n")    
    elif len(index) == 3:
        rules_splitted = rules.split("\n")
        first_rule_line = rules_splitted[0]
        first_rule_line_splitted = first_rule_line.split()
        index_0 = index[0]
        index_1 = index[1]
        index_2 = index[2] 
        rule_name_1 = first_rule_line_splitted[index_0]
        rule_name_2 = first_rule_line_splitted[index_1]
        rule_name_3 = first_rule_line_splitted[index_2]
        for line in rules_splitted:
            list = line.split()
            if rule_name_1 == list[index_0] and rule_name_2 == list[index_1] and rule_name_3 == list[index_2]: 
                rule = rule + line + "\n"
            else:
                rule_name_1 = list[index_0]
                rule_name_2 = list[index_1]
                rule_name_3 = list[index_2]
                rule = rule + "\n" + line + "\n"
        rules = rule.rstrip("\n")   
    return(rules)

the "split_rules" function splits a text file containing many policies. what do I mean with policies? I mean anything like that:

Quote:from-zone A to-zone B policy CICCIO ...
from-zone A to-zone B policy CICCIO ...
from-zone A to-zone B policy PIPPO ...
from-zone X to-zone Y policy PIPPO ...
from-zone X to-zone Y policy PIPPO ...
from-zone Z to-zone Z policy PLUTO ...
from-zone Z to-zone Z policy PLUTO ...
from-zone Z to-zone Z policy PLUTO ...
from-zone Z to-zone Z policy PAPERINO ...
from-zone Z to-zone Z policy PAPERINO ...
from-zone Z to-zone Z policy CICCIO ...
from-zone Z to-zone Z policy CICCIO ...

this text file is composed by five policies (you can identify them keeping in mind three parameters: source zone, destination zone and policy name). my custom function can split them with a duble new line and save them in a new variable. it's the result:

text = open("test.txt", "r")
text = text.read()

print(split_rules(text, 1, 3, 5))

# print output:
#
# from-zone A to-zone B policy CICCIO ...
# from-zone A to-zone B policy CICCIO ...
# 
# from-zone A to-zone B policy PIPPO ...
# 
# from-zone X to-zone Y policy PIPPO ...
# from-zone X to-zone Y policy PIPPO ...
# 
# from-zone Z to-zone Z policy PLUTO ...
# from-zone Z to-zone Z policy PLUTO ...
# from-zone Z to-zone Z policy PLUTO ...
# 
# from-zone Z to-zone Z policy PAPERINO ...
# from-zone Z to-zone Z policy PAPERINO ...
# 
# from-zone Z to-zone Z policy CICCIO ...
# from-zone Z to-zone Z policy CICCIO ...

this function works, it reaches his goal very well but it seems that it is not enough efficent when it have to process more than 40000 lines. with 40000 lines it takes around 2 minutes to process the policies, but with 150000 lines it takes around 15 minutes!

with many lines it's not scalable. how can I improve my custom function to become it scalable and more efficient than before?

Among the many things that can be sped up, the most blatant one is the construction of the output string by the statements

rule = ''
...
rule = rule + morerule
...
return rule

Instead of this, you could append lines to a list

outrule = []
...
outrule.append(some_line)
...
return "\n".join(outrule)

in addition to @Gribouillis' advise, to simplify your code:
create a function that parse a single line. make it return a tuple. compare tuples, not single elements.

def parse_line(line, items):
    split_line = line.split()
    return tuple(split_line[idx] for idx in items)

def split_rules(rules, *items):
    result = []
    prev_rule = None
    for line in rules.split('\n'):
        rule = parse_line(line, items)
        if prev_rule and rule != prev_rule:
            result.append('')
        prev_rule = rule
        result.append(line)
    return result

rules = """from-zone A to-zone B policy CICCIO ...
from-zone A to-zone B policy CICCIO ...
from-zone A to-zone B policy PIPPO ...
from-zone X to-zone Y policy PIPPO ...
from-zone X to-zone Y policy PIPPO ...
from-zone Z to-zone Z policy PLUTO ...
from-zone Z to-zone Z policy PLUTO ...
from-zone Z to-zone Z policy PLUTO ...
from-zone Z to-zone Z policy PAPERINO ...
from-zone Z to-zone Z policy PAPERINO ...
from-zone Z to-zone Z policy CICCIO ...
from-zone Z to-zone Z policy CICCIO ..."""

print('\n'.join(split_rules(rules, 1, 3, 5)))

Output:from-zone A to-zone B policy CICCIO ...
from-zone A to-zone B policy CICCIO ...

from-zone A to-zone B policy PIPPO ...

from-zone X to-zone Y policy PIPPO ...
from-zone X to-zone Y policy PIPPO ...

from-zone Z to-zone Z policy PLUTO ...
from-zone Z to-zone Z policy PLUTO ...
from-zone Z to-zone Z policy PLUTO ...

from-zone Z to-zone Z policy PAPERINO ...
from-zone Z to-zone Z policy PAPERINO ...

from-zone Z to-zone Z policy CICCIO ...
from-zone Z to-zone Z policy CICCIO ...

In addition to buran's code, you could consider using itertools.groupby() to group lines together. Also the function could be made more versatile by accepting file handles for input and output instead of strings. Alternately, it could take an iterable as input and generate lines of output with yield statements.

aquerci

Gribouillis

buran

Gribouillis