Python Forum

Full Version: Optimizing a code
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Hello everyone,

I create an algorithm that will give me the path of the shortest interactions between 2 proteins, I explain:

I have a first file that contains the pairs of proteins to test (ex):

prot1 prot2 (the string will start with prot1 and end with prot2)

prot3 prot4

ect ..

My second file is a binary interactions file between different proteins, eg:

prot15 prot20

prot3 prot1

prot100 prot21

prot1 prot16

prot16 prot2

ect ...

for example here the shortest path between prot1 and prot2 is: prot1 prot16 prot2

Here is my code:

import sys
sys.setrecursionlimit(1000000)
from collections import defaultdict
dd = defaultdict(set)
P=[]
L=[]
 
#File containing the test proteins, I put all the couples in a list
with open("C:/Users/lveillat/Desktop/Chaines minimales/3080_interactions_pour_736.txt","r") as f0:
    for lignes0 in f0:
        P.append(lignes0.rstrip("\n").split(" ")[:2])

 
#File containing all the interactions, I make a dictionary with a protein key and value all the proteins with which it interacts
with open("736(int_connues_avec_scores_sup_0).txt","r") as f1:
    for lignes in f1:
        if lignes.startswith('9606'):
            lignes=lignes.rstrip('\n').split(" ")
            prot1=lignes[2]
            prot2=lignes[3]
            dd[prot1].add(prot2)
    #print(dd)
 
#Function allowing me to build interchange channels
def chain(proteine1, proteine2, maillon, pathway, limite=10):
    next_= maillon.get(pathway[-1],None)
 
    if len(pathway) < limite  :
        for m in next_:
            if m not in pathway:
 
                if m != proteine2:
                    yield from chain(proteine1, proteine2, maillon, pathway + [m])
 
                elif pathway[0]==proteine1::
                        pathway.append(m)
                        yield pathway
 
#For my protein couples to study
for c in P:
    #print(c)
    proteine1=c[0]
    proteine2=c[1]
    L.clear()
    print("The first protein in the chain is", proteine1)
    print("The last protein in the chain is", proteine2)
    print("")
    print("Minimal size chain(s):")
    print("")
 
    #I put in a list all possible chains of interactions starting with protein1 and ending with protein2
    for k in dd:
        for z in chain(proteine1,proteine2, dd, pathway = [k]):
            if z not in L:
                L.append(z)

 
    # I display the smallest elements of the list = the shortest chains
    
            min_len=min(len(z) for z in L)
            mins=[z for z in L if len(z) == min_len]

    for chaine in mins: 
        if chaine[-1]==proteine2:
            print(' '.join(chaine))
        else:
            print('ERROR - NO CHAINS BELOW THE LIMIT', )
 
    print("")
    print("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
    print("")
The problem is that it is extremely slow ... are there lines that you think can be optimized and save me computing time?

Thank you :)
It looks like a well known shortest path problem for which good algorithms are known and available in python, such as Dijkstra's algorithm. Have you done some research in this direction? Graph related python modules such as networkx have functions that compute the shortest path directly. Functions also exist in scipy. Why not use them?
Following is based on opening statement: "I create an algorithm that will give me the path of the shortest interactions between 2 proteins".

I would read data from file into dictionary or defaultdict so that key is node and values are edges.

With dictionary:

data = dict()                                                              
with open('proteins.txt', 'r') as f: 
    for row in f: 
        k, v = row.strip().split() 
        data.setdefault(k, set()).add(v)
Or with defauldict:

data = defaultdict(set)
with open('proteins.txt', 'r') as f: 
    for row in f: 
        k, v = row.strip().split() 
        data[k].add(v)
Now when I have data nicely structured I implement shortest path algorithm. You can find one from python. org: Python Patterns - Implementing Graphs.
(Jul-11-2019, 03:48 PM)Gribouillis Wrote: [ -> ]It looks like a well known shortest path problem for which good algorithms are known and available in python, such as Dijkstra's algorithm. Have you done some research in this direction? Graph related python modules such as networkx have functions that compute the shortest path directly. Functions also exist in scipy. Why not use them?

Yes indeed I had seen this module networkx, however being a beginner in python, I absolutely did not understand how it worked ... I do not even see how to generate my tree (all combinations of possible proteins) to the operate with networkx ...
Here is a link to a page with simple networkx operations, including shortest path.