I am using Python 3.7 in the latest version of Spyder.
I have a list of about 500 chemical molecule IDs (ten digits each), which I selected for significant recognition of a protein from among 300,000 candidate drug molecules in one large SDF file of about 2 GB.
Now I need to extract the 500 complete SDF records (each spans about 400 lines, from its ID to the final $$$$ line) from the large 2 GB SDF file and join them into one new "best" SDF file, which I need to fine-tune further studies with them.
After days of googling and asking around, the insufficient code below is the best I could come up with.
I tested the code with a txt file containing only 3 IDs (3bestSN.txt) to be selected from an SDF file of only 5 molecules (SNII.5mol.sdf).
However, it did not work as expected, and it was very, VERY slow!
Could anybody suggest an alternative solution?
I would really appreciate it!
Thanks, julio
Here is the final code of my best (lousy) approximation:
# ---------------------------------------------------------------- VARIABLE DECLARATIONS
input_f1 = "3bestSN.txt"    # mimic of the ~500 best IDs, one SNxxxxxxxx per line
input_f2 = "SNII.5mol.sdf"  # mimic of a drug library of 300000 molecules
output_f3 = "best_SN.sdf"   # selected records are written here
past_mol = []               # IDs actually found in the library
found = False               # True while reading inside a wanted record
# ----------------------------------------------------------------

# Read every wanted ID into a set (fast lookups), skipping blank lines.
# The original loop kept only the LAST ID of the file for the comparison.
with open(input_f1, "r", errors="ignore") as f1:
    best_ids = {line.strip() for line in f1 if line.strip()}
print(best_ids)

# Assumption: the ID is the first line of its record, as described above.
with open(input_f2, "r", errors="ignore") as f2, open(output_f3, "w") as f3:
    for line in f2:
        sline = line.rstrip()
        if not found and sline in best_ids:
            # ID line of a wanted molecule: start copying
            found = True
            past_mol.append(sline)
        if found:
            f3.write(line)
            if sline == "$$$$":
                # the $$$$ terminator is written too, then copying stops
                found = False

print("this is past_mol:", past_mol)
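One possible alternative, offered only as a sketch: split the library into whole records first and compare just the first line of each record against a set of IDs. This assumes that the ID is the first (title) line of every record and that every record ends with a $$$$ line. The names iter_sdf_records, wanted_records and copy_wanted are made up for this illustration and do not come from any library.

```python
def iter_sdf_records(lines):
    """Yield each SDF record as a list of lines, ending with its '$$$$' line."""
    record = []
    for line in lines:
        record.append(line)
        if line.rstrip() == "$$$$":
            yield record
            record = []


def wanted_records(sdf_lines, wanted_ids):
    """Yield only the records whose first line is one of the wanted IDs.

    Assumption: the molecule ID is the first (title) line of each record.
    """
    wanted = set(wanted_ids)  # set membership is checked in constant time
    for record in iter_sdf_records(sdf_lines):
        if record[0].strip() in wanted:
            yield record


def copy_wanted(ids_path, sdf_path, out_path):
    """Stream the library once and write the wanted records to out_path.

    Returns the set of IDs that were actually found.
    """
    with open(ids_path, errors="ignore") as ids_file:
        ids = {line.strip() for line in ids_file if line.strip()}
    found = set()
    with open(sdf_path, errors="ignore") as sdf, open(out_path, "w") as out:
        for record in wanted_records(sdf, ids):
            out.writelines(record)
            found.add(record[0].strip())
    return found
```

With the file names from the post, the call would be copy_wanted("3bestSN.txt", "SNII.5mol.sdf", "best_SN.sdf"). The library is read line by line in a single pass, so only one record is held in memory at a time, and comparing the returned set with the ID list shows which IDs were not found.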