extracting sublist from a large multiple molecular file

juliocollm · (This post was last modified: May-25-2020, 11:13 AM by Yoriz.)

I am using Phyton 3.7 in last version of Spider

I have a list of about 500 chemical molecule IDs (ten number digits each) which I selected for significant recognition of a protein among 300000 molecule-drug possibilities in one large sdf file of about 2G.

Now, I need to extract the complete 500 sdf records (each one contains about 400 lines between their ID to the final $$$$ line) from the large sdf 2G file and then to rejoin all 500 in one new "best" sdf file, required to fine tune further studies with them.

After googling and consulting during days, I came out to my best unsufficient code.

The code was tested with a txt file containing only 3 IDs (3bestSN.txt) to be selected among an sdf file of only 5 molecules (SNII.5mol.sdf).
However, it did not work as it was expected, and it was very VERY slow !!!!.

Could anybody suggest an alternative solution?
I will really appreciatted it!!!

Thanks, julio

Here is the final code of my best-lousy aproximation:

import sys
#----------------------------------------------------------------DECLARACION DE VARIABLES 
input_f1 = "3bestSN.txt"
input_f2 = "SNII.5mol.sdf"
output_f3 = "best_SN.sdf"
i = 0 # nº SN readed
j = 0  # nº Molecules $$$$ readed
lines1 = [] # mimic of the bestSN ~500 IDs SNxxxxxxxx
lines2 = [] # mimic of a Druglibrary of 300000 molecule IDs SNxxxxxxxx
past_mol = []
found = False
chunk = []

#·----------------------------------------------------------------
f1 = open(input_f1, "r", errors='ignore')    # best SN file to open
#print("Input file1: ", input_f1) 
lines1 = f1.read().split("\n")          # Create a list containing each SN in one line
#print(lines1)
for i in lines1:
    linea_actual = i.rstrip()
    print(linea_actual)
    f1.close() 
    f2 = open(input_f2, "r", errors='ignore')
    f3 = open(output_f3 + '.sdf', 'w')
    #print("Input file2: ", input_f2)    
    for line in f2:
        sline = line.rstrip()
        if sline == linea_actual:
            found = True
            past_mol.append(sline)
        elif sline == '$$$$':
            chunk.append(line)                
            found = False
        if found:
            chunk.append(line)
            f3.write(line)
            print("este es el chunk  ", chunk)
            print("este es el past_mol", past_mol)
f2.close()

jefsummers · May-25-2020, 11:05 AM

Please repost the code using python tags, rather then phyton. The language is named python, after Monty Python, not phyton after part of a plant.
Using the tags preserves indentation which is very important in Python.

juliocollm · (This post was last modified: May-25-2020, 01:11 PM by Yoriz.)

sorry for the h mistake and thank you for the correction........

I was wondering why the code did not look as in the Spyder

IT LOOKS MUCH BETTER NOW!

in case it may help
I copy only part of the response I got in printing the chunk.
The whole response is too long and repetitive.
A similar response but no writing into the f3 file I also got for the past_mol

click this to reveal the chunk of output

este es el chunk ['$$$$\n', 'SN00061212\n', ' MOLSOFT 05232012283D\n', '\n', ' 29 31 0 0 1 0 0 0 0 0999 V2000\n', ' -2.5297 -2.4022 2.4992 C 0 0 0\n', ' -1.8603 -3.4215 1.5206 C 0 0 1\n', ' -2.5952 -4.6872 1.6711 C 0 0 0\n', ' -3.2796 -5.3088 0.6861 C 0 0 0\n', ' -3.2358 -4.8021 -0.7014 C 0 0 1\n', ' -3.0525 -3.2406 -0.7393 C 0 0 1\n', ' -1.6967 -2.9261 0.0084 C 0 0 2\n', ' -0.5549 -3.7391 -0.7170 C 0 0 0\n', ' -0.7971 -5.1562 -0.8270 O 0 0 0\n', ' -2.0569 -5.5631 -1.4200 C 0 0 2\n', ' -2.0751 -5.6523 -2.9172 C 0 0 0\n', ' -3.2273 -6.1186 -3.5968 C 0 0 0\n', ' -3.2244 -6.3264 -4.9890 C 0 0 0\n', ' -2.0595 -6.0874 -5.7506 C 0 0 0\n', ' -0.9108 -5.5949 -5.0902 C 0 0 0\n', ' -0.9223 -5.3978 -3.6942 C 0 0 0\n', ' 0.2491 -5.3726 -5.7884 O 0 0 0\n', ' 0.4796 -3.9888 -6.1095 C 0 0 0\n', ' -1.9315 -6.3428 -7.0948 O 0 0 0\n', ' -2.7915 -6.7823 -8.0650 C 0 0 0\n', ' -4.0152 -6.9626 -7.8665 O 0 0 0\n', ' -2.2362 -7.0252 -9.4057 C 0 0 0\n', ' -1.2894 -1.4088 -0.0913 C 0 0 0\n', ' -0.0575 -1.1356 0.5988 O 0 0 0\n', ' 0.6757 0.0167 0.5273 C 0 0 0\n', ' 0.3414 0.9585 -0.2291 O 0 0 0\n', ' 1.9030 0.1537 1.3244 C 0 0 0\n', ' -3.2514 -2.6187 -2.1571 C 0 0 0\n', ' -3.9800 -6.5586 0.9889 C 0 0 0\n', ' 2 1 1\n', ' 2 7 1\n', ' 2 3 1\n', ' 3 4 2\n', ' 5 4 1\n', ' 4 29 1\n', ' 5 10 1\n', ' 5 6 1\n', ' 6 7 1\n', ' 6 28 1\n', ' 7 8 1\n', ' 7 23 1\n', ' 8 9 1\n', ' 10 9 1\n', ' 10 11 1\n', ' 11 16 2\n', ' 11 12 1\n', ' 12 13 2\n', ' 13 14 1\n', ' 14 15 2\n', ' 14 19 1\n', ' 15 16 1\n', ' 15 17 1\n', ' 17 18 1\n', ' 19 20 1\n', ' 20 21 2\n', ' 20 22 1\n', ' 23 24 1\n', ' 24 25 1\n', ' 25 26 2\n', ' 25 27 1\n', 'M END\n', '> <Molecule Name>\n', 'SN00061212\n', '\n', '> <128>\n', '5.063022e+18\n', '\n', '> <SNID>\n', 'SN00061212\n', '\n', '> <Molecule (InChI Key)>\n', 'VUNBDHKQBFCIFQ-SZXWFAPYSA-N\n', '\n', '> <Aromatic Atoms Count>\n', '6\n', '\n', '> <Aromatic Bonds Count>\n', '6\n', '\n', '> <Bond Count>\n', '31\n', '\n', '> <Hydrogen Bond Acceptors>\n', '4\n', '\n', '> <Hydrogen Bond Donors>\n', '0\n', '\n', '> <Rotatable Bonds Count>\n', '13\n', '\n', '> <Topological Polar Surface Area>\n', '71.06\n', '\n', '> <Molecular Weight>\n', '402.204239\n', '\n', '> <Molecular Formula>\n', 'C23H30O6\n', '\n', '> <Heavy Atoms Count>\n', '29\n', '\n', '> <IUPAC_name>\n', '[(1S.4S.5R.8S.9S)-4-[4-(acetyloxy)-3-methoxyphenyl]-6.8.9-trimethyl-3-oxabicyclo[3.3.1]non-6-en-1-yl]methyl acetate\n', '\n', '> <mass>\n', '402.48\n', '\n', '> <empirical_formula>\n', 'C23H30O6\n', '\n', '> <atomic_composition_%w>\n', 'C (68.64%) - H (7.51%) - O (23.85%)\n', '\n', '> <formal_charge>\n', '0\n', '\n', '> <H_bonds_acceptors>\n', '4\n', '\n', '> <H_bonds_donors>\n', '0\n', '\n', '> <logP>\n', '2.91\n', '\n', '> <topological_PSA>\n', '71.06\n', '\n', '> <molecular_polarizability>\n', '42.83\n', '\n', '> <molecular_polarizability_avg>\n', '44.24\n', '\n', '> <rotatable_bonds>\n', '7\n', '\n', '> <SMILES_format>\n', 'COC1=C(OC©=O)C=CC(=C1)[C@H]1OC[C@]2(COC©=O)[C@@H]©C=C©[C@H]1[C@@H]2C\n', '\n', '> <logD>\n', '0.00\t 2.91\n', ' 1.00\t 2.91\n', ' 2.00\t 2.91\n', ' 3.00\t 2.91\n', ' 4.00\t 2.91\n', ' 5.00\t 2.91\n', ' 6.00\t 2.91\n', ' 7.00\t 2.91\n', ' 8.00\t 2.91\n', ' 9.00\t 2.91\n', ' 10.00\t 2.91\n', ' 11.00\t 2.91\n', ' 12.00\t 2.91\n', ' 13.00\t 2.91\n', ' 14.00\t 2.91\n', '\n', '> <Balaba_index>\n', '1.57\n', '\n', '> <Harary_index>\n', '116.7\n', '\n', '> <hyper_Wiener_index>\n', '8998\n', '\n', '> <Platt_index>\n', '92\n', '\n', '> <Randic_index>\n', '13.69\n', '\n', '> <Szeged_index>\n', '3830\n', '\n', '> <Wiener_index>\n', '2200\n', '\n', '> <Wiener_polarity>\n', '54\n', '\n', '> <vdwaals_surface_area>\n', '622.51\n', '\n', '> <Accessible_Surface_Area>\n', '667.57\n', '\n', '> <ASA_positive>\n', '531.35\n', '\n', '> <ASA_negative>\n', '136.22\n', '\n', '> <ASA_hydrophobic>\n', '566.71\n', '\n', '> <ASA_polar>\n', '100.86\n', '\n', '> <charge_distribution>\n', '0.00\t 0.00\n', ' 1.00\t 0.00\n', ' 2.00\t 0.00\n', ' 3.00\t 0.00\n', ' 4.00\t 0.00\n', ' 5.00\t 0.00\n', ' 6.00\t 0.00\n', ' 7.00\t 0.00\n', ' 8.00\t 0.00\n', ' 9.00\t 0.00\n', ' 10.00\t 0.00\n', ' 11.00\t 0.00\n', ' 12.00\t 0.00\n', ' 13.00\t 0.00\n', ' 14.00\t 0.00\n', '\n', '> <H_bonds_acceptors_multiplicity>\n', '8\n', '\n', '> <H_bonds_donors_multiplicity>\n', '0\n', '\n', '> <H_bonds_accep_avg_multiplicity>\n', '0.00\t 8.00\n', ' 1.00\t 8.00\n', ' 2.00\t 8.00\n', ' 3.00\t 8.00\n', ' 4.00\t 8.00\n', ' 5.00\t 8.00\n', ' 6.00\t 8.00\n', ' 7.00\t 8.00\n', ' 8.00\t 8.00\n', ' 9.00\t 8.00\n', ' 10.00\t 8.00\n', ' 11.00\t 8.00\n', ' 12.00\t 8.00\n', ' 13.00\t 8.00\n', ' 14.00\t 8.00\n', '\n', '> <H_bonds_donor_avg_multiplicity>\n', '0.00\t 0.00\n', ' 1.00\t 0.00\n', ' 2.00\t 0.00\n', ' 3.00\t 0.00\n', ' 4.00\t 0.00\n', ' 5.00\t 0.00\n', ' 6.00\t 0.00\n', ' 7.00\t 0.00\n', ' 8.00\t 0.00\n', ' 9.00\t 0.00\n', ' 10.00\t 0.00\n', ' 11.00\t 0.00\n', ' 12.00\t 0.00\n', ' 13.00\t 0.00\n', ' 14.00\t 0.00\n', '\n', '> <refractivity>\n', '108.6\n', '\n', '> <resonant_structures>\n', '12\n', '\n', '> <number_of_bonds>\n', '61\n', '\n', '> <number_of_double_bonds>\n', '0\n', '\n', '> <aliphatic_bonds>\n', '25\n', '\n', '> <aromatic_bonds>\n', '6\n', '\n', '> <chain_bonds>\n', '15\n', '\n', '> <number_of_ring_bonds>\n', '16\n', '\n', '> <number_of_atoms>\n', '59\n', '\n', '> <aliphatic_atoms>\n', '23\n', '\n', '> <aliphatic_rings>\n', '2\n', '\n', '> <aliphatic_rings_5_atoms>\n', '0\n', '\n', '> <aliphatic_rings_6_atoms>\n', '2\n', '\n', '> <C_aliphatic_rings>\n', '1\n', '\n', '> <fused_aliphatic_rings>\n', '2\n', '\n', '> <hetero_aliphatic_rings>\n', '1\n', '\n', '> <aromatic_atoms>\n', '6\n', '\n', '> <aromatic_rings>\n', '1\n', '\n', '> <aromatic_rings_5_atoms>\n', '0\n', '\n', '> <aromatic_rings_6_atoms>\n', '1\n', '\n', '> <C_aromatic_rings>\n', '1\n', '\n', '> <fused_aromatic_rings>\n', '0\n', '\n', '> <hetero_aromatic_rings>\n', '0\n', '\n', '> <chain_atoms>\n', '14\n', '\n', '> <cyclomatic_num>\n', '3\n', '\n', '> <C_rings>\n', '2\n', '\n', '> <atom_in_ring>\n', '15\n', '\n', '> <number_of_rings>\n', '3\n', '\n', '> <atoms_in_rings>\n', '0;1;1;1;2;2;2;1;1;1;1;1;1;1;1;1;0;0;0;0;0;0;0;0;0;0;0;0;0\n', '\n', '> <rings_5_atoms>\n', '0\n', '\n', '> <rings_6_atoms>\n', '3\n', '\n', '> <ring_systems>\n', '2\n', '\n', '> <ring_systems_5_atoms>\n', '0\n', '\n', '> <ring_systems_6_atoms>\n', '0\n', '\n', '> <fused_rings>\n', '2\n', '\n', '> <hetero_rings>\n', '1\n', '\n', '> <largest_ring_size>\n', '6\n', '\n', '> <largest_ring_system_size>\n', '2\n', '\n', '> <smallest_ring_size>\n', '6\n', '\n', '> <smallest_ring_system_size>\n', '1\n', '\n', '> <asymmetric_atoms>\n', '5\n', '\n', '> <stereogenic_tetrahedral_atoms>\n', '5\n', '\n', '> <maximal_proj_area>\n', '113.32\n', '\n', '> <maximal_proj_radius>\n', '8.67\n', '\n', '> <minimal_proj_area>\n', '62.18\n', '\n', '> <minimal_proj_radius>\n', '6.25\n', '\n', '$$$$\n', '$$$$\n', '$$$$\n', '$$$$\n', '$$$$\n', '$$$$\n', 'SN00134795\n', ' MOLSOFT 05232012283D\n', '\n', ' 9 8 0 0 1 0 0 0 0 0999 V2000\n', ' -0.5798 5.2074 2.1007 C 0 0 0\n', ' -1.6443 4.1066 2.2683 C 0 0 0\n', ' -1.0374 2.8112 2.1307 O 0 0 0\n', ' -1.3709 1.8071 1.2614 C 0 0 0\n', ' -2.3204 1.9459 0.4526 O 0 0 0\n', ' -0.6004 0.5370 1.2892 C 0 0 2\n', ' 0.0641 0.2907 -0.0115 C 0 0 0\n', ' 1.3023 0.1233 -0.0914 O 0 0 0\n', ' -1.4899 -0.6779 1.6621 C 0 0 0\n', ' 1 2 1\n', ' 2 3 1\n', ' 3 4 1\n', ' 4 5 2\n', ' 6 4 1\n', ' 6 9 1\n', ' 6 7 1\n', ' 7 8 2\n', 'M END\n', '> <Molecule Name>\n', 'SN00134795\n', '\n', '> <128>\n', '-2.590631e+18\n', '\n', '> <SNID>\n', 'SN00134795\n', '\n', '> <Molecule (InChI Key)>\n', 'VVCYNVCCODBCOE-YFKPBYRVSA-N\n', '\n', '> <Aromatic Atoms Count>\n', '0\n', '\n', '> <Aromatic Bonds Count>\n', '0\n', '\n', '> <Bond Count>\n', '8\n', '\n', '> <Hydrogen Bond Acceptors>\n', '3\n', '\n', '> <Hydrogen Bond Donors>\n', '0\n', '\n', '> <Rotatable Bonds Count>\n', '6\n', '\n', '> <Topological Polar Surface Area>\n', '43.37\n', '\n', '> <Molecular Weight>\n', '130.062994\n', '\n', '> <Molecular Formula>\n', 'C6H10O3\n', '\n', '> <Heavy Atoms Count>\n', '9\n', '\n', '> <IUPAC_name>\n', 'ethyl (2S)-2-methyl-3-oxopropanoate\n', '\n', '> <mass>\n', '130.14\n', '\n', '> <empirical_formula>\n', 'C6H10O3\n', '\n', '> <atomic_composition_%w>\n', 'C (55.37%) - H (7.74%) - O (36.88%)\n', '\n', '> <formal_charge>\n', '0\n', '\n', '> <H_bonds_acceptors>\n', '2\n', '\n', '> <H_bonds_donors>\n', '0\n', '\n', '> <logP>\n', '0.56\n', '\n', '> <topological_PSA>\n', '43.37\n', '\n', '> <molecular_polarizability>\n', '12.68\n', '\n', '> <molecular_polarizability_avg>\n', '13.16\n', '\n', '> <rotatable_bonds>\n', '4\n', '\n', '> <SMILES_format>\n', 'CCOC(=O)[C@@H]©C=O\n', '\n', '> <logD>\n', '0.00\t 0.56\n', ' 1.00\t 0.56\n', ' 2.00\t 0.56\n', ' 3.00\t 0.56\n', ' 4.00\t 0.56\n', ' 5.00\t 0.56\n', ' 6.00\t 0.56\n', ' 7.00\t 0.55\n', ' 8.00\t 0.55\n', ' 9.00\t 0.46\n', ' 10.00\t 0.02\n', ' 11.00\t -0.78\n', ' 12.00\t -1.37\n', ' 13.00\t -1.52\n', ' 14.00\t -1.54\n', '\n', '> <Balaba_index>\n', '3.32\n', '\n', '> <Harary_index>\n', '17.77\n', '\n', '> <hyper_Wiener_index>\n', '215\n', '\n', '> <Platt_index>\n', '1..............................

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Reading large crapy text file in anaconda to profile data	syamatunuguntla	0	1,373	Nov-18-2022, 06:15 PM Last Post: syamatunuguntla
	Chunking and Sorting a large file	Robotguy	1	4,563	Jul-29-2020, 12:48 AM Last Post: Larz60+
	XLSX file with multiple sheets to josn file	ovidius	2	3,179	Apr-05-2020, 09:22 AM Last Post: ovidius
	Sorting a large CVS file	DavidTheGrockle	1	2,678	Oct-31-2019, 12:32 PM Last Post: ichabod801
	large csv to many xlsx containing multiple tabs	thatIsTheCase	3	5,565	Nov-27-2018, 02:34 PM Last Post: thatIsTheCase
	Searching a .txt file for a specific number and extracting the corresponding data	nrozman	3	3,980	Jul-27-2018, 02:07 PM Last Post: nrozman
	How to filter specific rows from large data file	Ariane	7	9,682	Jun-29-2018, 02:43 PM Last Post: gontajones
	access a very large file? As an array or as a dataframe?	Angelika	5	6,109	May-18-2017, 08:15 AM Last Post: Angelika

extracting sublist from a large multiple molecular file

User Panel Messages

Announcements