Python Forum
Need help for a python script to extract information from a list of files
#1
I am new to Python and need help making a script for this task. I am running a modified AutoDock program and need to compile the results.

I have a folder that contains hundreds of *.pdbqt files named "compound_1.pdbqt", "compound_2.pdbqt", etc.

Each file has a structure like this:

Output:
MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK 11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK 11 active torsions:
#Lots of text here
#Repeat with 10 to 20 models
I want to use a Python 3 script to extract the "MODEL", "minimizedAffinity", "CNNscore", and "CNNaffinity" of each and every compound in the folder into a delimited text file that looks like this:

Output:
Compound Model minimizedAffinity CNNscore CNNaffinity
1 1 -7.11687565 0.573647082 5.82644749
1 2 -6.61898327 0.55260396 5.86855984
Currently I am stuck with this script:

#! /usr/bin/env python

import glob

# keywords whose lines we want to capture
words = ('MODEL', 'minimizedAffinity', 'CNNscore', 'CNNaffinity')

files = glob.glob('**/*.pdbqt', recursive=True)
for file in files:
    print(file)
    with open(file) as fp:
        # check if any keyword is present on the current line
        for line in fp:
            if any(word in line for word in words):
                print('Line:', line)
Really appreciate any help.

Thank you very much.
#2
Maybe the following snippet will help you.

It assumes that:
  • you loop over a list of files to get the compound number (see the CurrentFile variable)
  • the pattern is always the same
  • the file can be huge, so I prefer readline() over readlines()

Of course it can be improved to fit your needs.

Hope it helps a bit.

import re, os

Path = os.getcwd()

Results = ['Compound', 'Model', 'minimizedAffinity', 'CNNscore', 'CNNaffinity']

CurrentFile = 'compound_1.pdbqt'
Compound = int(re.search(r"compound_(\d+)", CurrentFile)[1])


with open(os.path.join(Path, CurrentFile), 'r') as data:

    while True:
        Line = data.readline()
        if Line == '':
            break

        if 'MODEL' in Line:
            Model = int(re.search(r"MODEL\s(\d+)", Line)[1])

            # the three REMARK lines are assumed to follow MODEL in this order
            Line = data.readline()[:-1]
            minimizedAffinity = float(re.split(" ", Line)[2])

            Line = data.readline()[:-1]
            CNNscore = float(re.split(" ", Line)[2])

            Line = data.readline()[:-1]
            CNNaffinity = float(re.split(" ", Line)[2])

            Results.append([Compound, Model, minimizedAffinity, CNNscore, CNNaffinity])
            del Model, minimizedAffinity, CNNscore, CNNaffinity
Output:
Results = ['Compound', 'Model', 'minimizedAffinity', 'CNNscore', 'CNNaffinity', [1, 1, -7.11687565, 0.573647082, 5.82644749], [1, 2, -6.61898327, 0.55260396, 5.86855984]]
#3
Is it safe to assume that when there is a line starting with MODEL, it will always be followed by at least 3 lines starting with REMARK for minimizedAffinity, CNNscore and CNNaffinity, always in that order? Then one more REMARK line and then multiple lines, until the next MODEL?
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#4
@buran: I agree; that's why I've been speaking about the "pattern".

The snippet can easily be adapted, and it should probably be improved if we are dealing with huge data (append is not the best tool)
#5
@paul18fr - my question was directed at the OP. I may do things a bit differently from you, but it's virtually the same approach if this pattern is certain.
#6
import re
from io import StringIO


file = StringIO(
"""MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK  11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK  11 active torsions:
#Lots of text here
#Repeat with 10 to 20 model"""
)
pattern = re.compile(r"REMARK (minimizedAffinity|CNNscore|CNNaffinity) (\S+)")
data = {}
model = None

for line in file:
    if line.startswith("MODEL"):
        model = {}
        data[int(line.split()[1])] = model
    elif model is not None:
        match = pattern.match(line)
        if match:
            model[match.group(1)] = float(match.group(2))

print(*data.items(), sep="\n")
Output:
(1, {'minimizedAffinity': -7.11687565, 'CNNscore': 0.573647082, 'CNNaffinity': 5.82644749})
(2, {'minimizedAffinity': -6.61898327, 'CNNscore': 0.55260396, 'CNNaffinity': 5.86855984})
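If the goal is the delimited file from the question, the {model: {field: value}} dict built above can be flattened into text rows. A small sketch on top of that output shape (the compound number 1 and the tab separator are just for illustration):

```python
def to_rows(compound, data):
    """Turn a {model: {field: value}} dict into delimited text rows
    (tab-separated here; adjust the separator as needed)."""
    cols = ('minimizedAffinity', 'CNNscore', 'CNNaffinity')
    for model, fields in sorted(data.items()):
        yield '\t'.join([str(compound), str(model)] + [str(fields[c]) for c in cols])

# Example with the parsed data shown above.
data = {
    1: {'minimizedAffinity': -7.11687565, 'CNNscore': 0.573647082, 'CNNaffinity': 5.82644749},
    2: {'minimizedAffinity': -6.61898327, 'CNNscore': 0.55260396, 'CNNaffinity': 5.86855984},
}
for row in to_rows(1, data):
    print(row)
```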
#7
The structure he wants looks like a tabular structure, so I would use Pandas here.
deanhystad's way is ok, but it is not a tabular structure, so there will be a lot of repetition of names when the files are big.

To show an example of what I mean, here I read a couple of test files in the current folder and take the result into Pandas.
from pathlib import Path
import pandas as pd
import re

result = []
path = Path(".")
pattern = re.compile(r'minimizedAffinity\s(-?\d+\.\d+)\n.*CNNscore\s(\d+\.\d+)\n.*CNNaffinity\s(\d+\.\d+)')
for file_path in path.glob("*"):
    with file_path.open('r') as file:
        text = file.read()
    for match in pattern.finditer(text):
        print(match.groups())
        result.append(match.groups())

# Create DataFrame
df = pd.DataFrame(result, columns=['minimizedAffinity', 'CNNscore', 'CNNaffinity'])
Output:
('-7.11687565', '0.573647082', '5.82644749')
('-6.61898327', '0.55260396', '5.86855984')
('-7.7777777', '0.000000', '7.8888')
('-6.77777', '0.111111', '5.5555')
If look at the DataFrame now.
>>> df
  minimizedAffinity     CNNscore CNNaffinity
0       -7.11687565  0.573647082  5.82644749
1       -6.61898327   0.55260396  5.86855984
2        -7.7777777     0.000000      7.8888
3          -6.77777     0.111111      5.5555
>>> df.CNNaffinity.max()
'7.8888'
So the index could be Model, as it will enumerate automatically when all files are read.
The data will also be more useful if you want to do something with it, e.g. find the max in the CNNaffinity column as shown.
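One caveat worth noting: the regex captures are strings, which is why df.CNNaffinity.max() returns '7.8888' rather than a float; here the string ordering happens to agree with the numeric one, but it won't in general. A minimal sketch of the cast, reusing the same column names and the values from the output above:

```python
import pandas as pd

# The tuples captured by the regex are strings, mirroring the output above.
result = [('-7.11687565', '0.573647082', '5.82644749'),
          ('-6.61898327', '0.55260396', '5.86855984'),
          ('-7.7777777', '0.000000', '7.8888'),
          ('-6.77777', '0.111111', '5.5555')]
df = pd.DataFrame(result, columns=['minimizedAffinity', 'CNNscore', 'CNNaffinity'])

# Cast every column to float so comparisons and aggregations are numeric.
df = df.astype(float)
print(df.CNNaffinity.max())
```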