Python Forum

Full Version: Need help for a python script to extract information from a list of files
I am new to Python and need help making a script for this task. I am running a modified Autodock program and need to compile the results.

I have a folder that contains hundreds of *.pdbqt files named "compound_1.pdbqt", "compound_2.pdbqt", etc.

Each file has a structure like this:

Output:
MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK 11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK 11 active torsions:
#Lots of text here
#Repeat with 10 to 20 models
I want to use a Python 3 script to extract the "MODEL", "minimizedAffinity", "CNNscore", and "CNNaffinity" of each and every compound in the folder into a delimited text file that looks like this:

Output:
Compound  Model  minimizedAffinity  CNNscore     CNNaffinity
1         1      -7.11687565        0.573647082  5.82644749
1         2      -6.61898327        0.55260396   5.86855984
Currently I am stuck at this script:

#! /usr/bin/env python

import glob

# strings to look for in each file
word1 = 'MODEL'
word2 = 'minimizedAffinity'
word3 = 'CNNscore'
word4 = 'CNNaffinity'

files = glob.glob('**/*.pdbqt', recursive=True)
for file in files:
    print(file)
    with open(file) as fp:
        for line in fp:
            # check if the string is present on the current line
            if word1 in line:
                print('Line:', line)
            if word2 in line:
                print('Line:', line)
            if word3 in line:
                print('Line:', line)
            if word4 in line:
                print('Line:', line)
Really appreciate any help.

Thank you very much.
Maybe the following snippet can help you.

I have assumed here that:
  • you loop over a list of files to get the compound number (see the CurrentFile variable)
  • the pattern is always the same
  • the files can be huge, so I prefer to use readline() instead of readlines()

Of course it can be improved to suit your needs.

Hope it helps a bit.

import re, os

Path = os.getcwd()

Results = ['Compound', 'Model', 'minimizedAffinity', 'CNNscore', 'CNNaffinity']

CurrentFile = 'compound_1.pdbqt'
Compound = int(re.search(r"compound_(\d+)", CurrentFile)[1])


with open(os.path.join(Path, CurrentFile), 'r') as data:

    while True:
        Line = data.readline()
        if Line == '':
            break

        if 'MODEL' in Line:
            Model = int(re.search(r"MODEL\s+(\d+)", Line)[1])

            # the three REMARK lines are assumed to follow MODEL in this exact order
            Line = data.readline()[:-1]
            minimizedAffinity = float(Line.split(" ")[2])

            Line = data.readline()[:-1]
            CNNscore = float(Line.split(" ")[2])

            Line = data.readline()[:-1]
            CNNaffinity = float(Line.split(" ")[2])

            Results.append([Compound, Model, minimizedAffinity, CNNscore, CNNaffinity])
            del Model, minimizedAffinity, CNNscore, CNNaffinity
Output:
Results = ['Compound', 'Model', 'minimizedAffinity', 'CNNscore', 'CNNaffinity', [1, 1, -7.11687565, 0.573647082, 5.82644749], [1, 2, -6.61898327, 0.55260396, 5.86855984]]
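None of this writes the delimited text file the OP asked for yet; a minimal sketch with the csv module, assuming Results has the shape shown in the output above (the five header fields first, then one list per model):

```python
import csv

# Assumed shape, taken from the output above: header fields followed by row lists
Results = ['Compound', 'Model', 'minimizedAffinity', 'CNNscore', 'CNNaffinity',
           [1, 1, -7.11687565, 0.573647082, 5.82644749],
           [1, 2, -6.61898327, 0.55260396, 5.86855984]]

with open('results.txt', 'w', newline='') as out:
    writer = csv.writer(out, delimiter='\t')
    writer.writerow(Results[:5])   # header row
    writer.writerows(Results[5:])  # one row per model
```

Swapping delimiter for ',' or ' ' gives any other delimited format.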
Is it safe to assume that when there is a line starting with MODEL, it will always be followed by at least 3 lines starting with REMARK for minimizedAffinity, CNNscore and CNNaffinity, always in that order? Then one more REMARK line and then multiple lines, until the next MODEL?
@buran: I agree; that's why I've been speaking about the "pattern".

The snippet can easily be adapted, and it should probably be improved if we are dealing with huge data (append is not the best tool).
@paul18fr - my question was to the OP. I may do things a bit differently from you, but it is virtually the same approach if this pattern is certain.
import re
from io import StringIO


file = StringIO(
"""MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK  11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK  11 active torsions:
#Lots of text here
#Repeat with 10 to 20 model"""
)
pattern = re.compile(r"REMARK (minimizedAffinity|CNNscore|CNNaffinity) (\S+)")
data = {}
model = None

for line in file:
    if line.startswith("MODEL"):
        model = {}
        data[int(line.split()[1])] = model
    elif model is not None:
        match = pattern.match(line)
        if match:
            model[match.group(1)] = float(match.group(2))

print(*data.items(), sep="\n")
Output:
(1, {'minimizedAffinity': -7.11687565, 'CNNscore': 0.573647082, 'CNNaffinity': 5.82644749})
(2, {'minimizedAffinity': -6.61898327, 'CNNscore': 0.55260396, 'CNNaffinity': 5.86855984})
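The snippet above parses a single in-memory file; to cover a whole folder and keep the Compound column the OP wants, one possible sketch (the parse_folder helper is an assumption, and the compound_N.pdbqt naming is taken from the first post):

```python
import re
from pathlib import Path

# Matches the three REMARK lines of interest, as in the snippet above
REMARK = re.compile(r"REMARK (minimizedAffinity|CNNscore|CNNaffinity) (\S+)")

def parse_folder(folder):
    """Collect one dict per MODEL from every compound_*.pdbqt under folder."""
    rows = []
    for path in sorted(Path(folder).glob('**/compound_*.pdbqt')):
        compound = int(re.search(r'compound_(\d+)', path.name)[1])
        row = None
        with path.open() as file:
            for line in file:
                if line.startswith('MODEL'):
                    row = {'Compound': compound, 'Model': int(line.split()[1])}
                    rows.append(row)
                elif row is not None:
                    match = REMARK.match(line)
                    if match:
                        row[match.group(1)] = float(match.group(2))
    return rows
```

Each dict in the returned list is one output row, ready to be written out with csv.DictWriter.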
The structure he wants looks like a tabular structure, so I would use Pandas here.
deanhystad's way is OK, but it is not a tabular structure, and there will be a lot of repetition of names when the files are big.

To show an example of what I mean, here I read a couple of test files in the current folder and take the result into Pandas.
from pathlib import Path
import pandas as pd
import re

result = []
path = Path(".")
pattern = re.compile(r'minimizedAffinity\s(-?\d+\.\d+)\n.*CNNscore\s(\d+\.\d+)\n.*CNNaffinity\s(\d+\.\d+)')
for file_path in path.glob("*.pdbqt"):
    with file_path.open('r') as file:
        text = file.read()
    for match in pattern.finditer(text):
        print(match.groups())
        result.append(match.groups())

# Create DataFrame
df = pd.DataFrame(result, columns=['minimizedAffinity', 'CNNscore', 'CNNaffinity'])
Output:
('-7.11687565', '0.573647082', '5.82644749')
('-6.61898327', '0.55260396', '5.86855984')
('-7.7777777', '0.000000', '7.8888')
('-6.77777', '0.111111', '5.5555')
If look at the DataFrame now.
>>> df
  minimizedAffinity     CNNscore CNNaffinity
0       -7.11687565  0.573647082  5.82644749
1       -6.61898327   0.55260396  5.86855984
2        -7.7777777     0.000000      7.8888
3          -6.77777     0.111111      5.5555
>>> df.CNNaffinity.max()
'7.8888'
So the index could serve as Model, since it will enumerate automatically when all the files are read.
The data will also be more useful if you want to do something with it, e.g. finding the max in the CNNaffinity column as shown.
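One caveat with the DataFrame above: the regex captures strings, so the columns have object dtype and df.CNNaffinity.max() is comparing text, not numbers (hence the quoted '7.8888' in the output). A small follow-up sketch, assuming the same column names, converts to float before any numeric work and writes out the delimited file:

```python
import pandas as pd

# Same shape as the result above: the regex captured strings, not numbers
result = [('-7.11687565', '0.573647082', '5.82644749'),
          ('-6.61898327', '0.55260396', '5.86855984')]
df = pd.DataFrame(result, columns=['minimizedAffinity', 'CNNscore', 'CNNaffinity'])
df = df.astype(float)               # numeric dtype, so max/min/mean behave correctly
df.index.name = 'Model'
df.to_csv('results.txt', sep='\t')  # the tab-delimited file the OP asked for
```

With only two rows the string max happens to agree with the numeric max, but on real data lexicographic comparison of strings like '-7.1' and '-6.6' gives the wrong answer, so the astype(float) step matters.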