Posts: 1
Threads: 1
Joined: Jun 2023
Jun-12-2023, 02:19 AM
(This post was last modified: Jun-12-2023, 02:19 AM by lephunghien.)
I am new to Python and need help writing a script for the following task. I am running a modified Autodock program and need to compile the results.
I have a folder that contains hundreds of *.pdbqt files named "compound_1.pdbqt", "compound_2.pdbqt", etc.
Each file has a structure like this:
Output: MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK 11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK 11 active torsions:
#Lots of text here
#Repeat with 10 to 20 model
I want to use a Python 3 script to extract the "MODEL", "minimizedAffinity", "CNNscore", and "CNNaffinity" of each and every compound in the folder into a delimited text file that looks like this:
Output: Compound Model minimizedAffinity CNNscore CNNaffinity
1 1 -7.11687565 0.573647082 5.82644749
1 2 -6.61898327 0.55260396 5.86855984
Currently I am stuck with this script:
#! /usr/bin/env python
import sys
import glob

files = glob.glob('**/*.pdbqt', recursive=True)

word1 = 'MODEL'
word2 = 'minimizedAffinity'
word3 = 'CNNscore'
word4 = 'CNNaffinity'

for file in files:
    print(file)
    with open(file) as fp:
        # read all lines into a list
        lines = fp.readlines()
        for line in lines:
            # check if a keyword is present on the current line
            if line.find(word1) != -1:
                print('Line:', line)
            if line.find(word2) != -1:
                print('Line:', line)
            if line.find(word3) != -1:
                print('Line:', line)
            if line.find(word4) != -1:
                print('Line:', line)
Really appreciate any help.
Thank you very much.
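For reference, one possible sketch of the whole task. It assumes every MODEL block is always followed by the three REMARK lines in that order, and that filenames follow the compound_N.pdbqt pattern; the names parse_models and results.tsv are made up here:

```python
import csv
import glob
import re

# Matches only the three REMARK lines we care about
REMARK_RE = re.compile(r"REMARK (minimizedAffinity|CNNscore|CNNaffinity) (\S+)")

def parse_models(text):
    """Return a list of (model, minimizedAffinity, CNNscore, CNNaffinity)."""
    rows, model, values = [], None, {}
    for line in text.splitlines():
        if line.startswith("MODEL"):
            model = int(line.split()[1])
            values = {}
            continue
        m = REMARK_RE.match(line)
        if m:
            values[m.group(1)] = float(m.group(2))
            if len(values) == 3:
                rows.append((model, values["minimizedAffinity"],
                             values["CNNscore"], values["CNNaffinity"]))
    return rows

if __name__ == "__main__":
    with open("results.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["Compound", "Model",
                         "minimizedAffinity", "CNNscore", "CNNaffinity"])
        for name in glob.glob("compound_*.pdbqt"):
            compound = int(re.search(r"compound_(\d+)", name)[1])
            with open(name) as fp:
                for row in parse_models(fp.read()):
                    writer.writerow([compound, *row])
```

This is only a sketch against the sample shown above, not tested on real Autodock output.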
Posts: 299
Threads: 72
Joined: Apr 2019
Jun-12-2023, 07:26 AM
(This post was last modified: Jun-12-2023, 07:26 AM by paul18fr.)
Maybe the following snippet will help you.
It assumes that:
- you loop over a list of files to get the compound number (see the CurrentFile variable)
- the pattern is always the same
- the files can be huge, so I prefer readline() over readlines()
Of course it can be improved to fit your needs.
Hope it helps a bit.
import re, os

Path = str(os.getcwd())
Results = ['Compound', 'Model', 'minimizedAffinity', 'CNNscore', 'CNNaffinity']
CurrentFile = 'compound_1.pdbqt'
Compound = int(re.search(r"compound_(\d+)", CurrentFile)[1])

with open(Path + '/' + CurrentFile, 'r') as data:
    while True:
        Line = data.readline()
        if Line == '':
            break
        if 'MODEL' in Line:
            Model = int(re.search(r"MODEL\s(\d+)", Line)[1])
            Line = data.readline()[:-1]
            minimizedAffinity = float(re.split(" ", Line)[2])
            Line = data.readline()[:-1]
            CNNscore = float(re.split(" ", Line)[2])
            Line = data.readline()[:-1]
            CNNaffinity = float(re.split(" ", Line)[2])
            Results.append([Compound, Model, minimizedAffinity, CNNscore, CNNaffinity])
            del Model, minimizedAffinity, CNNscore, CNNaffinity
Output: Results = ['Compound', 'Model', 'minimizedAffinity', 'CNNscore', 'CNNaffinity',
[1, 1, -7.11687565, 0.573647082, 5.82644749],
[1, 2, -6.61898327, 0.55260396, 5.86855984]]
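One detail worth noting when building that list of files: with hundreds of compounds, a plain alphabetical sort puts compound_10 before compound_2, so sorting by the extracted number keeps the results in compound order. A small sketch (the helper name compound_number is made up):

```python
import re

def compound_number(filename):
    """Extract N from names like 'compound_12.pdbqt'."""
    return int(re.search(r"compound_(\d+)", filename)[1])

# Alphabetical order would give compound_1, compound_10, compound_2, ...
names = ["compound_10.pdbqt", "compound_2.pdbqt", "compound_1.pdbqt"]
ordered = sorted(names, key=compound_number)
print(ordered)
```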
Posts: 8,151
Threads: 160
Joined: Sep 2016
Is it safe to assume that when there is a line starting with MODEL, it will always be followed by at least three lines starting with REMARK, for minimizedAffinity, CNNscore and CNNaffinity, always in that order? Then one more REMARK line, and then multiple lines until the next MODEL?
Posts: 299
Threads: 72
Joined: Apr 2019
@buran: I agree; that's why I was speaking about the "pattern".
The snippet can easily be adapted, and it should probably be improved if we are dealing with huge data (append is not the best tool).
Posts: 8,151
Threads: 160
Joined: Sep 2016
@paul18fr - my question was aimed at the OP. I may do things a bit differently from you, but it's virtually the same approach if this pattern is certain.
Posts: 6,778
Threads: 20
Joined: Feb 2020
Jun-12-2023, 05:06 PM
(This post was last modified: Jun-12-2023, 05:08 PM by deanhystad.)
import re
from io import StringIO

file = StringIO(
    """MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK 11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK 11 active torsions:
#Lots of text here
#Repeat with 10 to 20 model"""
)
pattern = re.compile(r"REMARK (minimizedAffinity|CNNscore|CNNaffinity) (\S+)")
data = {}
model = None
for line in file:
    if line.startswith("MODEL"):
        model = {}
        data[int(line.split()[1])] = model
    elif model is not None:
        match = pattern.match(line)
        if match:
            model[match.group(1)] = float(match.group(2))
print(*data.items(), sep="\n")
Output: (1, {'minimizedAffinity': -7.11687565, 'CNNscore': 0.573647082, 'CNNaffinity': 5.82644749})
(2, {'minimizedAffinity': -6.61898327, 'CNNscore': 0.55260396, 'CNNaffinity': 5.86855984})
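If you then want that dict in the delimited layout the OP asked for, a csv.writer pass over the same data could look like this. Writing to a StringIO is just for illustration (a real script would open a file), and the Compound column is left out because this snippet parses a single text:

```python
import csv
from io import StringIO

# The dict-of-dicts shape produced above, copied as literal sample data
data = {
    1: {'minimizedAffinity': -7.11687565, 'CNNscore': 0.573647082,
        'CNNaffinity': 5.82644749},
    2: {'minimizedAffinity': -6.61898327, 'CNNscore': 0.55260396,
        'CNNaffinity': 5.86855984},
}

out = StringIO()
writer = csv.writer(out, delimiter="\t")
writer.writerow(["Model", "minimizedAffinity", "CNNscore", "CNNaffinity"])
for model, values in data.items():
    writer.writerow([model, values["minimizedAffinity"],
                     values["CNNscore"], values["CNNaffinity"]])
print(out.getvalue())
```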
Posts: 7,312
Threads: 123
Joined: Sep 2016
Jun-12-2023, 05:40 PM
(This post was last modified: Jun-12-2023, 05:40 PM by snippsat.)
The structure he wants looks tabular, so I would use Pandas here.
deanhystad's way is okay, but it is not a tabular structure, so there will be a lot of repetition of names when the files are big.
To show an example of what I mean, here I read a couple of test files in the current folder and load the result into Pandas.
from pathlib import Path
import pandas as pd
import re

result = []
path = Path(".")
pattern = re.compile(r'minimizedAffinity\s(-?\d+\.\d+)\n.*CNNscore\s(\d+\.\d+)\n.*CNNaffinity\s(\d+\.\d+)')
for file_path in path.glob("*"):
    with file_path.open('r') as file:
        text = file.read()
        for match in pattern.finditer(text):
            print(match.groups())
            result.append(match.groups())

# Create DataFrame
df = pd.DataFrame(result, columns=['minimizedAffinity', 'CNNscore', 'CNNaffinity'])
Output: ('-7.11687565', '0.573647082', '5.82644749')
('-6.61898327', '0.55260396', '5.86855984')
('-7.7777777', '0.000000', '7.8888')
('-6.77777', '0.111111', '5.5555')
If we look at the DataFrame now:
>>> df
minimizedAffinity CNNscore CNNaffinity
0 -7.11687565 0.573647082 5.82644749
1 -6.61898327 0.55260396 5.86855984
2 -7.7777777 0.000000 7.8888
3 -6.77777 0.111111 5.5555
>>> df.CNNaffinity.max()
'7.8888'
So the index could be used as Model, since it enumerates automatically when all the files are read.
Also, the data will be more useful if you need to do something with it, e.g. finding the max of the CNNaffinity column as shown.
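One caveat with the regex approach: the captured groups are strings, so df.CNNaffinity.max() compares text rather than numbers. Converting the columns to float first avoids that; a small sketch using the first two rows of the sample data above:

```python
import pandas as pd

# finditer returns strings, so the DataFrame columns start as object dtype
result = [('-7.11687565', '0.573647082', '5.82644749'),
          ('-6.61898327', '0.55260396', '5.86855984')]
df = pd.DataFrame(result, columns=['minimizedAffinity', 'CNNscore', 'CNNaffinity'])
df = df.astype(float)          # or pd.to_numeric(df[col]) per column
best = df.CNNaffinity.max()    # now a float, compared numerically
```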