Python Forum
Need help for a python script to extract information from a list of files
#1
I am new to Python and need help making a script for this task. I am running a modified AutoDock program and need to compile the results.

I have a folder that contains hundreds of *.pdbqt files named "compound_1.pdbqt", "compound_2.pdbqt", etc.

Each file has a structure like this:

Output:
MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK 11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK 11 active torsions:
#Lots of text here
#Repeat with 10 to 20 models
I want to use a Python 3 script to extract the "MODEL", "minimizedAffinity", "CNNscore", and "CNNaffinity" of each and every compound in the folder into a delimited text file that looks like this:

Output:
Compound Model minimizedAffinity CNNscore CNNaffinity
1 1 -7.11687565 0.573647082 5.82644749
1 2 -6.61898327 0.55260396 5.86855984
Currently I am stuck with this script:

#! /usr/bin/env python

import glob

# keywords whose lines we want to capture
words = ('MODEL', 'minimizedAffinity', 'CNNscore', 'CNNaffinity')

files = glob.glob('**/*.pdbqt', recursive=True)
for file in files:
    print(file)
    with open(file) as fp:
        # check if any keyword is present on the current line
        for line in fp:
            if any(word in line for word in words):
                print('Line:', line)
Really appreciate any help.

Thank you very much.
#2
Maybe the following snippet will help you.

It assumes that:
  • you loop over a list of files to get the compound number (see the CurrentFile variable)
  • the pattern is always the same
  • the file can be huge, so I prefer readline() over readlines()

Of course it can be improved to fit your needs.

Hope it helps a bit.

import re, os

Path = os.getcwd()

Results = ['Compound', 'Model', 'minimizedAffinity', 'CNNscore', 'CNNaffinity']

CurrentFile = 'compound_1.pdbqt'
Compound = int(re.search(r"compound_(\d+)", CurrentFile)[1])


with open(os.path.join(Path, CurrentFile), 'r') as data:

    while True:
        Line = data.readline()
        if Line == '':
            break

        if 'MODEL' in Line:
            Model = int(re.search(r"MODEL\s(\d+)", Line)[1])

            # the three REMARK lines are assumed to follow MODEL in this order
            Line = data.readline()[:-1]
            minimizedAffinity = float(re.split(" ", Line)[2])

            Line = data.readline()[:-1]
            CNNscore = float(re.split(" ", Line)[2])

            Line = data.readline()[:-1]
            CNNaffinity = float(re.split(" ", Line)[2])

            Results.append([Compound, Model, minimizedAffinity, CNNscore, CNNaffinity])
            del Model, minimizedAffinity, CNNscore, CNNaffinity
Output:
Results = ['Compound', 'Model', 'minimizedAffinity', 'CNNscore', 'CNNaffinity', [1, 1, -7.11687565, 0.573647082, 5.82644749], [1, 2, -6.61898327, 0.55260396, 5.86855984]]
#3
Is it safe to assume that when there is a line starting with MODEL, it will always be followed by at least 3 lines starting with REMARK for minimizedAffinity, CNNscore and CNNaffinity, always in that order? Then one more REMARK line and then multiple lines, until the next MODEL?
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#4
@buran: I agree; that's why I've been speaking about the "pattern".

The snippet can easily be adapted, and it should probably be improved if we are dealing with huge data (append is not the best tool)
#5
@paul18fr - my question was directed at the OP. I may do things a bit differently from you, but it's virtually the same approach if this pattern is certain.
#6
import re
from io import StringIO


file = StringIO(
"""MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK  11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK  11 active torsions:
#Lots of text here
#Repeat with 10 to 20 model"""
)
pattern = re.compile(r"REMARK (minimizedAffinity|CNNscore|CNNaffinity) (\S+)")
data = {}
model = None

for line in file:
    if line.startswith("MODEL"):
        model = {}
        data[int(line.split()[1])] = model
    elif model is not None:
        match = pattern.match(line)
        if match:
            model[match.group(1)] = float(match.group(2))

print(*data.items(), sep="\n")
Output:
(1, {'minimizedAffinity': -7.11687565, 'CNNscore': 0.573647082, 'CNNaffinity': 5.82644749})
(2, {'minimizedAffinity': -6.61898327, 'CNNscore': 0.55260396, 'CNNaffinity': 5.86855984})
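If the goal is the delimited file from the question, the {model: {field: value}} dict built above can be flattened into text rows. A small sketch on top of that output shape (the compound number 1 and the tab separator are just for illustration):

```python
def to_rows(compound, data):
    """Turn a {model: {field: value}} dict into delimited text rows
    (tab-separated here; adjust the separator as needed)."""
    cols = ('minimizedAffinity', 'CNNscore', 'CNNaffinity')
    for model, fields in sorted(data.items()):
        yield '\t'.join([str(compound), str(model)] + [str(fields[c]) for c in cols])

# Example with the parsed data shown above.
data = {
    1: {'minimizedAffinity': -7.11687565, 'CNNscore': 0.573647082, 'CNNaffinity': 5.82644749},
    2: {'minimizedAffinity': -6.61898327, 'CNNscore': 0.55260396, 'CNNaffinity': 5.86855984},
}
for row in to_rows(1, data):
    print(row)
```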
#7
The structure he wants looks like a tabular structure, so I would use Pandas here.
deanhystad's way is ok, but it is not a tabular structure, so there will be a lot of repetition of names when the files are big.

To show an example of what I mean, here I read a couple of test files in the current folder and take the result into Pandas.
from pathlib import Path
import pandas as pd
import re

result = []
path = Path(".")
pattern = re.compile(r'minimizedAffinity\s(-?\d+\.\d+)\n.*CNNscore\s(\d+\.\d+)\n.*CNNaffinity\s(\d+\.\d+)')
for file_path in path.glob("*"):
    with file_path.open('r') as file:
        text = file.read()
    for match in pattern.finditer(text):
        print(match.groups())
        result.append(match.groups())

# Create DataFrame
df = pd.DataFrame(result, columns=['minimizedAffinity', 'CNNscore', 'CNNaffinity'])
Output:
('-7.11687565', '0.573647082', '5.82644749')
('-6.61898327', '0.55260396', '5.86855984')
('-7.7777777', '0.000000', '7.8888')
('-6.77777', '0.111111', '5.5555')
If look at the DataFrame now.
>>> df
  minimizedAffinity     CNNscore CNNaffinity
0       -7.11687565  0.573647082  5.82644749
1       -6.61898327   0.55260396  5.86855984
2        -7.7777777     0.000000      7.8888
3          -6.77777     0.111111      5.5555
>>> df.CNNaffinity.max()
'7.8888'
So the index could be Model, as it will enumerate automatically when all files are read.
The data will also be more useful if you want to do something with it, e.g. find the max in the CNNaffinity column as shown.
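One caveat worth noting: the regex captures are strings, which is why df.CNNaffinity.max() returns '7.8888' rather than a float; here the string ordering happens to agree with the numeric one, but it won't in general. A minimal sketch of the cast, reusing the same column names and the values from the output above:

```python
import pandas as pd

# The tuples captured by the regex are strings, mirroring the output above.
result = [('-7.11687565', '0.573647082', '5.82644749'),
          ('-6.61898327', '0.55260396', '5.86855984'),
          ('-7.7777777', '0.000000', '7.8888'),
          ('-6.77777', '0.111111', '5.5555')]
df = pd.DataFrame(result, columns=['minimizedAffinity', 'CNNscore', 'CNNaffinity'])

# Cast every column to float so comparisons and aggregations are numeric.
df = df.astype(float)
print(df.CNNaffinity.max())
```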