Python Forum

Hi,

I recently started learning Python programming and develop my skills by working on some various "script" to help me in my work as biologist.

I have the following file :

Quote: CruSTS5_GC_30000 AUGUSTUS gene 13036 15467 0.24 - . g4
CruSTS5_GC_30000 AUGUSTUS transcript 13036 15467 0.24 - . g4.t1
CruSTS5_GC_30000 AUGUSTUS stop_codon 13036 13038 . - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS terminal 13036 13498 0.57 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS internal 13555 14512 0.97 - 2 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS internal 14722 14816 0.96 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS initial 14953 15467 0.59 - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS intron 13499 13554 1 - . transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS intron 14513 14721 0.81 - . transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS intron 14817 14952 0.99 - . transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 13039 13498 0.57 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 13555 14512 0.97 - 2 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 14722 14816 0.96 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 14953 15467 0.59 - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS start_codon 15465 15467 . - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS gene 15900 17819 0.36 - . g5
CruSTS5_GC_30000 AUGUSTUS transcript 16909 17819 0.19 - . g5.t1
CruSTS5_GC_30000 AUGUSTUS stop_codon 16909 16911 . - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS terminal 16909 17176 0.27 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS internal 17232 17345 0.99 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS internal 17404 17492 1 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS internal 17549 17669 1 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS initial 17728 17819 0.69 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17177 17231 0.99 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17346 17403 1 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17493 17548 1 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17670 17727 1 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 16912 17176 0.27 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17232 17345 0.99 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17404 17492 1 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17549 17669 1 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17728 17819 0.69 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS start_codon 17817 17819 . - 0 transcript_id "g5.t1"; gene_id "g5";

It is a .gtf file with DNA data and I would like to extract some specific data from it.
I have seen there is a script to manage gtf file but for learning purpose, I would like to work on my personal script.

I want to extract lines with "start_codon" and "stop_codon", combine each successive lines, which can be (start_codon + stop_codon or stop_codon + start_codon), then run a small if statement to tell me the orientation (start --> stop or stop <-- start) and generate a small table, potentially as csv file including the name of the gene, its orientation and the position of the stop and start codon.

What I have done as beginner, is to open my .gtf file and treat it as any .txt file, read the different lines, remove the characters that would be a problem (; and ") and convert them as lists, selecting elements that I want to keep based on their index and print out "start_codon and "stop_codon".

import re

DataFile = "MyGTFFile.gtf"

with open (DataFile, "r") as GCData:
    Data = GCData.readlines()
        
    for Lines in Data:
        Lines = Lines.strip() #Remove return to line at the end
        Lines = re.sub("\s+", "\t", Lines) #Replace multiple spaces by a tabulation
        Lines = re.sub(";", "", Lines) #Replace ; by nothing ("")
        Lines = re.sub('"', "", Lines) #Replace " by nothing ("")

        if re.findall(("start_codon|stop_codon"), Lines): #sorting using "|" as "or"
            Lines = Lines.split("\t") #Convert string to list
            IndexToKeep = [2, 3, 4, 9] #List of index to keep
            Lines = [Index for Index in Lines if Lines.index(Index) in IndexToKeep]
            if "start_codon" in Lines:
                StartLine = Lines
                print(StartLine)
            else:
                StopLine = Lines
                print(StopLine)

I have several questions to solve my challenge:

- Am I right in my process and converting them as lists or should I use a different approach?

- With my small script, I would like to combine the successive lists "start_codon" with "stop_codon" or "stop_codon" with "start_codon". I would like to merge them in the order they appear in the file because it would give me the orientation and data with be more easy to analyse after that. I could not find any approach to merge, two outputs obtained each one from a different if statement. What would be the best solution?

- After generating this unique line, am I right if I plan to run several if statements using value index to extract data and generate a summary .csv file?

Thank you in advance.

Is this what you are trying to do?

file = 'MyGTFFile.gtf'

mylist = []

with open(file, 'r') as data:
    lines = data.readlines()
    for line in lines:
        if 'start_codon' or 'stop_codon' in line:
            mylist.append(line.replace('"', '').replace(';', '').strip('\n'))

for item in mylist:
    print(item)

Output:CruSTS5_GC_30000 AUGUSTUS gene 13036 15467 0.24 - . g4
CruSTS5_GC_30000 AUGUSTUS transcript 13036 15467 0.24 - . g4.t1
CruSTS5_GC_30000 AUGUSTUS stop_codon 13036 13038 . - 0 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS terminal 13036 13498 0.57 - 1 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS internal 13555 14512 0.97 - 2 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS internal 14722 14816 0.96 - 1 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS initial 14953 15467 0.59 - 0 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS intron 13499 13554 1 - . transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS intron 14513 14721 0.81 - . transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS intron 14817 14952 0.99 - . transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS CDS 13039 13498 0.57 - 1 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS CDS 13555 14512 0.97 - 2 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS CDS 14722 14816 0.96 - 1 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS CDS 14953 15467 0.59 - 0 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS start_codon 15465 15467 . - 0 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS gene 15900 17819 0.36 - . g5
CruSTS5_GC_30000 AUGUSTUS transcript 16909 17819 0.19 - . g5.t1
CruSTS5_GC_30000 AUGUSTUS stop_codon 16909 16911 . - 0 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS terminal 16909 17176 0.27 - 1 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS internal 17232 17345 0.99 - 1 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS internal 17404 17492 1 - 0 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS internal 17549 17669 1 - 1 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS initial 17728 17819 0.69 - 0 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS intron 17177 17231 0.99 - . transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS intron 17346 17403 1 - . transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS intron 17493 17548 1 - . transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS intron 17670 17727 1 - . transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS CDS 16912 17176 0.27 - 1 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS CDS 17232 17345 0.99 - 1 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS CDS 17404 17492 1 - 0 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS CDS 17549 17669 1 - 1 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS CDS 17728 17819 0.69 - 0 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS start_codon 17817 17819 . - 0 transcript_id g5.t1 gene_id g5

Nevermind, Just seen that I didn't get the specified lines

Now only specified lines are stored in a list

file = 'MyGTFFile.gtf'

mylist = []

with open(file, 'r') as data:
    lines = data.readlines()
    for line in lines:
        if 'start_codon' in line or 'stop_codon' in line:
            mylist.append(line.replace('"', '').replace(';', '').strip('\n'))

for item in mylist:
    print(item)

Output:CruSTS5_GC_30000 AUGUSTUS stop_codon 13036 13038 . - 0 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS start_codon 15465 15467 . - 0 transcript_id g4.t1 gene_id g4
CruSTS5_GC_30000 AUGUSTUS stop_codon 16909 16911 . - 0 transcript_id g5.t1 gene_id g5
CruSTS5_GC_30000 AUGUSTUS start_codon 17817 17819 . - 0 transcript_id g5.t1 gene_id g5

@menator01: there is no actual need to use .readline (and additional memory). One can directly iterate over fileobject (for line in data:). By analyzing underlying data if-condition can be simplified as well: if "_codon" in line:. Printing out list can be done little bit shorter and without repeating 'item': print(*mylist, sep='\n')

Hi menator01,

Thanks you for your help.

With my current script, my output is the following:

Output:['stop_codon', '13036', '13038', 'g4.t1']
['start_codon', '15465', '15467', 'g4.t1']
['stop_codon', '16909', '16911', 'g5.t1']
['start_codon', '17817', '17819', 'g5.t1']
['stop_codon', '18965', '18967', 'g6.t2']
['start_codon', '22307', '22309', 'g6.t2']
['start_codon', '22846', '22848', 'g7.t1']
['stop_codon', '24171', '24173', 'g7.t1']

Now, I would like to combine the first line with the second line and the third line with the fourth line, and so on with all the "start_codon" and "stop_codon" lines to get an output like this:

Output:['stop_codon', '13036', '13038', 'g4.t1', 'start_codon', '15465', '15467', 'g4.t1']
['stop_codon', '16909', '16911', 'g5.t1', 'start_codon', '17817', '17819', 'g5.t1']
['stop_codon', '18965', '18967', 'g6.t2', 'start_codon', '22307', '22309', 'g6.t2']
['start_codon', '22846', '22848', 'g7.t1', 'stop_codon', '24171', '24173', 'g7.t1']

or, merging the identical values (gX.tX)

Output:['stop_codon', '13036', '13038', 'g4.t1', 'start_codon', '15465', '15467']
['stop_codon', '16909', '16911', 'g5.t1', 'start_codon', '17817', '17819']
['stop_codon', '18965', '18967', 'g6.t2', 'start_codon', '22307', '22309']
['start_codon', '22846', '22848', 'g7.t1', 'stop_codon', '24171', '24173']

And the next step would be to generate a small text / output and .csv file based on the list index and get something like:

Output:Your gene g4.t1 is anti-sens (<--), starts at position 15467 and and ends at 13036
Your gene g5.t1 is anti-sens (<--), starts at position 16909 and and ends at 17819
Your gene g6.t1 is anti-sens (<--), starts at position 18965 and and ends at 22309
Your gene g7.t1 is sens (-->), starts at position 22846 and and ends at 24173

As I recently learning with python, based on the structure/formatting of my .gtf file, and how I would like to export them, I thought that formatting my data as list would be the best approach. Am I right? or converting them as string would be more easy to manipulate them for export?

The main challenge I am now facing, is how to append two lines ( start_codon with stop_codon), when each output were obtained from a if loop (if start_codon = StartLine, else it is a StopLine).

I think that in a first time, I would like to try by myself to code the part to generate the text and .csv file but I would definitely appreciate help to merge "start/stop_codon" or "stop/start_codon" when they are both output from the same if statement.

As beginner in Python programming, I am always open to any remarks or ideas, especially on my coding approach.

Thanks again for your help.

(Jan-17-2022, 11:21 AM)fgaascht Wrote: [ -> ]Now, I would like to combine the first line with the second line and the third line with the fourth line, and so on with all the "start_codon" and "stop_codon" lines to get an output like this:

For this is zip() common to use.

>>> new_lst = zip(lst[0::2], lst[1::2])
# Or fancier 
#new_lst = zip(*(iter(lst),) * 2)
>>> new_lst
<zip object at 0x0000020471CE7080>
>>> for item in new_lst:
...     print(item)    
...     
(['stop_codon', '13036', '13038', 'g4.t1'], ['start_codon', '15465', '15467', 'g4.t1'])
(['stop_codon', '16909', '16911', 'g5.t1'], ['start_codon', '17817', '17819', 'g5.t1'])
(['stop_codon', '18965', '18967', 'g6.t2'], ['start_codon', '22307', '22309', 'g6.t2'])
(['start_codon', '22846', '22848', 'g7.t1'], ['stop_codon', '24171', '24173', 'g7.t1'])

So now need to join list together,this can be do with + or extend.

>>> l1 = [1, 2]
>>> l2 = [3, 4]
>>> l1 + l2
[1, 2, 3, 4]
>>> l1.extend(l2)
>>> l1
[1, 2, 3, 4]

(Jan-17-2022, 11:21 AM)fgaascht Wrote: [ -> ]or, merging the identical values (gX.tX)

Like this to preserve the ordering.

>>> lst = ['stop_codon', '13036', '13038', 'g4.t1', 'start_codon', '15465', '15467', 'g4.t1']
>>> list(dict.fromkeys(lst))
['stop_codon', '13036', '13038', 'g4.t1', 'start_codon', '15465', '15467']

Can do it like because from Python 3.7--> are dictionaries guaranteed to keep order.
Unlike set() which still is unordered.

>>> set(lst)
{'15467', '13038', '13036', 'start_codon', '15465', 'g4.t1', 'stop_codon'}

from operator import itemgetter
from itertools import pairwise


def unquote(text):
    return text.replace("'", "").replace('"', "")


def combine_codons(data):

    # using itemgetter to access to the `fields`
    # itemgetter return a callable
    # calling the callable with an object, will return the seclected elements of the object
    # itemgetter(1)(some_list) -> will return the second element from `some_list`
    get_first = itemgetter(0, 1, 2, 3, 4, -3)
    get_second = itemgetter(2, 3, 4, -3)
    # negative indeicies starting on the right side of the list
    # -1 is the last element in the list
    # -2 is the second last element ...

    buffer = []
    for line in data.splitlines():
        row = line.split()
        if len(row) == 12 and row[2] in ("start_codon", "stop_codon"):
            buffer.append(row)

    if len(buffer) % 2 != 0:
        # what should happen, if your data is incomplete?
        print("WARNING: Count of start_codon + stop_codon is odd")

    results = []
    # itertools.pairwise was introduced with Python 3.10
    # if you're not able to use it, you can use more_itertools instead
    for first, second in pairwise(buffer):
        # get_first and get_second is the itemgetter
        # the * will unpack the elements
        # you can do it more then once in a list
        combined_row = [*get_first(first), *get_second(second)]

        # optional
        # unquoting the 6th element and the last element
        # 1st element is at index 0
        # so 6th element is at index 5
        # counting starts with 0
        combined_row[5] = unquote(combined_row[5])
        combined_row[-1] = unquote(combined_row[-1])

        # combined result is ready, put them into results
        results.append(combined_row)

    return results


# data as string
data = """CruSTS5_GC_30000 AUGUSTUS gene 13036 15467 0.24 - . g4
CruSTS5_GC_30000 AUGUSTUS transcript 13036 15467 0.24 - . g4.t1
CruSTS5_GC_30000 AUGUSTUS stop_codon 13036 13038 . - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS terminal 13036 13498 0.57 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS internal 13555 14512 0.97 - 2 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS internal 14722 14816 0.96 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS initial 14953 15467 0.59 - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS intron 13499 13554 1 - . transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS intron 14513 14721 0.81 - . transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS intron 14817 14952 0.99 - . transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 13039 13498 0.57 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 13555 14512 0.97 - 2 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 14722 14816 0.96 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 14953 15467 0.59 - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS start_codon 15465 15467 . - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS gene 15900 17819 0.36 - . g5
CruSTS5_GC_30000 AUGUSTUS transcript 16909 17819 0.19 - . g5.t1
CruSTS5_GC_30000 AUGUSTUS stop_codon 16909 16911 . - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS terminal 16909 17176 0.27 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS internal 17232 17345 0.99 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS internal 17404 17492 1 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS internal 17549 17669 1 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS initial 17728 17819 0.69 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17177 17231 0.99 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17346 17403 1 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17493 17548 1 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17670 17727 1 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 16912 17176 0.27 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17232 17345 0.99 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17404 17492 1 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17549 17669 1 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17728 17819 0.69 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS start_codon 17817 17819 . - 0 transcript_id "g5.t1"; gene_id "g5";"""

if __name__ == "__main__":
    results = combine_codons(data)

    for result in results:
        # result is a list
        # you can convert lists to a string with spaces between elements
        result_text = " ".join(result)
        print(result_text)

I have to make assumptions what is desired result as my understanding of this subject is quite hazy. I would try to collect data into dictionary instead of list. Why? Better readability and dictionaries in modern Python are guaranteed to keep insertion order.

So I would:

- read file line by line
- process line if there is '_codon' by:
- splitting line (sample data indicates, that _codon lines have similar structure),
- picking and converting needed values
- adding values to dictionary

In code it could look like:

sequences = dict()

with open('dna_data', 'r') as f:
    for line in f:
        if '_codon' in line:
            record = line.split()
            end = record[2]
            values = (int(record[3]), int(record[4]))
            identity = record[9].strip(';').strip('"')
            
            try:
                sequences[identity][end] = values 
            except KeyError:
                sequences[identity] = {end: values}

print(sequences)

Output:{'g4.t1': {'stop_codon': (13036, 13038), 'start_codon': (15465, 15467)},
 'g5.t1': {'stop_codon': (16909, 16911), 'start_codon': (17817, 17819)}}

Now I can iterate over dictionary and output data I want. As mentioned earlier - ordering of stop and start will be by insertion i.e. first encountered is first and second encountered is second. In sample data on both cases stop_codon was first. If I need one value then I can use min and max respectively.

fgaascht

menator01

perfringo

fgaascht

snippsat

DeaD_EyE

perfringo