Jan-17-2022, 09:00 AM
Hi,
I recently started learning Python and am developing my skills by writing small scripts to help with my work as a biologist.
I have the following file:
CruSTS5_GC_30000 AUGUSTUS gene 13036 15467 0.24 - . g4
CruSTS5_GC_30000 AUGUSTUS transcript 13036 15467 0.24 - . g4.t1
CruSTS5_GC_30000 AUGUSTUS stop_codon 13036 13038 . - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS terminal 13036 13498 0.57 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS internal 13555 14512 0.97 - 2 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS internal 14722 14816 0.96 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS initial 14953 15467 0.59 - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS intron 13499 13554 1 - . transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS intron 14513 14721 0.81 - . transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS intron 14817 14952 0.99 - . transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 13039 13498 0.57 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 13555 14512 0.97 - 2 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 14722 14816 0.96 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 14953 15467 0.59 - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS start_codon 15465 15467 . - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS gene 15900 17819 0.36 - . g5
CruSTS5_GC_30000 AUGUSTUS transcript 16909 17819 0.19 - . g5.t1
CruSTS5_GC_30000 AUGUSTUS stop_codon 16909 16911 . - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS terminal 16909 17176 0.27 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS internal 17232 17345 0.99 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS internal 17404 17492 1 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS internal 17549 17669 1 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS initial 17728 17819 0.69 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17177 17231 0.99 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17346 17403 1 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17493 17548 1 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS intron 17670 17727 1 - . transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 16912 17176 0.27 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17232 17345 0.99 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17404 17492 1 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17549 17669 1 - 1 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS CDS 17728 17819 0.69 - 0 transcript_id "g5.t1"; gene_id "g5";
CruSTS5_GC_30000 AUGUSTUS start_codon 17817 17819 . - 0 transcript_id "g5.t1"; gene_id "g5";
It is a .gtf file with DNA data, and I would like to extract some specific data from it.
I have seen that scripts for handling GTF files already exist, but for learning purposes I would like to write my own.
I want to extract the lines containing "start_codon" and "stop_codon" and combine each pair of successive lines, which can appear as start_codon + stop_codon or as stop_codon + start_codon. I then want to run a small if statement to report the orientation (start --> stop or stop <-- start) and generate a small table, possibly as a CSV file, containing the name of the gene, its orientation, and the positions of the start and stop codons.
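To make the goal concrete, here is a minimal sketch of the pairing I have in mind. It is only an illustration under some assumptions: the function names `parse_codon_line` and `pair_codons` are hypothetical, the sample lines are copied from the file above, and the fields are assumed to be separated by whitespace with the transcript ID at index 9.

```python
# Hypothetical sketch: pair successive start_codon/stop_codon records.
# SAMPLE holds three lines copied from the .gtf excerpt above.
SAMPLE = '''CruSTS5_GC_30000 AUGUSTUS stop_codon 13036 13038 . - 0 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS CDS 13039 13498 0.57 - 1 transcript_id "g4.t1"; gene_id "g4";
CruSTS5_GC_30000 AUGUSTUS start_codon 15465 15467 . - 0 transcript_id "g4.t1"; gene_id "g4";
'''


def parse_codon_line(line):
    """Return a dict for a start_codon/stop_codon line, otherwise None."""
    fields = line.split()  # assumption: whitespace-separated fields
    if len(fields) < 10 or fields[2] not in ("start_codon", "stop_codon"):
        return None
    return {
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "transcript": fields[9].strip('";'),  # drop the quotes and semicolon
    }


def pair_codons(lines):
    """Pair each codon record with the next one from the same transcript."""
    pairs = []
    pending = None  # codon record still waiting for its partner
    for line in lines:
        record = parse_codon_line(line)
        if record is None:
            continue
        if (pending is not None
                and pending["transcript"] == record["transcript"]
                and pending["feature"] != record["feature"]):
            # The order of appearance in the file gives the orientation.
            if pending["feature"] == "start_codon":
                orientation = "start --> stop"
            else:
                orientation = "stop <-- start"
            pairs.append((pending, record, orientation))
            pending = None
        else:
            pending = record
    return pairs


pairs = pair_codons(SAMPLE.splitlines())
for first, second, orientation in pairs:
    print(first["transcript"], orientation, first["start"], second["start"])
```

The idea is to remember the previous codon record in a variable instead of printing it straight away, so that the two records can be combined once the second one is read.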
As a beginner, what I have done so far is open my .gtf file and treat it like any .txt file: I read the lines, remove the characters that would cause problems (; and "), split each line into a list, keep the elements I want based on their index, and print the "start_codon" and "stop_codon" lines.
    import re

    DataFile = "MyGTFFile.gtf"

    with open(DataFile, "r") as GCData:
        Data = GCData.readlines()

    for Lines in Data:
        Lines = Lines.strip()                 # Remove the newline at the end
        Lines = re.sub(r"\s+", "\t", Lines)   # Replace runs of whitespace with a tab
        Lines = re.sub(";", "", Lines)        # Remove ;
        Lines = re.sub('"', "", Lines)        # Remove "
        if re.search("start_codon|stop_codon", Lines):  # "|" acts as "or"
            Lines = Lines.split("\t")         # Convert the string to a list
            IndexToKeep = [2, 3, 4, 9]        # List of indexes to keep
            # Select by position; list.index() returns the first match,
            # which picks the wrong element when a value is repeated
            Lines = [Lines[Index] for Index in IndexToKeep]
            if "start_codon" in Lines:
                StartLine = Lines
                print(StartLine)
            else:
                StopLine = Lines
                print(StopLine)

I have several questions to solve my challenge:
- Is converting the lines to lists the right approach, or should I use a different one?
- With my small script, I would like to combine each successive pair of lists, "start_codon" with "stop_codon" or "stop_codon" with "start_codon". I would like to merge them in the order they appear in the file, because that would give me the orientation and make the data easier to analyse afterwards. I could not find a way to merge two outputs that each come from a different branch of the if statement. What would be the best solution?
- After generating this single merged line, would it be reasonable to run several if statements that use the value indexes to extract the data and generate a summary .csv file?
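For the last point, the summary table I am aiming for could look like the sketch below. It is only an assumption about the final layout: the helper `write_summary` and the column names are hypothetical, and the rows are filled in by hand from the g4 and g5 codon positions in the file above. It uses the standard `csv` module and writes to an in-memory buffer so that it runs without creating a file.

```python
import csv
import io

# Hypothetical merged records, typed in by hand from the excerpt above:
# (gene, orientation, start codon position, stop codon position)
rows = [
    ("g4", "stop <-- start", 15465, 13036),
    ("g5", "stop <-- start", 17817, 16909),
]


def write_summary(rows, handle):
    """Write a header line followed by one CSV line per gene."""
    writer = csv.writer(handle)
    writer.writerow(["gene", "orientation", "start_codon", "stop_codon"])
    writer.writerows(rows)


# An in-memory text buffer stands in for a real output file here.
buffer = io.StringIO()
write_summary(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead, the same function could be given a handle opened with `open("summary.csv", "w", newline="")`, where the file name is only an example.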
Thank you in advance.