Help with output from if statement - Printable Version
Python Forum (https://python-forum.io), General Coding Help (https://python-forum.io/forum-8.html), Thread: Help with output from if statement (/thread-36105.html)
Help with output from if statement - fgaascht - Jan-17-2022

Hi,

I recently started learning Python and am developing my skills by writing small scripts to help with my work as a biologist. I have the following file:

Quote:
    CruSTS5_GC_30000 AUGUSTUS gene 13036 15467 0.24 - . g4

It is a .gtf file with DNA data, and I would like to extract some specific data from it. I have seen that there are existing scripts for handling gtf files, but for learning purposes I would like to write my own.

I want to extract the lines containing "start_codon" and "stop_codon", combine each pair of successive lines (which can be start_codon + stop_codon or stop_codon + start_codon), then run a small if statement to tell me the orientation (start --> stop or stop <-- start) and generate a small table, potentially as a csv file, containing the name of the gene, its orientation, and the positions of the stop and start codons.

What I have done so far, as a beginner, is open my .gtf file and treat it like any .txt file: read the lines, remove the characters that would be a problem (; and "), convert each line to a list, keep the elements I want based on their index, and print the "start_codon" and "stop_codon" lines.

    import re

    DataFile = "MyGTFFile.gtf"

    with open(DataFile, "r") as GCData:
        Data = GCData.readlines()

    for Lines in Data:
        Lines = Lines.strip()  # Remove the newline at the end
        Lines = re.sub(r"\s+", "\t", Lines)  # Replace runs of whitespace with a tab (raw string for the regex)
        Lines = re.sub(";", "", Lines)  # Replace ; with nothing ("")
        Lines = re.sub('"', "", Lines)  # Replace " with nothing ("")
        if re.findall("start_codon|stop_codon", Lines):  # "|" means "or"
            Lines = Lines.split("\t")  # Convert string to list
            IndexToKeep = [2, 3, 4, 9]  # List of indexes to keep
            # Select by position; Lines.index(value) would return the first match for repeated values
            Lines = [Field for Position, Field in enumerate(Lines) if Position in IndexToKeep]
            if "start_codon" in Lines:
                StartLine = Lines
                print(StartLine)
            else:
                StopLine = Lines
                print(StopLine)

I have several questions:
- Is converting the lines to lists a reasonable approach, or should I use a different one?
- With my small script, I would like to combine each "start_codon" list with the following "stop_codon" list (or "stop_codon" with "start_codon"). I would like to merge them in the order they appear in the file, because that gives me the orientation and makes the data easier to analyse afterwards. I could not find a way to merge two outputs that each come from a different branch of the if statement. What would be the best solution?
- After generating this single line, am I right to plan on running several if statements that use the value indexes to extract data and generate a summary .csv file?

Thank you in advance.

RE: Help with output from if statement - menator01 - Jan-17-2022

Is this what you are trying to do?

    file = 'MyGTFFile.gtf'
    mylist = []
    with open(file, 'r') as data:
        lines = data.readlines()
        for line in lines:
            if 'start_codon' or 'stop_codon' in line:
                mylist.append(line.replace('"', '').replace(';', '').strip('\n'))

    for item in mylist:
        print(item)

Never mind, I just saw that my code didn't select only the specified lines. Now only the specified lines are stored in the list:

    file = 'MyGTFFile.gtf'
    mylist = []
    with open(file, 'r') as data:
        lines = data.readlines()
        for line in lines:
            if 'start_codon' in line or 'stop_codon' in line:
                mylist.append(line.replace('"', '').replace(';', '').strip('\n'))

    for item in mylist:
        print(item)

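For readers following along, here is a minimal sketch of why the first version matched every line while the corrected one does not. The sample `line` is the gene row quoted at the top of the thread; the variable names `always_true`, `explicit`, `wanted` and `with_any` are made up for this illustration.

```python
line = "CruSTS5_GC_30000 AUGUSTUS gene 13036 15467 0.24 - . g4"

# Python parses this as: 'start_codon' or ('stop_codon' in line).
# A non-empty string is truthy, so the expression is truthy for every line.
always_true = bool('start_codon' or 'stop_codon' in line)

# Writing out each membership test gives the intended check.
explicit = 'start_codon' in line or 'stop_codon' in line

# An equivalent form using any() over the wanted feature names.
wanted = ('start_codon', 'stop_codon')
with_any = any(name in line for name in wanted)

print(always_true, explicit, with_any)
```

The `any()` form is only a convenience when the list of wanted features grows; for two names, the explicit `or` from the corrected post reads just as well.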
RE: Help with output from if statement - perfringo - Jan-17-2022

@menator01: there is no actual need to use .readlines() (and the additional memory it takes). One can iterate directly over the file object (for line in data:).

Judging by the underlying data, the if-condition can be simplified as well: if "_codon" in line:.

Printing the list can be done a little shorter and without repeating 'item': print(*mylist, sep='\n')

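Put together, the three suggestions above might look like the following sketch. To keep it self-contained, an `io.StringIO` object stands in for the opened .gtf file (an assumption of this example, not part of the original posts); the three rows are copied from the sample data posted in this thread.

```python
import io

# Stand-in for open('MyGTFFile.gtf', 'r'); rows taken from the thread's sample data.
sample = io.StringIO(
    'CruSTS5_GC_30000 AUGUSTUS gene 13036 15467 0.24 - . g4\n'
    'CruSTS5_GC_30000 AUGUSTUS stop_codon 13036 13038 . - 0 transcript_id "g4.t1"; gene_id "g4";\n'
    'CruSTS5_GC_30000 AUGUSTUS start_codon 15465 15467 . - 0 transcript_id "g4.t1"; gene_id "g4";\n'
)

mylist = []
with sample as data:
    for line in data:  # iterate over the file object directly, no readlines()
        if '_codon' in line:  # matches both start_codon and stop_codon rows
            mylist.append(line.replace('"', '').replace(';', '').strip('\n'))

# Unpack the list into print() and separate the items with newlines.
print(*mylist, sep='\n')
```

Note that `'_codon' in line` relies on no other feature name in the file containing that substring, which holds for the sample data shown here.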
RE: Help with output from if statement - fgaascht - Jan-17-2022

Hi menator01,

Thank you for your help. With my current script, my output is the following:

[output not preserved]

Now I would like to combine the first line with the second, the third with the fourth, and so on for all the "start_codon" and "stop_codon" lines, to get an output like this:

[output not preserved]

or, merging the identical values (gX.tX):

[output not preserved]

The next step would be to generate a small text output and a .csv file based on the list indexes, to get something like:

[output not preserved]

As I have only recently started learning Python, and given the structure of my .gtf file and how I would like to export the data, I thought that formatting my data as lists would be the best approach. Am I right? Or would converting them to strings make them easier to manipulate for export?

The main challenge I am now facing is how to join two lines (start_codon with stop_codon) when each one comes from a different branch of the if statement (if the line contains start_codon it becomes StartLine, otherwise it becomes StopLine).

At first I would like to try to code the part that generates the text and .csv file by myself, but I would definitely appreciate help with merging "start/stop_codon" or "stop/start_codon" when both come out of the same if statement. As a beginner in Python programming, I am always open to remarks or ideas, especially about my coding approach.

Thanks again for your help.

RE: Help with output from if statement - snippsat - Jan-17-2022

(Jan-17-2022, 11:21 AM)fgaascht Wrote: Now, I would like to combine the first line with the second line and the third line with the fourth line, and so on with all the "start_codon" and "stop_codon" lines to get an output like this:

zip() is commonly used for this.

    >>> new_lst = zip(lst[0::2], lst[1::2])
    >>> # Or fancier
    >>> # new_lst = zip(*(iter(lst),) * 2)
    >>> new_lst
    <zip object at 0x0000020471CE7080>
    >>> for item in new_lst:
    ...     print(item)
    ...
    (['stop_codon', '13036', '13038', 'g4.t1'], ['start_codon', '15465', '15467', 'g4.t1'])
    (['stop_codon', '16909', '16911', 'g5.t1'], ['start_codon', '17817', '17819', 'g5.t1'])
    (['stop_codon', '18965', '18967', 'g6.t2'], ['start_codon', '22307', '22309', 'g6.t2'])
    (['start_codon', '22846', '22848', 'g7.t1'], ['stop_codon', '24171', '24173', 'g7.t1'])

Now the lists need to be joined together; this can be done with + or extend.

    >>> l1 = [1, 2]
    >>> l2 = [3, 4]
    >>> l1 + l2
    [1, 2, 3, 4]
    >>> l1.extend(l2)
    >>> l1
    [1, 2, 3, 4]

(Jan-17-2022, 11:21 AM)fgaascht Wrote: or, merging the identical values (gX.tX)

Like this, to preserve the ordering.

    >>> lst = ['stop_codon', '13036', '13038', 'g4.t1', 'start_codon', '15465', '15467', 'g4.t1']
    >>> list(dict.fromkeys(lst))
    ['stop_codon', '13036', '13038', 'g4.t1', 'start_codon', '15465', '15467']

This works because from Python 3.7 onward dictionaries are guaranteed to keep insertion order, unlike set(), which is still unordered.

    >>> set(lst)
    {'15467', '13038', '13036', 'start_codon', '15465', 'g4.t1', 'stop_codon'}

RE: Help with output from if statement - DeaD_EyE - Jan-17-2022

    from operator import itemgetter


    def unquote(text):
        # remove quotes and the trailing ; of a GTF attribute value
        return text.replace("'", "").replace('"', "").rstrip(";")


    def combine_codons(data):
        # using itemgetter to access the fields
        # itemgetter returns a callable
        # calling the callable with an object returns the selected elements of the object
        # itemgetter(1)(some_list) -> returns the second element of `some_list`
        get_first = itemgetter(0, 1, 2, 3, 4, -3)
        get_second = itemgetter(2, 3, 4, -3)
        # negative indices start on the right side of the list
        # -1 is the last element in the list
        # -2 is the second last element ...

        buffer = []
        for line in data.splitlines():
            row = line.split()
            if len(row) == 12 and row[2] in ("start_codon", "stop_codon"):
                buffer.append(row)

        if len(buffer) % 2 != 0:
            # what should happen, if your data is incomplete?
            print("WARNING: Count of start_codon + stop_codon is odd")

        results = []
        # combine rows 0+1, 2+3, ... (non-overlapping pairs)
        # itertools.pairwise is not used here: it yields overlapping pairs
        # (0+1, 1+2, ...), which would also combine the last codon of one
        # transcript with the first codon of the next
        for first, second in zip(buffer[0::2], buffer[1::2]):
            # get_first and get_second are the itemgetters
            # the * will unpack the elements
            # you can do it more than once in a list
            combined_row = [*get_first(first), *get_second(second)]

            # optional
            # unquoting the 6th element and the last element
            # 1st element is at index 0
            # so 6th element is at index 5
            # counting starts with 0
            combined_row[5] = unquote(combined_row[5])
            combined_row[-1] = unquote(combined_row[-1])

            # combined result is ready, put it into results
            results.append(combined_row)

        return results


    # data as string
    data = """CruSTS5_GC_30000 AUGUSTUS gene 13036 15467 0.24 - . g4
    CruSTS5_GC_30000 AUGUSTUS transcript 13036 15467 0.24 - . g4.t1
    CruSTS5_GC_30000 AUGUSTUS stop_codon 13036 13038 . - 0 transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS terminal 13036 13498 0.57 - 1 transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS internal 13555 14512 0.97 - 2 transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS internal 14722 14816 0.96 - 1 transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS initial 14953 15467 0.59 - 0 transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS intron 13499 13554 1 - . transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS intron 14513 14721 0.81 - . transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS intron 14817 14952 0.99 - . transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS CDS 13039 13498 0.57 - 1 transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS CDS 13555 14512 0.97 - 2 transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS CDS 14722 14816 0.96 - 1 transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS CDS 14953 15467 0.59 - 0 transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS start_codon 15465 15467 . - 0 transcript_id "g4.t1"; gene_id "g4";
    CruSTS5_GC_30000 AUGUSTUS gene 15900 17819 0.36 - . g5
    CruSTS5_GC_30000 AUGUSTUS transcript 16909 17819 0.19 - . g5.t1
    CruSTS5_GC_30000 AUGUSTUS stop_codon 16909 16911 . - 0 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS terminal 16909 17176 0.27 - 1 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS internal 17232 17345 0.99 - 1 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS internal 17404 17492 1 - 0 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS internal 17549 17669 1 - 1 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS initial 17728 17819 0.69 - 0 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS intron 17177 17231 0.99 - . transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS intron 17346 17403 1 - . transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS intron 17493 17548 1 - . transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS intron 17670 17727 1 - . transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS CDS 16912 17176 0.27 - 1 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS CDS 17232 17345 0.99 - 1 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS CDS 17404 17492 1 - 0 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS CDS 17549 17669 1 - 1 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS CDS 17728 17819 0.69 - 0 transcript_id "g5.t1"; gene_id "g5";
    CruSTS5_GC_30000 AUGUSTUS start_codon 17817 17819 . - 0 transcript_id "g5.t1"; gene_id "g5";"""


    if __name__ == "__main__":
        results = combine_codons(data)
        for result in results:
            # result is a list
            # you can convert lists to a string with spaces between elements
            result_text = " ".join(result)
            print(result_text)

RE: Help with output from if statement - perfringo - Jan-17-2022

I have to make assumptions about the desired result, as my understanding of this subject is quite hazy.

I would try to collect the data into a dictionary instead of a list. Why? Better readability, and dictionaries in modern Python are guaranteed to keep insertion order.

So I would:
- read the file line by line
- process a line if it contains '_codon' by:
  - splitting the line (the sample data indicates that _codon lines have a similar structure),
  - picking and converting the needed values,
  - adding the values to the dictionary

In code it could look like:

    sequences = dict()

    with open('dna_data', 'r') as f:
        for line in f:
            if '_codon' in line:
                record = line.split()
                end = record[2]
                values = (int(record[3]), int(record[4]))
                identity = record[9].strip(';').strip('"')
                try:
                    sequences[identity][end] = values
                except KeyError:
                    sequences[identity] = {end: values}

    print(sequences)

Now I can iterate over the dictionary and output the data I want. As mentioned earlier, the ordering of stop and start will be by insertion, i.e. the first one encountered comes first and the second one encountered comes second. In the sample data stop_codon came first in both cases. If I need a single value, I can use min and max respectively.
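
As a possible next step toward the summary .csv file asked about at the start of the thread, the sketch below iterates over a dictionary shaped like the one built in the dictionary approach above. Several things here are assumptions for illustration only: the function name `summarize`, the column names, writing to an in-memory `io.StringIO` instead of a real file, and the rule that orientation is decided purely by which codon was inserted first. The values come from the thread's sample data and from the g7.t1 pair in the zip() output.

```python
import csv
import io

# Same shape as `sequences` in the dictionary approach; values copied from the thread.
sequences = {
    'g4.t1': {'stop_codon': (13036, 13038), 'start_codon': (15465, 15467)},
    'g7.t1': {'start_codon': (22846, 22848), 'stop_codon': (24171, 24173)},
}


def summarize(sequences):
    """Build one row per transcript: id, orientation, start and stop positions."""
    rows = []
    for identity, codons in sequences.items():
        if 'start_codon' not in codons or 'stop_codon' not in codons:
            continue  # incomplete pair; simply skipped in this sketch
        first = next(iter(codons))  # first key inserted = first codon seen in the file
        orientation = 'start --> stop' if first == 'start_codon' else 'stop <-- start'
        start = codons['start_codon']
        stop = codons['stop_codon']
        rows.append([identity, orientation, start[0], start[1], stop[0], stop[1]])
    return rows


rows = summarize(sequences)

out = io.StringIO()  # stand-in for open('summary.csv', 'w', newline='')
writer = csv.writer(out)
writer.writerow(['transcript', 'orientation', 'start_from', 'start_to', 'stop_from', 'stop_to'])
writer.writerows(rows)
print(out.getvalue())
```

Keeping the row-building step in its own function means the same rows can be printed as text or written to a file without repeating the if logic.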