May-31-2021, 03:17 PM
Hi,
I have an assignment in which I have a file that contains a lot of chromosomes, and I need to calculate the mutation level for each one of them.
The problem is that each chromosome can appear several times, and I need to find the mean of all the mutation levels for that chromosome.
The mutation level is calculated from DP4 under INFO, which contains four numbers that represent [ref+, ref-, alt+, alt-]: it is (alt+ + alt-) / (ref+ + ref- + alt+ + alt-).
Example of the file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Aligned.sortedByCoord.out.bam
chr1 143755378 . T C 62 . DP=550;VDB=0;SGB=-0.693147;RPB=1.63509e-10;MQB=1;BQB=0.861856;MQ0F=0;AC=2;AN=2;DP4=0,108,0,440;MQ=20 GT:PL:DP 1/1:89,179,0:548
chr3 57644487 . T C 16.4448 . DP=300;VDB=0;SGB=-0.693147;RPB=0.993846;MQB=1;BQB=0.316525;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,166,0,134;MQ=20 GT:PL:DP 0/1:49,0,63:300
chr3 80706912 . T C 212 . DP=298;VDB=0;SGB=-0.693147;RPB=0.635135;MQB=1;MQSB=1;BQB=0.609797;MQ0F=0;AC=2;AN=2;DP4=1,1,256,40;MQ=20 GT:PL:DP 1/1:239,255,0:298
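The DP4 rule described above can be sketched for a single record as follows. This is only a minimal sketch under assumptions: the helper name `mutation_level` is hypothetical, the INFO string is a shortened version of the chr1 line, and DP4 is assumed to be present in every record. It looks DP4 up by name rather than by position, so it does not depend on where DP4 sits inside INFO.

```python
def mutation_level(info):
    """Return (alt+ + alt-) / (ref+ + ref- + alt+ + alt-) for one INFO string."""
    # Build a key -> value mapping from "KEY=VALUE" entries; bare flags are skipped
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    # DP4 holds four comma-separated counts: ref+, ref-, alt+, alt-
    ref_plus, ref_minus, alt_plus, alt_minus = (int(n) for n in fields["DP4"].split(","))
    total = ref_plus + ref_minus + alt_plus + alt_minus
    # Assumption: a record with no reads at all gets level 0.0 instead of raising
    return (alt_plus + alt_minus) / total if total else 0.0


# Shortened INFO column of the chr1 record (only some of the keys are kept)
info = "DP=550;AC=2;AN=2;DP4=0,108,0,440;MQ=20"
print(round(mutation_level(info), 3))  # 440 / 548 -> 0.803
```

Looking the key up in a dictionary is a design choice, not a requirement; slicing by position also works as long as every line has the same INFO layout.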
This is what I have done so far, and I'm stuck on how to continue from this point:
    def vcf(file):
        with open(file, "r") as my_file:
            for line in my_file:
                # Skip the header lines
                if line.startswith("#"):
                    continue
                # Split the line into columns
                columns = line.rstrip("\n").split("\t")
                # This is the INFO column
                row = columns[7].split(";")
                # Extract the DP4 part by slicing and remove the "DP4=" prefix
                DP4 = [row[-2]]
                new_DP4 = [x.replace("DP4=", "") for x in DP4]
                # Convert the four counts to int and put them under their categories
                for x in new_DP4:
                    xyz = x.split(",")
                    ref_plus = int(xyz[0])
                    ref_minus = int(xyz[1])
                    alt_plus = int(xyz[2])
                    alt_minus = int(xyz[3])
                # Mutation level of this record
                formula = (alt_minus + alt_plus) / (alt_minus + alt_plus + ref_minus + ref_plus)
                # Chromosome, REF, ALT and the mutation level
                chr_form = [columns[0], columns[3], columns[4], formula]
Right now my goal is to find the mean for each chromosome (for example, the average mutation level over all chr3 records).
The output I want to get in the end looks like this:
{1: {'T->C': 0.802}, 3: {'T->C': 0.446}}
I hope I was clear, and I'll appreciate any kind of help. Thanks!
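One possible way to get from per-record mutation levels to a per-chromosome mean is to collect the levels in a dictionary of lists and average them at the end. The sketch below is a suggestion under assumptions, not a definitive solution: the function name `mean_mutation_levels` is hypothetical, the columns are assumed to be tab-separated, the INFO columns in the sample are shortened, and "chr3" is turned into the key 3 to match the requested output shape.

```python
from collections import defaultdict


def mean_mutation_levels(lines):
    """Map chromosome -> {"REF->ALT": mean mutation level} over all its records."""
    levels = defaultdict(lambda: defaultdict(list))
    for line in lines:
        # Skip header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        chrom, ref, alt, info = columns[0], columns[3], columns[4], columns[7]
        # Find DP4 by name inside the INFO column
        fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        ref_plus, ref_minus, alt_plus, alt_minus = map(int, fields["DP4"].split(","))
        total = ref_plus + ref_minus + alt_plus + alt_minus
        if total == 0:
            continue  # assumption: records without any reads are ignored
        # Assumption: "chr3" becomes 3; non-numeric names such as "chrX" stay strings
        name = chrom[3:] if chrom.startswith("chr") else chrom
        key = int(name) if name.isdigit() else name
        levels[key][f"{ref}->{alt}"].append((alt_plus + alt_minus) / total)
    # Average the collected levels per chromosome and per REF->ALT change
    return {
        chrom: {change: sum(vals) / len(vals) for change, vals in changes.items()}
        for chrom, changes in levels.items()
    }


# The three sample records, with INFO shortened to the keys that matter here
sample = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tAligned.sortedByCoord.out.bam",
    "chr1\t143755378\t.\tT\tC\t62\t.\tDP=550;DP4=0,108,0,440;MQ=20\tGT:PL:DP\t1/1:89,179,0:548",
    "chr3\t57644487\t.\tT\tC\t16.4448\t.\tDP=300;DP4=0,166,0,134;MQ=20\tGT:PL:DP\t0/1:49,0,63:300",
    "chr3\t80706912\t.\tT\tC\t212\t.\tDP=298;DP4=1,1,256,40;MQ=20\tGT:PL:DP\t1/1:239,255,0:298",
]
result = mean_mutation_levels(sample)
print(result)
```

For these three records the chr3 value is the mean of 134/300 and 296/298, roughly 0.720, and the chr1 value is 440/548, roughly 0.803. Since the function only iterates over lines, an open file object can be passed in place of the list.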