hi, Im trying to calculate the mean of a mutation level of chromosomes.
Im caculating each chromosome mutation level with this line:
this is my whole code:
An example of the file:
Im caculating each chromosome mutation level with this line:
formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus))now I need to take this formula and add it up with all the other formulas of the same chromosomes, and divide it by the number of times that this chromosome appear, which I dont really know how to do or how to approach it..
this is my whole code:
def vcfToDict(file): if isVCF(file) == True: chromosome = {} else: raise ValueError("Invalid VCF Format") with open(file, "r+") as my_file: # First I wanted to clear the headline for line in my_file: if line.startswith("#"): # skip comment lines. continue line=line.rstrip('\n').split('\t') # This is the info column info = line[7].split(";") # Using slicing I extracted the DP4 part and removed the str DP4 DP4 = info[-2].replace("DP4=","") # Then I took all the int outs and put them under the categories ref_plus, ref_minus, alt_plus, alt_minus = map(int, DP4.split(',')) # Get chromosome number from first field chr_num = line[0].replace('chr', '') # # calculated the mean for each one formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus)) mut = (f'{line[3]}->{line[4]}') if chr_num not in chromosome: chromosome[chr_num] = {} chromosome[chr_num][mut] = round(formula, 2) return chromosomeSo I got this output:
{'1': {'T->C': 0.8, 'A->C': 0.8}, '3': {'T->C': 0.99, 'G->C': 0.45}, '5': {'A->G': 0.58}, '6': {'A->G': 0.89}, '7': {'A->G': 0.63, 'T->C': 0.6}, '8': {'T->C': 0.62}, '10': {'T->C': 0.44}, '11': {'T->C': 0.46}, '12': {'A->G': 0.63}, '15': {'A->G': 0.41}, '17': {'A->G': 0.48}, '18': {'A->G': 0.65}, 'X': {'T->C': 0.75}}but this is the output I want to get:
{'1': {'T->C': 0.8, 'A->C': 0.8}, '3': {'T->C': 0.76, 'G->C': 0.45}, '5': {'A->G': 0.5}, '6': {'A->G': 0.7}, '7': {'A->G': 0.63, 'T->C': 0.6}, '8': {'T->C': 0.62}, '10': {'T->C': 0.62}, '11': {'T->C': 0.46}, '12': {'A->G': 0.63}, '15': {'A->G': 0.41}, '17': {'A->G': 0.48}, '18': {'A->G': 0.65}, 'X': {'T->C': 0.65}}the point is that this line:
formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus))is giving me all the numbers for each line, and then I need to see if its with the same chromosome and it got the same mutation (T>C) I need to add it up and divide it by the times the chromosome appear.. which I have no clue how to do and Ill really appreciate some help.
An example of the file:
chr1 143755378 . T C 62 . DP=550;VDB=0;SGB=-0.693147;RPB=1.63509e-10;MQB=1;BQB=0.861856;MQ0F=0;AC=2;AN=2;DP4=0,108,0,440;MQ=20 GT:PL:DP 1/1:89,179,0:548 chr3 57644487 . T C 16.4448 . DP=300;VDB=0;SGB=-0.693147;RPB=0.993846;MQB=1;BQB=0.316525;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,166,0,134;MQ=20 GT:PL:DP 0/1:49,0,63:300 chr3 80706912 . T C 212 . DP=298;VDB=0;SGB=-0.693147;RPB=0.635135;MQB=1;MQSB=1;BQB=0.609797;MQ0F=0;AC=2;AN=2;DP4=1,1,256,40;MQ=20 GT:PL:DP 1/1:239,255,0:298 chr5 33415937 . A G 3.70385 . DP=164;VDB=5.60519e-45;SGB=-0.693147;RPB=0.968056;MQB=1;BQB=0.823667;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=95,0,69,0;MQ=20 GT:PL:DP 0/1:34,0,50:164 chr5 125456605 . A G 29.3037 . DP=634;VDB=0;SGB=-0.693147;RPB=0.984591;MQB=1;BQB=0.886466;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=268,0,366,0;MQ=20 GT:PL:DP 0/1:62,0,49:634 chr6 31515667 . A G 11.696 . DP=196;VDB=0;SGB=-0.693147;RPB=0.64451;MQB=1;BQB=7.78455e-10;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=95,0,101,0;MQ=20 GT:PL:DP 0/1:44,0,32:196 chr6 47947521 . A G 67 . DP=197;VDB=0;SGB=-0.693147;RPB=0.977655;MQB=1;BQB=0.956096;MQ0F=0;AC=2;AN=2;DP4=21,0,176,0;MQ=20 GT:PL:DP 1/1:94,255,0:197 chr7 99320235 . A G 17.1829 . INDEL;IDV=17;IMF=0.0944444;DP=180;VDB=1.40845e-24;SGB=-0.693141;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,22,0,37;MQ=20 GT:PL:DP 0/1:50,0,39:59 chr7 140810212 . T C 21.3105 . DP=255;VDB=0;SGB=-0.693147;RPB=0.98221;MQB=1;BQB=0.998742;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,103,0,152;MQ=20 GT:PL:DP 0/1:54,0,25:255 chr8 70395182 . T C 14.6593 . DP=260;VDB=0;SGB=-0.693147;RPB=0.980704;MQB=1;BQB=0.999982;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,98,0,162;MQ=20 GT:PL:DP 0/1:46,0,9:260 chr10 49244330 . T C 89 . DP=421;VDB=0.0623096;SGB=-0.693147;RPB=0.669382;MQB=1;MQSB=1;BQB=0.99251;MQ0F=0;AC=2;AN=2;DP4=75,18,261,67;MQ=20 GT:PL:DP 1/1:117,115,0:421 chr10 49272776 . T C 14.2865 . DP=282;VDB=0;SGB=-0.693147;RPB=0.997141;MQB=1;BQB=0.466456;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,104,0,178;MQ=20 GT:PL:DP 0/1:41,0,0:282 chr10 49272789 . T C 5.69849 . DP=283;VDB=0;SGB=-0.693147;RPB=0.919973;MQB=1;BQB=0.85173;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,159,0,124;MQ=20 GT:PL:DP 0/1:37,0,59:283So in chr7 I shouldnt add up because although its starts the same, the mutation is not in the same nucleotides, however In chr3 they are