Python Forum
sum all ints of the same line
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
sum all ints of the same line
#1
hi, Im trying to calculate the mean of a mutation level of chromosomes.
Im caculating each chromosome mutation level with this line:

formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus))
now I need to take this formula and add it up with all the other formulas of the same chromosomes, and divide it by the number of times that this chromosome appear, which I dont really know how to do or how to approach it..
this is my whole code:
def vcfToDict(file):
    if isVCF(file) == True:
        chromosome = {}
    else:
        raise ValueError("Invalid VCF Format")
    with open(file, "r+") as my_file:
        # First I wanted to clear the headline
        for line in my_file:
            if line.startswith("#"): # skip comment lines.
                continue
            line=line.rstrip('\n').split('\t')
            # This is the info column
            info = line[7].split(";")
            # Using slicing I extracted the DP4 part and removed the str DP4
            DP4 = info[-2].replace("DP4=","")
            # Then I took all the int outs and put them under the categories
            ref_plus, ref_minus, alt_plus, alt_minus = map(int, DP4.split(','))
            # Get chromosome number from first field
            chr_num = line[0].replace('chr', '')  
            # # calculated the mean for each one
            formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus))
            mut = (f'{line[3]}->{line[4]}')
            if chr_num not in chromosome:
                    chromosome[chr_num] = {}
            chromosome[chr_num][mut] = round(formula, 2)
        return chromosome
So I got this output:
{'1': {'T->C': 0.8, 'A->C': 0.8}, '3': {'T->C': 0.99, 'G->C': 0.45}, '5': {'A->G': 0.58}, '6': {'A->G': 0.89}, '7': {'A->G': 0.63, 'T->C': 0.6}, '8': {'T->C': 0.62}, '10': {'T->C': 0.44}, '11': {'T->C': 0.46}, '12': {'A->G': 0.63}, '15': {'A->G': 0.41}, '17': {'A->G': 0.48}, '18': {'A->G': 0.65}, 'X': {'T->C': 0.75}}
but this is the output I want to get:
{'1': {'T->C': 0.8, 'A->C': 0.8},
         '3': {'T->C': 0.76, 'G->C': 0.45},
         '5': {'A->G': 0.5},
         '6': {'A->G': 0.7},
         '7': {'A->G': 0.63, 'T->C': 0.6},
         '8': {'T->C': 0.62},
         '10': {'T->C': 0.62},
         '11': {'T->C': 0.46},
         '12': {'A->G': 0.63},
         '15': {'A->G': 0.41},
         '17': {'A->G': 0.48},
         '18': {'A->G': 0.65},
         'X': {'T->C': 0.65}}
the point is that this line:
formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus))
is giving me all the numbers for each line, and then I need to see if its with the same chromosome and it got the same mutation (T>C) I need to add it up and divide it by the times the chromosome appear.. which I have no clue how to do and Ill really appreciate some help.
An example of the file:
chr1	143755378	.	T	C	62	.	DP=550;VDB=0;SGB=-0.693147;RPB=1.63509e-10;MQB=1;BQB=0.861856;MQ0F=0;AC=2;AN=2;DP4=0,108,0,440;MQ=20	GT:PL:DP	1/1:89,179,0:548
chr3	57644487	.	T	C	16.4448	.	DP=300;VDB=0;SGB=-0.693147;RPB=0.993846;MQB=1;BQB=0.316525;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,166,0,134;MQ=20	GT:PL:DP	0/1:49,0,63:300
chr3	80706912	.	T	C	212	.	DP=298;VDB=0;SGB=-0.693147;RPB=0.635135;MQB=1;MQSB=1;BQB=0.609797;MQ0F=0;AC=2;AN=2;DP4=1,1,256,40;MQ=20	GT:PL:DP	1/1:239,255,0:298
chr5	33415937	.	A	G	3.70385	.	DP=164;VDB=5.60519e-45;SGB=-0.693147;RPB=0.968056;MQB=1;BQB=0.823667;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=95,0,69,0;MQ=20	GT:PL:DP	0/1:34,0,50:164
chr5	125456605	.	A	G	29.3037	.	DP=634;VDB=0;SGB=-0.693147;RPB=0.984591;MQB=1;BQB=0.886466;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=268,0,366,0;MQ=20	GT:PL:DP	0/1:62,0,49:634
chr6	31515667	.	A	G	11.696	.	DP=196;VDB=0;SGB=-0.693147;RPB=0.64451;MQB=1;BQB=7.78455e-10;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=95,0,101,0;MQ=20	GT:PL:DP	0/1:44,0,32:196
chr6	47947521	.	A	G	67	.	DP=197;VDB=0;SGB=-0.693147;RPB=0.977655;MQB=1;BQB=0.956096;MQ0F=0;AC=2;AN=2;DP4=21,0,176,0;MQ=20	GT:PL:DP	1/1:94,255,0:197
chr7	99320235	.	A	G	17.1829	.	INDEL;IDV=17;IMF=0.0944444;DP=180;VDB=1.40845e-24;SGB=-0.693141;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,22,0,37;MQ=20	GT:PL:DP	0/1:50,0,39:59
chr7	140810212	.	T	C	21.3105	.	DP=255;VDB=0;SGB=-0.693147;RPB=0.98221;MQB=1;BQB=0.998742;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,103,0,152;MQ=20	GT:PL:DP	0/1:54,0,25:255
chr8	70395182	.	T	C	14.6593	.	DP=260;VDB=0;SGB=-0.693147;RPB=0.980704;MQB=1;BQB=0.999982;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,98,0,162;MQ=20	GT:PL:DP	0/1:46,0,9:260
chr10	49244330	.	T	C	89	.	DP=421;VDB=0.0623096;SGB=-0.693147;RPB=0.669382;MQB=1;MQSB=1;BQB=0.99251;MQ0F=0;AC=2;AN=2;DP4=75,18,261,67;MQ=20	GT:PL:DP	1/1:117,115,0:421
chr10	49272776	.	T	C	14.2865	.	DP=282;VDB=0;SGB=-0.693147;RPB=0.997141;MQB=1;BQB=0.466456;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,104,0,178;MQ=20	GT:PL:DP	0/1:41,0,0:282
chr10	49272789	.	T	C	5.69849	.	DP=283;VDB=0;SGB=-0.693147;RPB=0.919973;MQB=1;BQB=0.85173;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,159,0,124;MQ=20	GT:PL:DP	0/1:37,0,59:283
So in chr7 I shouldnt add up because although its starts the same, the mutation is not in the same nucleotides, however In chr3 they are
Reply
#2
(Jun-06-2021, 07:52 PM)ranbarr Wrote: hi, Im trying to calculate the mean of a mutation level of chromosomes.
Im caculating each chromosome mutation level with this line:

formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus))
now I need to take this formula and add it up with all the other formulas of the same chromosomes, and divide it by the number of times that this chromosome appear, which I dont really know how to do or how to approach it..
this is my whole code:
def vcfToDict(file):
    if isVCF(file) == True:
        chromosome = {}
    else:
        raise ValueError("Invalid VCF Format")
    with open(file, "r+") as my_file:
        # First I wanted to clear the headline
        for line in my_file:
            if line.startswith("#"): # skip comment lines.
                continue
            line=line.rstrip('\n').split('\t')
            # This is the info column
            info = line[7].split(";")
            # Using slicing I extracted the DP4 part and removed the str DP4
            DP4 = info[-2].replace("DP4=","")
            # Then I took all the int outs and put them under the categories
            ref_plus, ref_minus, alt_plus, alt_minus = map(int, DP4.split(','))
            # Get chromosome number from first field
            chr_num = line[0].replace('chr', '')  
            # # calculated the mean for each one
            formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus))
            mut = (f'{line[3]}->{line[4]}')
            if chr_num not in chromosome:
                    chromosome[chr_num] = {}
            chromosome[chr_num][mut] = round(formula, 2)
        return chromosome
So I got this output:
{'1': {'T->C': 0.8, 'A->C': 0.8}, '3': {'T->C': 0.99, 'G->C': 0.45}, '5': {'A->G': 0.58}, '6': {'A->G': 0.89}, '7': {'A->G': 0.63, 'T->C': 0.6}, '8': {'T->C': 0.62}, '10': {'T->C': 0.44}, '11': {'T->C': 0.46}, '12': {'A->G': 0.63}, '15': {'A->G': 0.41}, '17': {'A->G': 0.48}, '18': {'A->G': 0.65}, 'X': {'T->C': 0.75}}
but this is the output I want to get:
{'1': {'T->C': 0.8, 'A->C': 0.8},
         '3': {'T->C': 0.76, 'G->C': 0.45},
         '5': {'A->G': 0.5},
         '6': {'A->G': 0.7},
         '7': {'A->G': 0.63, 'T->C': 0.6},
         '8': {'T->C': 0.62},
         '10': {'T->C': 0.62},
         '11': {'T->C': 0.46},
         '12': {'A->G': 0.63},
         '15': {'A->G': 0.41},
         '17': {'A->G': 0.48},
         '18': {'A->G': 0.65},
         'X': {'T->C': 0.65}}
the point is that this line:
formula = ((alt_minus+alt_plus)/(alt_minus+alt_plus+ref_minus+ref_plus))
is giving me all the numbers for each line, and then I need to see if its with the same chromosome and it got the same mutation (T>C) I need to add it up and divide it by the times the chromosome appear.. which I have no clue how to do and Ill really appreciate some help.
An example of the file:
chr1	143755378	.	T	C	62	.	DP=550;VDB=0;SGB=-0.693147;RPB=1.63509e-10;MQB=1;BQB=0.861856;MQ0F=0;AC=2;AN=2;DP4=0,108,0,440;MQ=20	GT:PL:DP	1/1:89,179,0:548
chr3	57644487	.	T	C	16.4448	.	DP=300;VDB=0;SGB=-0.693147;RPB=0.993846;MQB=1;BQB=0.316525;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,166,0,134;MQ=20	GT:PL:DP	0/1:49,0,63:300
chr3	80706912	.	T	C	212	.	DP=298;VDB=0;SGB=-0.693147;RPB=0.635135;MQB=1;MQSB=1;BQB=0.609797;MQ0F=0;AC=2;AN=2;DP4=1,1,256,40;MQ=20	GT:PL:DP	1/1:239,255,0:298
chr5	33415937	.	A	G	3.70385	.	DP=164;VDB=5.60519e-45;SGB=-0.693147;RPB=0.968056;MQB=1;BQB=0.823667;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=95,0,69,0;MQ=20	GT:PL:DP	0/1:34,0,50:164
chr5	125456605	.	A	G	29.3037	.	DP=634;VDB=0;SGB=-0.693147;RPB=0.984591;MQB=1;BQB=0.886466;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=268,0,366,0;MQ=20	GT:PL:DP	0/1:62,0,49:634
chr6	31515667	.	A	G	11.696	.	DP=196;VDB=0;SGB=-0.693147;RPB=0.64451;MQB=1;BQB=7.78455e-10;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=95,0,101,0;MQ=20	GT:PL:DP	0/1:44,0,32:196
chr6	47947521	.	A	G	67	.	DP=197;VDB=0;SGB=-0.693147;RPB=0.977655;MQB=1;BQB=0.956096;MQ0F=0;AC=2;AN=2;DP4=21,0,176,0;MQ=20	GT:PL:DP	1/1:94,255,0:197
chr7	99320235	.	A	G	17.1829	.	INDEL;IDV=17;IMF=0.0944444;DP=180;VDB=1.40845e-24;SGB=-0.693141;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,22,0,37;MQ=20	GT:PL:DP	0/1:50,0,39:59
chr7	140810212	.	T	C	21.3105	.	DP=255;VDB=0;SGB=-0.693147;RPB=0.98221;MQB=1;BQB=0.998742;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,103,0,152;MQ=20	GT:PL:DP	0/1:54,0,25:255
chr8	70395182	.	T	C	14.6593	.	DP=260;VDB=0;SGB=-0.693147;RPB=0.980704;MQB=1;BQB=0.999982;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,98,0,162;MQ=20	GT:PL:DP	0/1:46,0,9:260
chr10	49244330	.	T	C	89	.	DP=421;VDB=0.0623096;SGB=-0.693147;RPB=0.669382;MQB=1;MQSB=1;BQB=0.99251;MQ0F=0;AC=2;AN=2;DP4=75,18,261,67;MQ=20	GT:PL:DP	1/1:117,115,0:421
chr10	49272776	.	T	C	14.2865	.	DP=282;VDB=0;SGB=-0.693147;RPB=0.997141;MQB=1;BQB=0.466456;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,104,0,178;MQ=20	GT:PL:DP	0/1:41,0,0:282
chr10	49272789	.	T	C	5.69849	.	DP=283;VDB=0;SGB=-0.693147;RPB=0.919973;MQB=1;BQB=0.85173;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,159,0,124;MQ=20	GT:PL:DP	0/1:37,0,59:283
So in chr7 I shouldnt add up because although its starts the same, the mutation is not in the same nucleotides, however In chr3 they are
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020