Calculate mean only for match strings

ranbarr · May-31-2021, 03:17 PM

Hi,
I have an assignment with a file that contains a lot of chromosomes, and I need to calculate the mutation level for each one of them.
The problem is that each chromosome can appear several times, and I need to find the mean of all the mutation levels for that chromosome.
The mutation level is calculated from DP4 under INFO, which contains four numbers in the form [ref+,ref-,alt+,alt-], as (alt+ + alt-) / (ref+ + ref- + alt+ + alt-).
Example of the file:

Output:
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Aligned.sortedByCoord.out.bam
chr1	143755378	.	T	C	62	.	DP=550;VDB=0;SGB=-0.693147;RPB=1.63509e-10;MQB=1;BQB=0.861856;MQ0F=0;AC=2;AN=2;DP4=0,108,0,440;MQ=20	GT:PL:DP	1/1:89,179,0:548
chr3	57644487	.	T	C	16.4448	.	DP=300;VDB=0;SGB=-0.693147;RPB=0.993846;MQB=1;BQB=0.316525;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,166,0,134;MQ=20	GT:PL:DP	0/1:49,0,63:300
chr3	80706912	.	T	C	212	.	DP=298;VDB=0;SGB=-0.693147;RPB=0.635135;MQB=1;MQSB=1;BQB=0.609797;MQ0F=0;AC=2;AN=2;DP4=1,1,256,40;MQ=20	GT:PL:DP	1/1:239,255,0:298
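
For reference, the calculation described above can be written for a single INFO string. This is only a minimal sketch: the helper name `mutation_level` is made up for illustration, it assumes a `DP4=` entry is present, and it looks the entry up by its prefix rather than by its position in INFO. Returning 0.0 for a site with zero depth is also an assumption.

```python
def mutation_level(info):
    """Return the fraction of alt reads from the DP4 entry of a VCF INFO string."""
    for field in info.split(";"):
        if field.startswith("DP4="):
            ref_plus, ref_minus, alt_plus, alt_minus = map(int, field[4:].split(","))
            total = ref_plus + ref_minus + alt_plus + alt_minus
            # Assumption: a site with zero total depth is reported as 0.0
            return (alt_plus + alt_minus) / total if total else 0.0
    raise ValueError("no DP4 entry in INFO column")


# INFO column of the chr1 record above, shortened to a few fields
print(mutation_level("DP=550;AC=2;AN=2;DP4=0,108,0,440;MQ=20"))
```

For the chr1 record this is 440 / 548, roughly 0.803.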

This is what I have so far, and I'm kind of stuck, not really knowing how to continue from this point:

def vcf(file):
    records = []
    with open(file, "r") as my_file:
        for line in my_file:
            # Skip the header and other comment lines
            if line.startswith("#"):
                continue
            # Split the line into columns
            columns = line.rstrip('\n').split('\t')
            # This is the INFO column
            info = columns[7].split(";")
            # Take the DP4 entry (second to last in INFO) and remove the "DP4=" prefix
            new_DP4 = info[-2].replace("DP4=", "")
            # Convert the four counts to int and put them under the categories
            xyz = new_DP4.split(",")
            ref_plus = int(xyz[0])
            ref_minus = int(xyz[1])
            alt_plus = int(xyz[2])
            alt_minus = int(xyz[3])
            # Calculate the mutation level for this line
            formula = (alt_minus + alt_plus) / (alt_minus + alt_plus + ref_minus + ref_plus)
            # Make a list of the chromosome, the bases and the mutation level
            chr_form = [columns[0], columns[3], columns[4], formula]
            records.append(chr_form)
    return records

Right now my goal is to find the mean for each one of the chromosomes (the average mutation level of chr3, for example).
The output I want to get in the end looks like this:

Output:
{1: {'T->C': 0.802}, 3: {'T->C': 0.446}}

I hope I was clear, and I'll appreciate any kind of help! Thanks!

ranbarr · Jun-01-2021, 10:03 AM

I tried a different approach, and this is what I have now:

def vcfToDict(file):
    # isVCF is a helper defined elsewhere (not shown here)
    if not isVCF(file):
        raise ValueError("Invalid VCF Format")
    chromosome = {}
    with open(file, "r") as my_file:
        for line in my_file:
            if line.startswith("#"):  # skip header and comment lines
                continue
            line = line.rstrip('\n').split('\t')
            # This is the INFO column
            info = line[7].split(";")
            # Take the DP4 entry (second to last in INFO) and remove the "DP4=" prefix
            DP4 = info[-2].replace("DP4=", "")
            # Convert the four counts to int and assign them to their categories
            ref_plus, ref_minus, alt_plus, alt_minus = map(int, DP4.split(','))
            # Calculate the mutation level for this line
            formula = (alt_minus + alt_plus) / (alt_minus + alt_plus + ref_minus + ref_plus)
            # Get the chromosome name from the first field
            chr_num = line[0].replace('chr', '')
            chromosome[chr_num] = {f'{line[3]}->{line[4]}': formula}

    return chromosome

The only thing I still don't know how to do is calculate the mean over all records of the same chromosome. My output:

Output:
{'1': {'T->C': 0.8029197080291971}, '3': {'T->C': 0.9932885906040269}, '5': {'A->G': 0.5772870662460567}, '6': {'A->G': 0.8934010152284264}, '7': {'T->C': 0.596078431372549}, '8': {'T->C': 0.6230769230769231}, '10': {'T->C': 0.4381625441696113}, '11': {'T->C': 0.463768115942029}, '12': {'A->G': 0.6264150943396226}, '15': {'A->G': 0.41358024691358025}, '17': {'A->G': 0.48336594911937375}, '18': {'A->G': 0.6528497409326425}, 'X': {'T->C': 0.7526041666666666}}

But this is the output I want to get:

Output:
{'1': {'T->C': 0.8, 'A->C': 0.8},
 '3': {'T->C': 0.76, 'G->C': 0.45},
 '5': {'A->G': 0.5},
 '6': {'A->G': 0.7},
 '7': {'A->G': 0.63, 'T->C': 0.6},
 '8': {'T->C': 0.62},
 '10': {'T->C': 0.62},
 '11': {'T->C': 0.46},
 '12': {'A->G': 0.63},
 '15': {'A->G': 0.41},
 '17': {'A->G': 0.48},
 '18': {'A->G': 0.65},
 'X': {'T->C': 0.65}}
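
The assignment `chromosome[chr_num] = {...}` replaces whatever was stored for that chromosome before, which is why only the last record of each chromosome shows up in the current output. One possible way to get the averages is to collect every mutation level per chromosome and per substitution first, and take the mean at the end. The following is a minimal sketch under some assumptions: the function name `mean_mutation_levels`, taking an iterable of lines instead of a filename, and rounding to two decimals are illustrative choices, and every record is assumed to have a `DP4=` entry with non-zero total depth.

```python
from collections import defaultdict
from statistics import mean


def mean_mutation_levels(lines, ndigits=2):
    """Average DP4 mutation levels per chromosome and REF->ALT substitution."""
    # levels[chromosome][substitution] -> list of mutation levels
    levels = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        # Find DP4 by name instead of relying on its position in INFO
        dp4 = next(f for f in fields[7].split(";") if f.startswith("DP4="))
        ref_plus, ref_minus, alt_plus, alt_minus = map(int, dp4[4:].split(","))
        level = (alt_plus + alt_minus) / (ref_plus + ref_minus + alt_plus + alt_minus)
        chrom = fields[0].replace("chr", "")
        # Append instead of assigning, so earlier records are kept
        levels[chrom][f"{fields[3]}->{fields[4]}"].append(level)
    return {
        chrom: {sub: round(mean(vals), ndigits) for sub, vals in subs.items()}
        for chrom, subs in levels.items()
    }


# The three records from the first post, with INFO shortened (illustrative data)
sample = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample.bam",
    "chr1\t143755378\t.\tT\tC\t62\t.\tDP=550;DP4=0,108,0,440;MQ=20\tGT:PL:DP\t1/1:89,179,0:548",
    "chr3\t57644487\t.\tT\tC\t16.4448\t.\tDP=300;DP4=0,166,0,134;MQ=20\tGT:PL:DP\t0/1:49,0,63:300",
    "chr3\t80706912\t.\tT\tC\t212\t.\tDP=298;DP4=1,1,256,40;MQ=20\tGT:PL:DP\t1/1:239,255,0:298",
]
print(mean_mutation_levels(sample))  # {'1': {'T->C': 0.8}, '3': {'T->C': 0.72}}
```

Because an open file is also an iterable of lines, the same function could be called as `mean_mutation_levels(my_file)` inside a `with open(...)` block.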
