Python Forum
Calculate mean only for match strings
#1
Hi,
I have an assignment in which I have a file that contains a lot of chromosomes, and for each one I need to calculate the mutation level.
The problem is that each chromosome can appear several times, and I need to find the mean of all the mutation levels for that chromosome.
The mutation level is calculated from DP4 in the INFO column, which contains four numbers representing [ref+,ref-,alt+,alt-].
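As a minimal sketch of that calculation for a single DP4 value: the formula (alt reads divided by all reads) is the one used in the code later in this post, while the function name `mutation_level` and the handling of a zero read count are assumptions made for illustration.

```python
def mutation_level(dp4):
    """Return (alt+ + alt-) / total reads for a DP4 string such as '0,108,0,440'."""
    ref_plus, ref_minus, alt_plus, alt_minus = map(int, dp4.split(","))
    total = ref_plus + ref_minus + alt_plus + alt_minus
    if total == 0:
        # Assumption: with no reads at all, the level is treated as undefined
        return None
    return (alt_plus + alt_minus) / total


# First sample row below: DP4=0,108,0,440, i.e. 440 alt reads out of 548
print(mutation_level("0,108,0,440"))
```

For the first sample row this comes to 440 / 548, roughly 0.80.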
Example of the file:
Output:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Aligned.sortedByCoord.out.bam
chr1 143755378 . T C 62 . DP=550;VDB=0;SGB=-0.693147;RPB=1.63509e-10;MQB=1;BQB=0.861856;MQ0F=0;AC=2;AN=2;DP4=0,108,0,440;MQ=20 GT:PL:DP 1/1:89,179,0:548
chr3 57644487 . T C 16.4448 . DP=300;VDB=0;SGB=-0.693147;RPB=0.993846;MQB=1;BQB=0.316525;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,166,0,134;MQ=20 GT:PL:DP 0/1:49,0,63:300
chr3 80706912 . T C 212 . DP=298;VDB=0;SGB=-0.693147;RPB=0.635135;MQB=1;MQSB=1;BQB=0.609797;MQ0F=0;AC=2;AN=2;DP4=1,1,256,40;MQ=20 GT:PL:DP 1/1:239,255,0:298
This is what I have done so far, and I'm kind of stuck, not really knowing how to continue from this point:
def vcf(file):
    results = []
    with open(file, "r") as my_file:
        for line in my_file:
            # First, skip the header lines
            if line.startswith("#"):
                continue
            # Then split the row into tab-separated columns
            columns = line.rstrip('\n').split('\t')
            # This is the INFO column
            row = columns[7].split(";")
            # DP4 is the second-to-last INFO entry in this file; remove the "DP4=" prefix
            new_DP4 = row[-2].replace("DP4=", "")
            # Convert the four counts to integers
            ref_plus, ref_minus, alt_plus, alt_minus = map(int, new_DP4.split(","))
            # Mutation level for this row: alt reads divided by all reads
            formula = (alt_minus + alt_plus) / (alt_minus + alt_plus + ref_minus + ref_plus)
            # Collect the chromosome, REF, ALT and the mutation level of this row
            chr_form = [columns[0], columns[3], columns[4], formula]
            results.append(chr_form)
    return results
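The code above picks DP4 out of the INFO column by position (second from the end), which only works while every row lists its INFO entries in the same order. A possible alternative, sketched here with a hypothetical helper named `parse_info`, is to split INFO into key/value pairs and look DP4 up by name. It assumes entries are separated by `;` and that entries without `=` are plain flags.

```python
def parse_info(info):
    """Turn 'DP=550;VDB=0;DP4=0,108,0,440' into a dict of key -> value strings."""
    fields = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Flag-style entries with no '=' are stored with an empty value
        fields[key] = value if sep else ""
    return fields


info = "DP=550;VDB=0;AC=2;AN=2;DP4=0,108,0,440;MQ=20"
print(parse_info(info)["DP4"])  # prints 0,108,0,440
```

With this, `parse_info(columns[7]).get("DP4")` returns `None` for a row that has no DP4 entry, rather than silently picking up a different field.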
Right now my goal is to find the mean for each of the chromosomes (the average for chr3, for example).
The output I want to get in the end looks like this:
Output:
{1: {'T->C': 0.802}, 3: {'T->C': 0.446}}
I hope that was clear, and I'd appreciate any kind of help! Thanks!
#2
I tried a different approach, and this is it:
def vcfToDict(file):
    # isVCF is a helper that is not shown in this post
    if not isVCF(file):
        raise ValueError("Invalid VCF Format")
    chromosome = {}
    with open(file, "r") as my_file:
        for line in my_file:
            if line.startswith("#"):  # skip header and comment lines
                continue
            line = line.rstrip('\n').split('\t')
            # This is the INFO column
            info = line[7].split(";")
            # DP4 is taken by position (second-to-last INFO entry); remove the "DP4=" prefix
            DP4 = info[-2].replace("DP4=", "")
            # Unpack the four read counts as integers
            ref_plus, ref_minus, alt_plus, alt_minus = map(int, DP4.split(','))
            # Mutation level for this row: alt reads divided by all reads
            formula = (alt_minus + alt_plus) / (alt_minus + alt_plus + ref_minus + ref_plus)
            # Get the chromosome name from the first field
            chr_num = line[0].replace('chr', '')
            # This assignment replaces any earlier entry for the same chromosome
            chromosome[chr_num] = {f'{line[3]}->{line[4]}': formula}

    return chromosome
The only thing I still don't know how to do is calculate the mean across all rows for the same chromosome. My output:
Output:
{'1': {'T->C': 0.8029197080291971}, '3': {'T->C': 0.9932885906040269}, '5': {'A->G': 0.5772870662460567}, '6': {'A->G': 0.8934010152284264}, '7': {'T->C': 0.596078431372549}, '8': {'T->C': 0.6230769230769231}, '10': {'T->C': 0.4381625441696113}, '11': {'T->C': 0.463768115942029}, '12': {'A->G': 0.6264150943396226}, '15': {'A->G': 0.41358024691358025}, '17': {'A->G': 0.48336594911937375}, '18': {'A->G': 0.6528497409326425}, 'X': {'T->C': 0.7526041666666666}}
But this is the output I want to get:
Output:
{'1': {'T->C': 0.8, 'A->C': 0.8}, '3': {'T->C': 0.76, 'G->C': 0.45}, '5': {'A->G': 0.5}, '6': {'A->G': 0.7}, '7': {'A->G': 0.63, 'T->C': 0.6}, '8': {'T->C': 0.62}, '10': {'T->C': 0.62}, '11': {'T->C': 0.46}, '12': {'A->G': 0.63}, '15': {'A->G': 0.41}, '17': {'A->G': 0.48}, '18': {'A->G': 0.65}, 'X': {'T->C': 0.65}}
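One possible way to get from per-row values to per-chromosome means is to collect every mutation level in a list keyed by chromosome and by REF->ALT change, and only average at the end. The sketch below is not tied to the file-reading code: it assumes the rows have already been reduced to (chromosome, REF, ALT, level) tuples, and the name `average_by_chromosome` is made up for illustration. The sample records are the three rows from the example file in post #1.

```python
from collections import defaultdict


def average_by_chromosome(records):
    """records: iterable of (chrom, ref, alt, level) tuples.

    Returns {chrom: {'REF->ALT': mean level}}.
    """
    # chromosome -> 'REF->ALT' -> list of mutation levels seen so far
    grouped = defaultdict(lambda: defaultdict(list))
    for chrom, ref, alt, level in records:
        grouped[chrom.replace("chr", "")][f"{ref}->{alt}"].append(level)
    # Average each list once all rows have been collected
    return {
        chrom: {change: sum(levels) / len(levels) for change, levels in changes.items()}
        for chrom, changes in grouped.items()
    }


# Levels computed from the DP4 values of the three sample rows
records = [
    ("chr1", "T", "C", 440 / 548),
    ("chr3", "T", "C", 134 / 300),
    ("chr3", "T", "C", 296 / 298),
]
result = average_by_chromosome(records)
print(result)
```

For these three rows, chr3 ends up with a single 'T->C' entry of roughly 0.72, the mean of its two rows. Inside the file loop, the same idea would mean appending `formula` to a list instead of assigning to `chromosome[chr_num]`, then averaging after the loop; wrapping each mean in `round(value, 2)` would give two-decimal values like those in the desired output.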