Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Computing GC Content
#1
I have been trying to find the solution to a programming problem which asks me to find the GC content. The example online displays:

Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output
Rosalind_0808
60.919540

My code runs properly and I already tried entering the solutions I find with the correct amount of decimal place values.

from collections import Counter
myDna = ''
with open('rosalind_gc.txt', 'r') as data:
     for line in data:
          if '>' in line:
               continue
          myDna += line.strip()
myNucleotideCounts = Counter(myDna)
myGC = (myNucleotideCounts['G'] + myNucleotideCounts['C']) / float(len(myDna))
print('Dna GC Content = {0}'.format(myGC)) 
Would someone be able to provide suggestions as to what might be the problem and how to fix this in the above code?
Reply
#2
I'm not familiar with collections, so I didn't use that module here.

Apparently, as I've been told here, creating a string myDNA = '' and then appending to it actually creates many strings, because a string is immutable, hangs around in memory. So it is better to use a list, then join the list!

Anyway, this may give you some ideas:

#! /usr/bin/python3
import random

# play God, make DNA, make a Python!
def makeDNA():
    myDNA = []
    nucleotides = ['A', 'C', 'G', 'T']
    length = input('How long do you want your DNA? Just enter a number. ')
    num = int(length)
    for i in range(0, num):
        choice = random.choice(nucleotides)
        myDNA.append(choice)
    mystring = ''.join(myDNA)
    return mystring

def countNucs(DNA_sample):
    countC = 0
    countG = 0
    for nuc in DNA_sample:    
        if nuc == 'C':
            countC +=1
        elif nuc == 'G':
                countG +=1
    print('Cytosine was found', countC, 'times in this sample of DNA.')
    print('Guanine was found', countG, 'times in this sample of DNA.')

if __name__ == "__main__":
    DNA_sample = makeDNA()
    countNucs(DNA_sample)
Reply
#3
(Jun-25-2023, 10:04 PM)uwl Wrote: which asks me to find the GC content

Is that the entire specification of the problem? It looks to me like it wants you to examine the GC content of different files and only print the filename and GC content in some situations (perhaps with a GC content greater than some threshold?)

Your program is calculating a single value for all lines in the file. It doesn't display the fiiename like the sample output does.
Reply
#4
It occurred to me this morning that you may want to count the number of times the sequence GC appears in the DNA sample.

def countGC(DNA):
    countGC = 0    
    for i in range(len(DNA)):    
        if DNA[i] == 'G' and DNA[i+1] == 'C':
            countGC +=1        
    print('The sequence GC was found', countGC, 'times in this sample of DNA.')
For a DNA sample 501 long that gave:

Quote:countGC(DNA_sample)
The sequence GC was found 31 times in this sample of DNA.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Matlab to Python -- Parallel Computing zistambo 1 1,979 Jun-10-2020, 04:59 PM
Last Post: pyzyx3qwerty
  Computing correlation in audio files ryanblumenow 0 2,779 Jan-15-2020, 06:11 PM
Last Post: ryanblumenow
  computing entropy using pickle files baran01 2 2,442 Dec-30-2019, 09:45 PM
Last Post: micseydel

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020