Dec-18-2020, 05:42 PM
(This post was last modified: Dec-19-2020, 07:40 AM by marcel2807.)
Hello everyone,
It's my first time on a forum but also my last hope in not failing my programming class. I've tried ours but no results. Here you can find what's my project about. I did the point 1,3,4,5 and I have a big issue with the numbers 2,6 and 7. Here is what I have to do:
Reconstructing evolution of a virus is critical to understand how its genome mutates
over time or during a pandemic as well as how the relative infection rate and clinical
presentation may vary in function of these mutations. For example, in the context of
COVID19 pandemic, the reconstruction of SARS-COV2 evolution can be found at
https://nextstrain.org/ncov/global.
We may encode the genome of a virus as a string of length L over the alphabet
Σ={A,C,T,G}. For example, the following string is the incipit of the Zaire Ebola virus:
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
whereas the following string is the incipit of Crimean-Congo hemorrhagic fever virus
TCTCAAAGAAACACGTGCCGCTTACGCCCACAGTGTTCTCTTGAGTGTTAGCAGAATGGAAAACAAGATC.
This project is about designing and developing algorithms to manipulate a set of viral
genomes and to reconstruct their evolution. In particular, you have to write a python
program that:
1. Reads from the provided file the list of viral genomes; (1pt)
2. For each pair of genomes, lists all of the longest common patterns of length at least
5. (1pt)
A common pattern is defined as a substring that is shared by a pair of virus. In the
example above, Ebola and Crimean-Congo hemorrhagic fever virus share the
substring AAAGAAA, which has length 7.
3. Computes the distance matrix between genome pairwise by using the algorithm
shown in Figure 1; (1pt)
4. Outputs the minimum spanning tree induced by the above distance matrix, in the
form of a list of edges, e.g.,
edge 1 edge 2
edge 3 edge 4
…
edge n-2 edge n-1
(1,5pt)
5. Outputs (as in point 4) the shortest path tree induced by the above matrix; (0.5pt)
7. Finds the unrooted binary tree T which minimizes the following formula
where lij is the length of the path in T between the genome i and the genome j. Is T
topologically equivalent to the tree computed in point 6? (4pt)
Suggestions:
1. If n as the number of input sequences, then the number of internal vertices of
an unrooted binary tree is 2n-2.
2. The number of binary trees grows according to the law (2n-5)!!=1x3x5x7x…
(2n-5) where n=|G|.
3. The unrooted binary trees can be enumerated by starting from a star-tree
encoded e.g., as
T = { (1,n+1), (2,n+1), (3,n+1) }
and by inserting recursively each of the remaining genomes in G in one of the
edges of T. For example Suppose that G={1,2,3,4}. Then, the three possible
unrooted binary trees are:
i. T = { (2,n+1), (3,n+1), (4,n+2), (1,n+2), (n+1,n+2) }
ii. T = { (1,n+1), (3,n+1), (4,n+2), (2,n+2), (n+1,n+2) }
iii. T = { (1,n+1), (2,n+1), (4,n+2), (3,n+2), (n+1,n+2) }
As you can see, the insertion operation selects one vertex in G, say k, selects
one edge of T, say (u,v), removes (u,v) from T, and appends three new edges
in T, namely
iv. the edge (k, new internal vertex)
v. the edge (u, new internal vertex)
vi. the edge (v, new internal vertex)
4. Use recursion to compute {lij} matrix, then proceed with the computation of
the above formula.
It's my first time on a forum but also my last hope in not failing my programming class. I've tried ours but no results. Here you can find what's my project about. I did the point 1,3,4,5 and I have a big issue with the numbers 2,6 and 7. Here is what I have to do:
Reconstructing evolution of a virus is critical to understand how its genome mutates
over time or during a pandemic as well as how the relative infection rate and clinical
presentation may vary in function of these mutations. For example, in the context of
COVID19 pandemic, the reconstruction of SARS-COV2 evolution can be found at
https://nextstrain.org/ncov/global.
We may encode the genome of a virus as a string of length L over the alphabet
Σ={A,C,T,G}. For example, the following string is the incipit of the Zaire Ebola virus:
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
whereas the following string is the incipit of Crimean-Congo hemorrhagic fever virus
TCTCAAAGAAACACGTGCCGCTTACGCCCACAGTGTTCTCTTGAGTGTTAGCAGAATGGAAAACAAGATC.
This project is about designing and developing algorithms to manipulate a set of viral
genomes and to reconstruct their evolution. In particular, you have to write a python
program that:
1. Reads from the provided file the list of viral genomes; (1pt)
2. For each pair of genomes, lists all of the longest common patterns of length at least
5. (1pt)
A common pattern is defined as a substring that is shared by a pair of virus. In the
example above, Ebola and Crimean-Congo hemorrhagic fever virus share the
substring AAAGAAA, which has length 7.
3. Computes the distance matrix between genome pairwise by using the algorithm
shown in Figure 1; (1pt)
4. Outputs the minimum spanning tree induced by the above distance matrix, in the
form of a list of edges, e.g.,
edge 1 edge 2
edge 3 edge 4
…
edge n-2 edge n-1
(1,5pt)
5. Outputs (as in point 4) the shortest path tree induced by the above matrix; (0.5pt)
7. Finds the unrooted binary tree T which minimizes the following formula
where lij is the length of the path in T between the genome i and the genome j. Is T
topologically equivalent to the tree computed in point 6? (4pt)
Suggestions:
1. If n as the number of input sequences, then the number of internal vertices of
an unrooted binary tree is 2n-2.
2. The number of binary trees grows according to the law (2n-5)!!=1x3x5x7x…
(2n-5) where n=|G|.
3. The unrooted binary trees can be enumerated by starting from a star-tree
encoded e.g., as
T = { (1,n+1), (2,n+1), (3,n+1) }
and by inserting recursively each of the remaining genomes in G in one of the
edges of T. For example Suppose that G={1,2,3,4}. Then, the three possible
unrooted binary trees are:
i. T = { (2,n+1), (3,n+1), (4,n+2), (1,n+2), (n+1,n+2) }
ii. T = { (1,n+1), (3,n+1), (4,n+2), (2,n+2), (n+1,n+2) }
iii. T = { (1,n+1), (2,n+1), (4,n+2), (3,n+2), (n+1,n+2) }
As you can see, the insertion operation selects one vertex in G, say k, selects
one edge of T, say (u,v), removes (u,v) from T, and appends three new edges
in T, namely
iv. the edge (k, new internal vertex)
v. the edge (u, new internal vertex)
vi. the edge (v, new internal vertex)
4. Use recursion to compute {lij} matrix, then proceed with the computation of
the above formula.