Python Forum

So I have an assignment where a dictionary has sequence IDs as keys, and each value is a list of sequences of a certain length associated with that ID. What I need to do is go through the sequences in each list (which is stored as the value) and sort out all those that do not have a certain character as the nth character.

Then I need to get 500 random pairs of the sequences. It is important that a sequence is not paired with another sequence with the same sequence ID (key).

I have never worked with this amount of data before, and I am very unsure how to go about this.

I probably should look at object-oriented programming, but I have no idea where to start. Can anyone give me any direction?

Thanks in advance!

"What I need to do is to go through the sequences on the list (which is stored as the value), sort all those away that do not have a certain character as the nth character."

The easiest way is to start with a dummy dictionary and test different approaches. It could be a subset of the real data or just:

>>> dummy = {1: ['a', 'b', 'c'], 2: ['a', 'c', 'b'], 3: ['c', 'b', 'a']}

Now you can try to filter out the keys which don't have 'b' at index 1 of their values. The most used dictionary methods are:

>>> dummy.keys()
dict_keys([1, 2, 3])
>>> dummy.values()
dict_values([['a', 'b', 'c'], ['a', 'c', 'b'], ['c', 'b', 'a']])
>>> dummy.items()
dict_items([(1, ['a', 'b', 'c']), (2, ['a', 'c', 'b']), (3, ['c', 'b', 'a'])])
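As a minimal sketch of that filtering step on the dummy data (the name `filtered` is only illustrative), a dictionary comprehension over `dummy.items()` could look like this:

```python
dummy = {1: ['a', 'b', 'c'], 2: ['a', 'c', 'b'], 3: ['c', 'b', 'a']}

# Keep only the key-value pairs whose list has 'b' at index 1.
filtered = {key: value for key, value in dummy.items() if value[1] == 'b'}

print(filtered)  # {1: ['a', 'b', 'c'], 3: ['c', 'b', 'a']}
```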

"it is important that a sequences is not paired with another sequence with the same sequence id (key)"

Maybe you could elaborate on that. Dictionary keys are unique, so a situation with identical keys cannot happen.

Thank you, I will try that approach.

What I mean by not having the same sequence ID is that I need to make sure that, in a random sequence pair, the two sequences are not from the same list, if that makes sense. The purpose is to compare sequences that belong to different sequence IDs.

Edit: However, I don't want to remove the key, just those sequences in the list that don't fulfill the criteria, and keep the other sequences associated with that key. I am sorry if that wasn't clear.

It is easier to comprehend if you provide some real examples of key-value pairs and what the expected outcome should look like.

Is it something like this:

>>> dummy = {1: ['aba', 'baa', 'cba'], 2: ['aaa', 'ccc', 'cbc']}

And the task is to find the strings in the lists that have 'b' at index 1 and make pairs of these strings without having two strings from the same list as a pair?

So I have a dictionary that looks something like this:

Output:
{'1arcH': ['DKATIPSESPF', 'KATIPSESPFA', 'ATIPSESPFAA', 'TIPSESPFAAA'], '1aabH': ['GVSGSCNIDVV', 'VSGSCNIDVVC', 'SGSCNIDVVCP', 'GSCNIDVVCPE'], '1cevH': ['DISSTEIAVYW', 'ISSTEIAVYWG', 'SSTEIAVYWGQ', 'STEIAVYWGQR', 'TEIAVYWGQRE', 'EIAVYWGQRED']}

For each of the values I only want the sequences whose central character is e.g. 'A', 'P' or 'V', so I get an updated dictionary that looks like this:

Output:
{'1arcH': ['DKATIPSESPF'], '1aabH': ['GSCNIDVVCPE'], '1cevH': ['SSTEIAVYWGQ', 'STEIAVYWGQR']}

When that is done, I want to be able to randomly pair 'DKATIPSESPF' with a sequence from another list. At that point the sequence ID is not important anymore, so I imagine that this could be stored in a list or an array.

Thank you again!

For 1arcH, why wouldn't KATIPSESPFA match? It contains both an A and a P.

Decomposed into spoken language, the task can be represented with quite simple steps (it's more or less brute force, but according to Donald Knuth premature optimisation is the root of almost all evil in programming):

- filter values
- create all pair combinations
- remove pair combinations created from values of one key

For trying out ideas, the Python interactive interpreter (or Jupyter) is always your best friend. So let's try to express the 'easy' spoken-language steps in Python. The following is one uninterrupted session (thus the use of _, which holds the last output).

Filter values

We have a dictionary, and we have its values (d.values()). We need filtered values (I assume that "the central character to be eg 'A', 'P' or 'V'" means that index 5 must hold one of those letters). One obvious way is to use the built-in function filter, but it's out of fashion nowadays, so we use a list comprehension: we loop through the value (a list) of every key and for every key create a list of the elements which meet the criteria.

In [1]: d = {'1arcH': ['DKATIPSESPF', 'KATIPSESPFA', 'ATIPSESPFAA', 'TIPSESPFAAA'], 
   ...:      '1aabH': ['GVSGSCNIDVV', 'VSGSCNIDVVC', 'SGSCNIDVVCP', 'GSCNIDVVCPE'], 
   ...:      '1cevH': ['DISSTEIAVYW', 'ISSTEIAVYWG', 'SSTEIAVYWGQ', 'STEIAVYWGQR', 'TEIAVYWGQRE', 'EIAVYWGQRED']}               

In [2]: [[sequence for sequence in value if sequence[5] in ['A', 'P', 'V']] for value in d.values()]                            
Out[2]: [['DKATIPSESPF'], [], ['SSTEIAVYWGQ', 'STEIAVYWGQR']]

We now have the filtered values (note that we get empty lists for values where no matches were found).
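Since the question asks to keep the keys and only drop the sequences that don't match, a dictionary comprehension is one possible variant. This is only a sketch: the names `wanted` and `filtered` are made up here, and taking the central character as index `len(seq) // 2` is an assumption (it gives index 5 for these 11-character sequences).

```python
d = {'1arcH': ['DKATIPSESPF', 'KATIPSESPFA', 'ATIPSESPFAA', 'TIPSESPFAAA'],
     '1aabH': ['GVSGSCNIDVV', 'VSGSCNIDVVC', 'SGSCNIDVVCP', 'GSCNIDVVCPE'],
     '1cevH': ['DISSTEIAVYW', 'ISSTEIAVYWG', 'SSTEIAVYWGQ', 'STEIAVYWGQR',
               'TEIAVYWGQRE', 'EIAVYWGQRED']}

wanted = {'A', 'P', 'V'}  # allowed central characters

# Same filter as above, but the result stays a dictionary keyed by sequence ID.
filtered = {key: [seq for seq in seqs if seq[len(seq) // 2] in wanted]
            for key, seqs in d.items()}

print(filtered)
# {'1arcH': ['DKATIPSESPF'], '1aabH': [], '1cevH': ['SSTEIAVYWGQ', 'STEIAVYWGQR']}
```

Keys with no matching sequences keep an empty list; they can be dropped afterwards if they are not needed.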

Create all pair combinations

One way is to try to construct all pairs ourselves. However, the standard library module itertools provides tools for efficient iteration which we can take advantage of. Specifically, there are the chain and combinations functions.

With chain we can chain all values into one iterable and with combinations we can get all pairs from that iterable.

In [4]: from itertools import chain, combinations                                                                               

In [3]: list(chain(*_))                                                                                                         
Out[3]: ['DKATIPSESPF', 'SSTEIAVYWGQ', 'STEIAVYWGQR']    # all filtered values

In [4]: list(combinations(_, 2))                                                                                                
Out[4]: 
[('DKATIPSESPF', 'SSTEIAVYWGQ'),                         # all pairs of filtered values
 ('DKATIPSESPF', 'STEIAVYWGQR'),
 ('SSTEIAVYWGQ', 'STEIAVYWGQR')]

Now we have all pair combinations.
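Outside an interactive session the same step might be written with explicit names instead of _. The sketch below uses `chain.from_iterable`, which is equivalent to `chain(*...)` here, and `math.comb` (assuming Python 3.8 or newer) to show how many pairs to expect. The count grows roughly with the square of the number of sequences, which matters for large data.

```python
from itertools import chain, combinations
from math import comb

filtered_values = [['DKATIPSESPF'], [], ['SSTEIAVYWGQ', 'STEIAVYWGQR']]

# Flatten the per-key lists into a single list of sequences.
flat = list(chain.from_iterable(filtered_values))

# Every unordered pair of the flattened sequences.
pairs = list(combinations(flat, 2))

# n sequences give n * (n - 1) / 2 pairs.
print(len(flat), len(pairs), comb(len(flat), 2))  # 3 3 3
```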

Remove pair combinations created from values of one key

We need to know which pairs are generated from elements in the value list of a single key. We can use the same technique as earlier: filter and create combinations, but this time only from values inside each list. We will chain these lists right away:

In [5]: [[sequence for sequence in value if sequence[5] in ['A', 'P', 'V']] for value in d.values()]                            
Out[5]: [['DKATIPSESPF'], [], ['SSTEIAVYWGQ', 'STEIAVYWGQR']]

In [6]: list(chain(*(combinations(el, 2) for el in _)))                                                                         
Out[6]: [('SSTEIAVYWGQ', 'STEIAVYWGQR')]

Now we have the list of pairs we want to filter out from all pairs. There are several techniques to do so, and the choice depends on the parameters of the task at hand. For example, can there be repeated pairs, or are all pairs unique? Do we want to retain repetitions or not?

If we have or want only unique values, we can use the built-in set type (which supports the .difference method). If there are duplicates and we want to keep them, we can use a list comprehension (or filter). There is, however, one important thing: chain and combinations return iterators, which can be consumed only once. We must be extra careful not to consume them several times (or we must convert the iterators to lists/tuples to use them multiple times).

In [7]: all_combinations = [('DKATIPSESPF', 'SSTEIAVYWGQ'), 
   ...:                     ('DKATIPSESPF', 'STEIAVYWGQR'), 
   ...:                     ('SSTEIAVYWGQ', 'STEIAVYWGQR')]                                                                     

In [8]: inner_combinations = [('SSTEIAVYWGQ', 'STEIAVYWGQR')]                                                                  

In [9]: set(all_combinations).difference(inner_combinations)                    # only unique pairs not in inner pairs                                                
Out[9]: {('DKATIPSESPF', 'SSTEIAVYWGQ'), ('DKATIPSESPF', 'STEIAVYWGQR')}

In [10]: [pair for pair in all_combinations if pair not in inner_combinations]  # all pairs which are not in inner pairs                                                 
Out[10]: [('DKATIPSESPF', 'SSTEIAVYWGQ'), ('DKATIPSESPF', 'STEIAVYWGQR')]
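The warning about iterators being consumed only once can be seen in a tiny sketch (the letters are placeholder data):

```python
from itertools import combinations

pairs = combinations(['x', 'y', 'z'], 2)  # an iterator, not a list

first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing left to consume

print(first_pass)   # [('x', 'y'), ('x', 'z'), ('y', 'z')]
print(second_pass)  # []
```

This is why a list comprehension with `pair not in inner_combinations` needs `inner_combinations` to be a list or tuple rather than a bare iterator.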

Now we have a set or list of pairs, and we can use a function from the random module to get the required sample.
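One possible way to draw the sample is `random.sample`, sketched below. The name `n_wanted` is made up for illustration, and capping the sample size at the number of available pairs is an assumption about what should happen when fewer than 500 pairs exist. Note that `random.sample` expects a sequence, so a set of pairs has to be converted to a list first (passing a set raises an error in recent Python versions).

```python
import random

# Cross-key pairs from the steps above, as a list.
pairs = [('DKATIPSESPF', 'SSTEIAVYWGQ'), ('DKATIPSESPF', 'STEIAVYWGQR')]

n_wanted = 500  # number of random pairs asked for in the question

# Sample without replacement; never ask for more pairs than exist.
sample = random.sample(pairs, min(n_wanted, len(pairs)))

print(len(sample))  # 2 for this small example
```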

The full code is quite short (as is the task in spoken language; this is the beauty of Python):

from itertools import chain, combinations

d = {'1arcH': ['DKATIPSESPF', 'KATIPSESPFA', 'ATIPSESPFAA', 'TIPSESPFAAA'],
     '1aabH': ['GVSGSCNIDVV', 'VSGSCNIDVVC', 'SGSCNIDVVCP', 'GSCNIDVVCPE'],
     '1cevH': ['DISSTEIAVYW', 'ISSTEIAVYWG', 'SSTEIAVYWGQ', 'STEIAVYWGQR', 'TEIAVYWGQRE', 'EIAVYWGQRED']}

# keep only sequences with 'A', 'P' or 'V' at index 5 (one list per key)
values = [[sequence for sequence in value if sequence[5] in ['A', 'P', 'V']] for value in d.values()]

# all pairs of filtered sequences, regardless of key
flat = chain(*values)
all_combinations = combinations(flat, 2)
# pairs whose two sequences come from the same key
inner_combinations = chain(*(combinations(el, 2) for el in values))
# pairs whose two sequences come from different keys
print(set(all_combinations).difference(inner_combinations))

# output (the order of elements in a set may vary)
{('DKATIPSESPF', 'SSTEIAVYWGQ'), ('DKATIPSESPF', 'STEIAVYWGQR')}
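A possible variation, sketched here rather than taken from the thread, builds only the cross-key pairs directly with `itertools.product` instead of building all pairs and subtracting the inner ones. It also behaves sensibly if the same sequence string happens to appear under two different keys. The helper name `cross_key_pairs` is hypothetical, and index 5 is the same assumption about the central character as above.

```python
import random
from itertools import combinations, product

def cross_key_pairs(groups):
    """Yield (seq_a, seq_b) where the two sequences come from different groups."""
    for seqs_a, seqs_b in combinations(groups, 2):
        # product pairs every sequence of one group with every sequence of the other
        yield from product(seqs_a, seqs_b)

d = {'1arcH': ['DKATIPSESPF', 'KATIPSESPFA', 'ATIPSESPFAA', 'TIPSESPFAAA'],
     '1aabH': ['GVSGSCNIDVV', 'VSGSCNIDVVC', 'SGSCNIDVVCP', 'GSCNIDVVCPE'],
     '1cevH': ['DISSTEIAVYW', 'ISSTEIAVYWG', 'SSTEIAVYWGQ', 'STEIAVYWGQR',
               'TEIAVYWGQRE', 'EIAVYWGQRED']}

# One filtered list per key: central character (index 5) must be A, P or V.
groups = [[seq for seq in seqs if seq[5] in 'APV'] for seqs in d.values()]

pairs = list(cross_key_pairs(groups))
# Up to 500 random pairs, fewer if not that many exist.
random_pairs = random.sample(pairs, min(500, len(pairs)))

print(pairs)
# [('DKATIPSESPF', 'SSTEIAVYWGQ'), ('DKATIPSESPF', 'STEIAVYWGQR')]
```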

Posters, in thread order: Pippi, perfringo, Pippi, perfringo, Pippi, nilamo, perfringo