Broken down into plain language, the task can be expressed in a few quite simple steps (it's more or less brute force, but according to Donald Knuth premature optimisation is the root of all evil, or at least most of it, in programming):
- filter values
- create all pair combinations
- remove pair combinations created from values of one key
For trying out ideas, the Python interactive interpreter (or Jupyter) is always your best friend. So let's try to express these 'easy' plain-language steps in Python. What follows is one uninterrupted session (hence the use of _, which holds the most recent output).
Filter values
We have a dictionary, so we have its values (d.values()). We need filtered values (I assume that "the central character to be eg 'A', 'P' or 'V'" means that at index 5 there must be one of those letters). One obvious way is to use the built-in function filter, but it's out of fashion nowadays, so we use a list comprehension: we loop through the value (a list) of every key and, for every key, create a list of the elements which meet the criterion.
In [1]: d = {'1arcH': ['DKATIPSESPF', 'KATIPSESPFA', 'ATIPSESPFAA', 'TIPSESPFAAA'],
...: '1aabH': ['GVSGSCNIDVV', 'VSGSCNIDVVC', 'SGSCNIDVVCP', 'GSCNIDVVCPE'],
...: '1cevH': ['DISSTEIAVYW', 'ISSTEIAVYWG', 'SSTEIAVYWGQ', 'STEIAVYWGQR', 'TEIAVYWGQRE', 'EIAVYWGQRED']}
In [2]: [[sequence for sequence in value if sequence[5] in ['A', 'P', 'V']] for value in d.values()]
Out[2]: [['DKATIPSESPF'], [], ['SSTEIAVYWGQ', 'STEIAVYWGQR']]
We now have the filtered values (note that we get an empty list for every key where no matches were found).
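For comparison, here is a minimal sketch of the same step written with the built-in filter mentioned above. The helper name has_allowed_centre and the set allowed are made up for illustration; the criterion (one of 'A', 'P', 'V' at index 5) is the same assumption as in the session.

```python
d = {'1arcH': ['DKATIPSESPF', 'KATIPSESPFA', 'ATIPSESPFAA', 'TIPSESPFAAA'],
     '1aabH': ['GVSGSCNIDVV', 'VSGSCNIDVVC', 'SGSCNIDVVCP', 'GSCNIDVVCPE'],
     '1cevH': ['DISSTEIAVYW', 'ISSTEIAVYWG', 'SSTEIAVYWGQ', 'STEIAVYWGQR',
               'TEIAVYWGQRE', 'EIAVYWGQRED']}

allowed = {'A', 'P', 'V'}  # hypothetical name for the accepted central letters


def has_allowed_centre(sequence):
    # Index 5 is the middle position of an 11-character sequence.
    return sequence[5] in allowed


# filter returns a lazy iterator, so wrap it in list to get one list per key.
filtered = [list(filter(has_allowed_centre, value)) for value in d.values()]
print(filtered)
# [['DKATIPSESPF'], [], ['SSTEIAVYWGQ', 'STEIAVYWGQR']]
```

Both versions give the same result; which one reads better is mostly a matter of taste.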
Create all pair combinations
One way is to try to construct all pairs ourselves. However, there is the built-in module itertools for efficient iteration, which we can take advantage of. Specifically, it offers the chain and combinations functions.
With chain we can chain all values into one iterable, and with combinations we can get all pairs from that iterable.
In [3]: from itertools import chain, combinations
...: list(chain(*_))
Out[3]: ['DKATIPSESPF', 'SSTEIAVYWGQ', 'STEIAVYWGQR'] # all filtered values
In [4]: list(combinations(_, 2))
Out[4]:
[('DKATIPSESPF', 'SSTEIAVYWGQ'), # all pairs of filtered values
('DKATIPSESPF', 'STEIAVYWGQR'),
('SSTEIAVYWGQ', 'STEIAVYWGQR')]
Now we have all pair combinations.
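As a quick sanity check, the number of pairs should be n * (n - 1) / 2 for n flattened values. The sketch below assumes Python 3.8+ (for math.comb) and uses chain.from_iterable, which does the same job as chain(*...) without unpacking the outer list; the variable names are illustrative only.

```python
from itertools import chain, combinations
from math import comb

# Filtered values as produced in the filtering step.
filtered = [['DKATIPSESPF'], [], ['SSTEIAVYWGQ', 'STEIAVYWGQR']]

# chain.from_iterable(x) yields the same items as chain(*x).
flat = list(chain.from_iterable(filtered))
pairs = list(combinations(flat, 2))

# combinations(flat, 2) yields exactly "n choose 2" pairs.
assert len(pairs) == comb(len(flat), 2)
print(len(flat), len(pairs))
# 3 3
```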
Remove pair combinations created from values of one key
We need to know which pairs are generated from elements in the value list of a single key. We can use the same technique as earlier: filter and create combinations, but this time only from the values inside each list. We will chain these lists right away:
In [5]: [[sequence for sequence in value if sequence[5] in ['A', 'P', 'V']] for value in d.values()]
Out[5]: [['DKATIPSESPF'], [], ['SSTEIAVYWGQ', 'STEIAVYWGQR']]
In [6]: list(chain(*(combinations(el, 2) for el in _)))
Out[6]: [('SSTEIAVYWGQ', 'STEIAVYWGQR')]
Now we have the list of pairs we want to filter out from all pairs. There are several techniques to do so, and the choice depends on the parameters of the task at hand. For example, can there be repeated pairs, or are all pairs unique? Do we want to retain repetitions or not?
If we have or want only unique values, then we can utilize the built-in set type (which supports the .difference method). If there are duplicates and we want to keep them, then we can use a list comprehension (or filter). There is, however, one important thing: chain and combinations return iterators, which can be consumed only once. We must be extra careful not to consume them several times (or we must convert the iterators to lists/tuples to use them multiple times).
In [7]: all_combinations = [('DKATIPSESPF', 'SSTEIAVYWGQ'),
...: ('DKATIPSESPF', 'STEIAVYWGQR'),
...: ('SSTEIAVYWGQ', 'STEIAVYWGQR')]
In [8]: inner_combinations = [('SSTEIAVYWGQ', 'STEIAVYWGQR')]
In [9]: set(all_combinations).difference(inner_combinations) # only unique pairs not in inner pairs
Out[9]: {('DKATIPSESPF', 'SSTEIAVYWGQ'), ('DKATIPSESPF', 'STEIAVYWGQR')}
In [10]: [pair for pair in all_combinations if pair not in inner_combinations] # all pairs which are not in inner pairs
Out[10]: [('DKATIPSESPF', 'SSTEIAVYWGQ'), ('DKATIPSESPF', 'STEIAVYWGQR')]
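The warning about iterators is easy to underestimate, so here is a small sketch with toy data (the letters 'a', 'b', 'c' are placeholders, not part of the task). It shows that a second pass over an iterator yields nothing, and that a membership test against an iterator silently gives wrong results in a list comprehension.

```python
from itertools import combinations

pairs = combinations(['a', 'b', 'c'], 2)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left the second time
print(first_pass)   # [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(second_pass)  # []

# "not in" on an iterator consumes it too: the first test exhausts it,
# so every later test sees an empty iterator and nothing is removed.
inner_iter = iter([('b', 'c')])
wrong = [pair for pair in first_pass if pair not in inner_iter]
print(wrong)        # [('a', 'b'), ('a', 'c'), ('b', 'c')]

# Materialising the iterator as a list first gives the intended result.
inner_list = [('b', 'c')]
right = [pair for pair in first_pass if pair not in inner_list]
print(right)        # [('a', 'b'), ('a', 'c')]
```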
Now we have a set or list of pairs, and we can use a function from the random module to get the required sample.
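A minimal sketch of the sampling step, assuming random.sample is the function wanted. The sample size k is a made-up value, since the required size is not stated here. Note that since Python 3.11 random.sample accepts only sequences, so a set has to be converted first.

```python
import random

pairs = {('DKATIPSESPF', 'SSTEIAVYWGQ'), ('DKATIPSESPF', 'STEIAVYWGQR')}

# random.sample needs a sequence; sorting also makes the order deterministic.
candidates = sorted(pairs)

k = 1                        # hypothetical sample size
k = min(k, len(candidates))  # sample raises ValueError if k > population size

sample = random.sample(candidates, k)
print(sample)  # one randomly chosen pair, e.g. [('DKATIPSESPF', 'SSTEIAVYWGQ')]
```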
The full code is quite short (about as short as the task description in plain language; this is the beauty of Python):
from itertools import chain, combinations
d = {'1arcH': ['DKATIPSESPF', 'KATIPSESPFA', 'ATIPSESPFAA', 'TIPSESPFAAA'],
'1aabH': ['GVSGSCNIDVV', 'VSGSCNIDVVC', 'SGSCNIDVVCP', 'GSCNIDVVCPE'],
'1cevH': ['DISSTEIAVYW', 'ISSTEIAVYWG', 'SSTEIAVYWGQ', 'STEIAVYWGQR', 'TEIAVYWGQRE', 'EIAVYWGQRED']}
values = [[sequence for sequence in value if sequence[5] in ['A', 'P', 'V']] for value in d.values()]
flat = chain(*values)
all_combinations = combinations(flat, 2)
inner_combinations = chain(*(combinations(el, 2) for el in values))
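# each iterator is consumed exactly once here, so no conversion to list is needed
result = set(all_combinations).difference(inner_combinations)
print(result)
# output (set order may vary)
{('DKATIPSESPF', 'SSTEIAVYWGQ'), ('DKATIPSESPF', 'STEIAVYWGQR')}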
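One caveat of removing inner pairs by value: if the same sequence ever appeared under two different keys, a legitimate cross-key pair could be removed as well. A possible alternative, sketched below and not part of the approach above, is to tag each sequence with its key and keep only pairs whose keys differ. The names allowed, tagged and cross_pairs are illustrative.

```python
from itertools import combinations

d = {'1arcH': ['DKATIPSESPF', 'KATIPSESPFA', 'ATIPSESPFAA', 'TIPSESPFAAA'],
     '1aabH': ['GVSGSCNIDVV', 'VSGSCNIDVVC', 'SGSCNIDVVCP', 'GSCNIDVVCPE'],
     '1cevH': ['DISSTEIAVYW', 'ISSTEIAVYWG', 'SSTEIAVYWGQ', 'STEIAVYWGQR',
               'TEIAVYWGQRE', 'EIAVYWGQRED']}

allowed = {'A', 'P', 'V'}

# Remember which key every filtered sequence came from.
tagged = [(key, sequence)
          for key, sequences in d.items()
          for sequence in sequences
          if sequence[5] in allowed]

# Keep a pair only when its two sequences come from different keys.
cross_pairs = [(seq1, seq2)
               for (key1, seq1), (key2, seq2) in combinations(tagged, 2)
               if key1 != key2]
print(cross_pairs)
# [('DKATIPSESPF', 'SSTEIAVYWGQ'), ('DKATIPSESPF', 'STEIAVYWGQR')]
```

This does the filtering and the removal of same-key pairs in a single pass, and it keeps duplicates and the original order, at the cost of carrying the keys along.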