Apr-15-2018, 10:05 AM
(This post was last modified: Apr-15-2018, 03:41 PM by Gribouillis.)
I think you can speed things up by generating the individual matching lines first. The following code prints a record for each word that appears in all five columns. Each record contains five lists, one per column, giving all the indexes where the word appears in that column of sa_data:
from collections import defaultdict, namedtuple

def word_idx_dict(seq):
    # map each word to the list of indexes where it occurs in seq
    di = defaultdict(list)
    for i, k in enumerate(seq):
        di[k].append(i)
    return di

def line_matches(sa_data):
    # one word -> indexes dictionary per column
    dics = [word_idx_dict(seq) for seq in sa_data]
    # get words appearing in all the columns
    s = set(dics[0])
    for d in dics[1:]:
        s &= set(d)
    words = sorted(s)
    Record = namedtuple('Record', 'word rownums')
    for w in words:
        yield Record(word=w, rownums=[d[w] for d in dics])

for rec in line_matches(sa_data):
    print(rec)

I think this sequence of records is fast to generate, and it should be a better starting point than the raw sa_data array.
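To make the shape of the records concrete, here is a small self-contained run. The sample sa_data below is made up (three short columns rather than five), and the last step, which expands a record into row combinations with itertools.product, is only my guess at how the records might be used afterwards, not something the original problem requires.

```python
from collections import defaultdict, namedtuple
from itertools import product

Record = namedtuple('Record', 'word rownums')

def word_idx_dict(seq):
    # map each word to the list of indexes where it occurs in seq
    di = defaultdict(list)
    for i, k in enumerate(seq):
        di[k].append(i)
    return di

def line_matches(sa_data):
    dics = [word_idx_dict(seq) for seq in sa_data]
    # words appearing in every column (iterating a dict yields its keys)
    common = set(dics[0]).intersection(*dics[1:])
    for w in sorted(common):
        yield Record(word=w, rownums=[d[w] for d in dics])

# Hypothetical sample data: 3 columns of 4 words each
sa_data = [
    ['a', 'b', 'a', 'c'],
    ['b', 'a', 'd', 'a'],
    ['a', 'a', 'b', 'e'],
]

records = list(line_matches(sa_data))
for rec in records:
    print(rec)
# Record(word='a', rownums=[[0, 2], [1, 3], [0, 1]])
# Record(word='b', rownums=[[1], [0], [2]])

# Assumed follow-up step: every way of picking one row per column for 'b'
b_combos = list(product(*records[1].rownums))
print(b_combos)
# [(1, 0, 2)]
```

Only 'a' and 'b' occur in all three columns, so 'c', 'd' and 'e' produce no record. The number of combinations for a word is the product of the lengths of its lists, so it can grow quickly for frequent words; that is worth keeping in mind before expanding every record.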