Python Forum
Inverted Index - Printable Version

Inverted Index - thewal - Feb-10-2022

Hi,
I am trying to create an inverted index, but I can't seem to get it working.
I have read in XML files that contain ID, DESC and TEXT fields and pre-processed them (removing stop words, etc.).
My code so far is below.
First, my pre-processing:

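from collections import Counter

# tokenizer, stop_words, wordnet and wordnet_lemmatizer are defined elsewhere (not shown here)
def preprocess(document):
    document = document.lower()  # Lowercase
    words = tokenizer.tokenize(document)  # Tokenize
    words = [w for w in words if w not in stop_words]  # Remove stop words
    # Lemmatize each word as a noun, verb, adjective and adverb in turn
    for pos in [wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV]:
        words = [wordnet_lemmatizer.lemmatize(x, pos) for x in words]
    return Counter(words)  # Term counts for this document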
Then I read in the files and pre-process them:
import os
import xml.etree.ElementTree as ET  # assuming ET is the standard-library ElementTree

path = 'C:/my_files/'

files = os.listdir(path)
print(len(files))

collection = {}

for file in files:
    file_path = os.path.join(path, file)

    tree = ET.parse(file_path)
    root = tree.getroot()

    doc_id = root.find('DOCID').text
    header = root.find('HEADLINE').text
    text = root.find('TEXT').text
    if header is None:
        header = ''
    if text is not None:
        # If there is text, concatenate the header and the text
        # (with a space, so the last header word and first text word do not merge)
        final_text = header + ' ' + text
    else:
        # Otherwise, just take the header
        final_text = header

    collection[doc_id] = preprocess(final_text)
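The header/text handling above could also be written with `findtext`, which returns a default instead of `None` when an element is missing. This is only a sketch: the sample XML layout and the helper name `extract_text` are assumptions made up for illustration, and the real files may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the real files may be laid out differently
sample_xml = """<DOC>
  <DOCID>doc_1</DOCID>
  <HEADLINE>Cats and cows</HEADLINE>
  <TEXT></TEXT>
</DOC>"""


def extract_text(root):
    """Return (doc_id, header + text), treating missing or empty parts as ''."""
    doc_id = root.findtext('DOCID', default='')
    header = root.findtext('HEADLINE', default='')
    text = root.findtext('TEXT', default='')
    # Join only the non-empty parts so no stray space is added
    final_text = ' '.join(part for part in (header, text) if part)
    return doc_id, final_text


doc_id, final_text = extract_text(ET.fromstring(sample_xml))
print(doc_id, '|', final_text)
```

With this shape, the `None` checks are no longer needed, since an absent or empty element simply contributes nothing to `final_text`.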
Then below is my attempt at creating an Inverted Index.

def inverted_index(data):
    all_words = collection
    index = {}
    for word in all_words:
        for doc, tokens in data.items():
            if word in tokens:
                if word in index:
                    index[word].append(doc)
                else:
                    index[word] = [doc]
    return index
I believe I need the output to look something like this:

Inverted_Index = {'cat': ['doc_1', 'doc_5'], 'cow': ['doc_4', 'doc_20']}

Any help would be great.

Thanks
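
A likely cause, going by the code in the post: `all_words = collection` makes the outer loop iterate over document IDs rather than words, so `word in tokens` rarely matches and the index comes back empty. Below is a minimal sketch of one possible fix, assuming `collection` maps each document ID to a `Counter` of tokens as returned by `preprocess`. The name `build_inverted_index` and the sample data are made up for illustration.

```python
from collections import Counter, defaultdict


def build_inverted_index(data):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id, tokens in data.items():
        # Iterating a Counter yields each distinct word once
        for word in tokens:
            index[word].append(doc_id)
    # Sort the postings so the output is stable, and return a plain dict
    return {word: sorted(docs) for word, docs in index.items()}


# Hypothetical sample standing in for the real `collection`
sample = {
    'doc_1': Counter(['cat', 'dog', 'cat']),
    'doc_4': Counter(['cow']),
    'doc_5': Counter(['cat', 'cow']),
}

inverted = build_inverted_index(sample)
print(inverted['cat'])  # ['doc_1', 'doc_5']
print(inverted['cow'])  # ['doc_4', 'doc_5']
```

Looping over the documents once and appending each document ID under its words avoids the need for a separate vocabulary list, and calling `build_inverted_index(collection)` should give output in the shape described in the post.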