Python Forum
Inverted Index
I am trying to create an inverted index, but I can't seem to get it working.
I have read in XML files that contain ID, DESC and TEXT, and have pre-processed them (removing stop words, etc.).
My code so far is below.
First, my pre-processing:

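from collections import Counter

# tokenizer, stop_words, wordnet and wordnet_lemmatizer are set up elsewhere (not shown)
def preprocess(document):
    document = document.lower()  # Lowercase
    words = tokenizer.tokenize(document)  # Tokenize
    words = [w for w in words if w not in stop_words]  # Remove stopwords
    # Lemmatize each word as a noun, verb, adjective and adverb in turn
    for pos in [wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV]:
        words = [wordnet_lemmatizer.lemmatize(x, pos) for x in words]
    return Counter(words)  # Maps each word to its count in this document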
Then I read in the files and pre-process them:
import os
import xml.etree.ElementTree as ET

path = 'C:/my_files/'

files = os.listdir(path)

collection = {}

for file in files:
    file_path = os.path.join(path, file)  # Full path to the XML file
    tree = ET.parse(file_path)
    root = tree.getroot()

    doc_id = root.find('DOCID').text
    header = root.find('HEADLINE').text
    text = root.find('TEXT').text
    if header is None:
        header = ''
    if text is not None:
        # If there is text, concatenate header and text
        final_text = header + text
    else:
        # Otherwise, just take the header
        final_text = header
    collection[doc_id] = preprocess(final_text)
Then below is my attempt at creating an Inverted Index.

def inverted_index(data):
    all_words = collection
    index = {}
    for word in all_words:
        for doc, tokens in data.items():
            if word in tokens :
                if word in index.keys():
                    index[word] = [doc]
    return index
I believe I need the output to look something like this:

Inverted_Index = {'cat': ['doc_1', 'doc_5'], 'cow': ['doc_4', 'doc_20']}

Any help would be great.
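
One possible approach, offered as a minimal sketch and not a definitive fix: the attempt above loops over `collection`, whose keys are document IDs and not words, and it only assigns to `index[word]` when the word is already a key, so nothing is ever added. The sketch below assumes `data` has the same shape as `collection` (document ID mapped to a `Counter` of tokens). The name `build_inverted_index` and the sample data are hypothetical, made up for illustration.

```python
from collections import Counter, defaultdict

def build_inverted_index(data):
    """Map each word to a sorted list of the document IDs that contain it.

    `data` is assumed to look like `collection` in the post:
    {doc_id: Counter({word: count, ...}), ...}
    """
    index = defaultdict(set)
    for doc_id, tokens in data.items():
        for word in tokens:  # Iterating a Counter yields its distinct words
            index[word].add(doc_id)
    # Convert the sets to sorted lists so the output is stable
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}

# Hypothetical sample data standing in for the real collection
sample = {
    'doc_1': Counter(['cat', 'sit', 'mat']),
    'doc_4': Counter(['cow', 'field']),
    'doc_5': Counter(['cat', 'cow']),
}
index = build_inverted_index(sample)
print(index['cat'])
print(index['cow'])
```

With the real data this would be called as `build_inverted_index(collection)`. A set is used per word so that each document ID appears only once; note that `sorted` orders the IDs as strings, so 'doc_20' sorts before 'doc_4'.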

