Inverted Index

thewal · Feb-10-2022, 03:54 PM

Hi,
I am trying to create an inverted index, but I can't seem to get it working.
I have read in XML files that contain ID, DESC and TEXT, and have pre-processed them (removing stop words, etc.).
My code so far is below.
First, my pre-processing:

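from collections import Counter
from nltk.corpus import wordnet  # assumed: the POS constants look like NLTK's

# tokenizer, stop_words and wordnet_lemmatizer are set up elsewhere (not shown)
def preprocess(document):
    document = document.lower()  # Lowercase
    words = tokenizer.tokenize(document)  # Tokenize
    words = [w for w in words if w not in stop_words]  # Remove stop words
    # Lemmatize once per part of speech
    for pos in [wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV]:
        words = [wordnet_lemmatizer.lemmatize(x, pos) for x in words]
    return Counter(words)  # term -> frequency within this document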

Then I read in the files and pre-process them:

import os
import xml.etree.ElementTree as ET  # assumed: ET is the standard-library ElementTree

path = 'C:/my_files/'

files = os.listdir(path)
print(len(files))

collection = {}  # doc_id -> Counter of pre-processed terms

for file in files:
    file_path = os.path.join(path, file)

    tree = ET.parse(file_path)
    root = tree.getroot()

    doc_id = root.find('DOCID').text
    header = root.find('HEADLINE').text
    text = root.find('TEXT').text
    if header is None:
        header = ''
    if text is not None:
        # If there is text, concatenate header and text
        final_text = header + text
    else:
        # Otherwise, just take the header
        final_text = header

    collection[doc_id] = preprocess(final_text)

Then below is my attempt at creating an Inverted Index.

def inverted_index(data):
    all_words = collection
    index = {}
    for word in all_words:
        for doc, tokens in data.items():
            if word in tokens :
                if word in index.keys():
                    index[word].append(doc)
                else:
                    index[word] = [doc]
    return index

I believe I need the output to look something like this:

Inverted_Index = {'cat': ['doc_1', 'doc_5'], 'cow': ['doc_4', 'doc_20']}

Any help would be great,

Thanks
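
One likely cause, going only by the code posted: `all_words = collection` makes the outer loop walk over document IDs rather than words, so `word in tokens` will rarely match and the index stays mostly empty. A minimal sketch of one way around this is to loop over each document's own terms instead. It assumes `data` has the same shape as `collection` (document ID mapped to a `Counter` of terms); `build_inverted_index` and `sample` are names made up for this illustration, and the sample data is invented.

```python
from collections import Counter, defaultdict


def build_inverted_index(data):
    """Map each term to a sorted list of the document IDs containing it.

    Assumes `data` is a dict of doc_id -> Counter of terms.
    """
    index = defaultdict(set)
    for doc_id, tokens in data.items():
        for term in tokens:  # iterating a Counter yields its distinct terms
            index[term].add(doc_id)
    # Convert the sets to sorted lists so the output is stable
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Invented sample standing in for `collection`
sample = {
    'doc_1': Counter({'cat': 2, 'dog': 1}),
    'doc_4': Counter({'cow': 1}),
    'doc_5': Counter({'cat': 1, 'cow': 3}),
}

inverted = build_inverted_index(sample)
print(inverted['cat'])
print(inverted['cow'])
```

Using a set per term avoids listing the same document twice, and it only makes a single pass over the data rather than a pass per word. If the term counts are needed later (for example for TF-IDF), the inner value could store `(doc_id, count)` pairs instead.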
