Feb-10-2022, 03:54 PM
Hi,
I am trying to create an inverted index, but I can't seem to get it working.
I have read in XML files that contain ID, DESC and TEXT, and have done pre-processing on them, e.g. removing stop words.
See my code so far below.
First, my pre-processing:
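from collections import Counter
from nltk.corpus import wordnet

# tokenizer, stop_words and wordnet_lemmatizer are defined elsewhere (not shown in this post)
def preprocess(document):
    document = document.lower()  # Lowercase
    words = tokenizer.tokenize(document)  # Tokenize
    words = [w for w in words if w not in stop_words]  # Remove stop words
    # Lemmatize each word as a noun, verb, adjective and adverb in turn
    for pos in [wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV]:
        words = [wordnet_lemmatizer.lemmatize(x, pos) for x in words]
    return Counter(words)

Then I read in the files and pre-process them: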
import os
import xml.etree.ElementTree as ET  # assuming ET is the standard-library ElementTree

path = 'C:/my_files/'
files = os.listdir(path)
print(len(files))

collection = {}
for file in files:
    file_path = os.path.join(path, file)
    tree = ET.parse(file_path)
    root = tree.getroot()
    doc_id = root.find('DOCID').text
    header = root.find('HEADLINE').text
    text = root.find('TEXT').text
    if header is None:
        header = ''
    if text is not None:
        # If there is text, concatenate header and text
        final_text = header + text
    else:
        # Otherwise, just take the header
        final_text = header
    collection[doc_id] = preprocess(final_text)

Then below is my attempt at creating an inverted index:
def inverted_index(data):
    all_words = collection
    index = {}
    for word in all_words:
        for doc, tokens in data.items():
            if word in tokens:
                if word in index.keys():
                    index[word].append(doc)
                else:
                    index[word] = [doc]
    return index

I believe I need the data to come out something like this:
Inverted_Index = {'cat': ['doc_1', 'doc_5'], 'cow': ['doc_4', 'doc_20']}
Any help would be great,
Thanks
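
One likely cause of the problem: in the posted inverted_index function, all_words is set to collection, so the outer loop walks over document IDs rather than words, and the test "word in tokens" will rarely match. Below is a minimal sketch of one way to get the desired shape, assuming collection maps each document ID to the Counter returned by preprocess. The name build_inverted_index and the sample data are hypothetical, made up for illustration only.

```python
from collections import Counter, defaultdict


def build_inverted_index(collection):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in collection.items():
        # Iterating over a Counter (or any dict) yields its distinct terms.
        for term in tokens:
            index[term].add(doc_id)
    # Convert the sets to sorted lists so the result is a plain dict of lists.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical sample data standing in for the real pre-processed collection.
sample_collection = {
    "doc_1": Counter(["cat", "sit", "mat"]),
    "doc_4": Counter(["cow", "eat", "grass"]),
    "doc_5": Counter(["cat", "chase", "mouse", "cat"]),
}

inverted = build_inverted_index(sample_collection)
print(inverted["cat"])  # ['doc_1', 'doc_5']
print(inverted["cow"])  # ['doc_4']
```

Each document is visited once and each of its distinct terms is filed under that document's ID, so there is no need to build a separate vocabulary first. Note that sorted orders the IDs as strings, so 'doc_20' would come before 'doc_4'; a different sort key would be needed if numeric order matters.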