Python Forum

Add NER output to pandas dataframe
I'm trying to write a script that scans a document and adds the extracted information to a dataframe. Currently it just prints the results and exits. The goal is to be able to associate named entities with the documents where they're mentioned. Where I'm truly lost is how to turn the NER output into separate rows. It should be easy, because it's just a list with fairly clear separations, but I'm not sure whether there's a particular library I should use, or exactly how to get these two pieces to "talk" to each other.

My working code is below. I also have a non-working version where I attempted to add the dataframe functionality, but it's so far from correct that it isn't worth including.
#!/usr/bin/env python3

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the pretrained NER model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

# Token-classification pipeline; returns one dict per tagged token
ner_cls = pipeline("ner", model=model, tokenizer=tokenizer)


x = input('Enter document: ')

# Read the whole file; the with-block closes it afterwards
with open(x, "r") as f:
    document = f.read()
ner_results = ner_cls(document)

organized_results = {'LOC': [], 'PER': [], 'ORG': [], 'MISC': []}

current_entity = None
current_words = []

for result in ner_results:
    entity_type = result['entity'].split('-')[1]
    if result['entity'].startswith('B-'):
        if current_entity:
            organized_results[current_entity].append(' '.join(current_words))
        current_entity = entity_type
        current_words = [result['word']]
    elif result['entity'].startswith('I-') and current_entity == entity_type:
        current_words.append(result['word'])

# Handle the last entity
if current_entity:
    organized_results[current_entity].append(' '.join(current_words))

# Rejoin WordPiece fragments: "Hu ##gging" becomes "Hugging"
for key, value in organized_results.items():
    organized_results[key] = [word.replace(' ##', '').replace('##', '') for word in value]

print(organized_results)
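
One possible way to get from the dictionary above to a dataframe is to flatten it into one row per entity and tag each row with the document it came from. The sketch below is only an illustration under some assumptions: the helper name entities_to_frame, the column names, and the sample data and file names are all made up for the example, and the input is assumed to have the same shape as organized_results (entity type mapped to a list of strings).

```python
import pandas as pd


def entities_to_frame(organized_results, document_name):
    """Flatten {'PER': [...], 'ORG': [...]} into one row per (document, type, entity)."""
    rows = [
        {"document": document_name, "entity_type": entity_type, "entity": entity}
        for entity_type, entities in organized_results.items()
        for entity in entities
    ]
    # Passing columns explicitly keeps the layout stable even when no entities were found.
    return pd.DataFrame(rows, columns=["document", "entity_type", "entity"])


# Hypothetical sample in the same shape as organized_results from the script.
sample = {"LOC": ["Berlin"], "PER": ["Ada Lovelace", "Alan Turing"], "ORG": [], "MISC": []}
df = entities_to_frame(sample, "notes.txt")
print(df)

# Results from several documents can be stacked into one table.
other = entities_to_frame(
    {"LOC": [], "PER": ["Alan Turing"], "ORG": ["Bletchley Park"], "MISC": []},
    "memo.txt",
)
all_entities = pd.concat([df, other], ignore_index=True)

# Which documents mention each entity?
mentions = all_entities.groupby("entity")["document"].apply(list)
print(mentions)
```

In the script itself, the call would presumably be entities_to_frame(organized_results, x), so that the file name typed at the prompt ends up in the document column. The long format (one row per mention) was chosen here because it makes grouping by entity or by document straightforward.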