Please help NLP: Stanza

xinyulon · (This post was last modified: May-08-2022, 09:54 PM by xinyulon.)

Here is the question:

The directory data contains 10 articles from the Estonian Wikipedia, whose filenames follow the pattern et_wiki_X.txt, in which X stands for a number that identifies the article.

Open each file, read the contents and store the resulting string objects into a list named texts.

Prepare these texts for processing using Stanza by creating Document objects without annotations.

Store the resulting Document objects into a list named docs_in.

Here is my answer:

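import stanza
from pathlib import Path

nlp_et = stanza.Pipeline(lang='et')
corpus_dir = Path('data')
files = list(corpus_dir.glob(pattern='*_*_*.txt'))

# Create the lists once, before the loop, so they are not reset for every file
texts = []
docs_in = []
for file in files:
    text = file.read_text(encoding='utf-8')
    texts.append(text)
    # Note: this runs the full pipeline, so the returned Document is annotated
    processed = nlp_et(text)
    docs_in.append(processed)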

I wonder how Stanza can create a Document object without annotations. Isn't a Stanza Document bound to have annotations? I tried the line below, but it doesn't seem right. Could anyone please offer a hint?

nlp_et = stanza.Pipeline(lang='et', processors = ' ')

Thank you so much!
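One possible approach, offered as a sketch rather than a definitive answer: Stanza's documentation on processing multiple documents wraps each raw string in a Document with an empty sentence list, i.e. stanza.Document([], text=...), and no Pipeline is needed for that step. The helper names read_texts and make_empty_docs below are made up for illustration, and the et_wiki_*.txt pattern is taken from the task description.

```python
from pathlib import Path


def read_texts(corpus_dir, pattern="et_wiki_*.txt"):
    """Read every matching file into a list of strings, sorted by filename."""
    files = sorted(Path(corpus_dir).glob(pattern))
    return [f.read_text(encoding="utf-8") for f in files]


def make_empty_docs(texts):
    """Wrap each raw string in a stanza Document that has no annotations yet."""
    # Imported here so that read_texts() can be used without Stanza installed.
    import stanza

    # An empty sentence list plus the raw text gives an unannotated Document.
    return [stanza.Document([], text=t) for t in texts]


# Intended usage for the task (assumes the 'data' directory from the question):
#   texts = read_texts("data")
#   docs_in = make_empty_docs(texts)
# The list can later be passed to a pipeline in one call to get annotated docs:
#   nlp_et = stanza.Pipeline(lang="et")
#   docs_out = nlp_et(docs_in)
```

With this layout the texts and docs_in lists are each built once, and annotation only happens later, when docs_in is handed to a pipeline.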
