May-09-2022, 08:44 AM
Here is the question:
Get sentences from Stanza Document objects.
The directory data contains a file named docs.pkl, which contains 10 articles from the Estonian Wikipedia that have been processed using Stanza. The code provided in the cell below loads these Document objects and stores them into a list named docs.
Assume that we want to study the linguistic features of introductory sentences in Estonian Wikipedia.
The first sentence in each Document object in the list docs corresponds to the title of the article, which means we must retrieve the second sentence in the Document.
Collect the second sentence of each Document object into a list named intros.
When I try to run docs, it returns a dictionary-like list that is segmented by token, with each token's linguistic annotations. I want a text-like, sentence-segmented doc object on which I can use the docs.sents / docs.sentences attributes directly, but I don't know how to get one.
I am really clueless here and wrote some random steps. But I assume it shouldn't be too hard but somehow just can't get it right. Please offer any hint if you happen to know this! Thank you!
# Import the 'pickle' module from Python for serializing data
import pickle
import spacy
import spacy_stanza

# Open the file with pickled Stanza Document objects for reading
with open('data/docs.bin', mode='rb') as f:

    # Load the pickled Documents and assign under variable 'docs'
    docs = pickle.load(f)

# Write your answer below this line. Please enter your entire solution in this cell.
intros = []
#for doc in docs:
docs[1].text
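A possible hint, as a minimal sketch rather than a definitive answer: assuming the pickled objects are ordinary Stanza Document objects, each one keeps its sentences in the list doc.sentences, and each Sentence has a text attribute (doc.sents is the spaCy name, not the Stanza one). The names second_sentences and fake_doc below are made up for illustration, and the stand-in objects only mimic those two attributes so the snippet runs without Stanza or the pickle file.

```python
from types import SimpleNamespace


def second_sentences(docs):
    """Return the second sentence of every document that has at least two."""
    # Stanza Document objects keep their sentences in the `sentences` list,
    # so index 1 is the sentence right after the title.
    return [doc.sentences[1] for doc in docs if len(doc.sentences) > 1]


# Hypothetical stand-in that only mimics `Document.sentences` and
# `Sentence.text`; it is not part of Stanza.
def fake_doc(*texts):
    return SimpleNamespace(sentences=[SimpleNamespace(text=t) for t in texts])


demo_docs = [
    fake_doc("Tallinn", "Tallinn on Eesti pealinn."),
    fake_doc("Tartu", "Tartu on linn Eestis."),
]

# Sentence objects, which is what the exercise asks to collect
intros = second_sentences(demo_docs)

# Plain strings, if only the text is needed
intro_texts = [sentence.text for sentence in intros]
```

With the real data, calling second_sentences(docs) on the list loaded from the pickle should give the intros list the exercise describes. As far as I know, the dictionary-like dump that appears when evaluating docs is just how a Stanza Document prints itself; the sentence segmentation is still there under doc.sentences.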