Hello everybody,
As part of a project I need to build a corpus that will later be used to train an NLP model for named entity recognition etc.
I'm a bit stuck at the moment because I only have basic knowledge of Python, or let's say of programming in general.
What I have available: A folder with a bunch of Word documents (.docx)
What I did: a lot of research on working with documents in Python, "walking" through the files in a folder, (nested) for loops, lists, dictionaries, how to access attributes of the documents, what "JSON" is, etc. Of course I've also tried to write some working code already:
    # In PyCharm:
    import os
    import docx

    path = '/path/to/FolderWithDocuments'
    for filename in os.listdir(path):
        if filename.endswith('.docx'):
            # Use a separate variable so the folder path is not overwritten
            doc_path = os.path.join(path, filename)

This code "walks" over the Word documents in the folder. From these Word documents I'll have to extract the raw text as well as the title (and by the way, the title is not the first line in the document; the title is the name of the document itself). I've been told I should figure out how to do this with python-docx, but other tips and ideas are also welcome. Ideally I'd end up with a JSON file which can later be used for NLP.
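For the title part, here is a minimal sketch of what I think could work, assuming the title really is just the file name without the `.docx` extension. The helper name `title_from_path` is my own invention, not something from python-docx:

```python
from pathlib import Path


def title_from_path(doc_path):
    """Return the file name without its extension, used here as the document title."""
    return Path(doc_path).stem


# The folder and file name below are placeholders
print(title_from_path('/path/to/FolderWithDocuments/NameOfDocument.docx'))
```

This only uses the standard library, so it should not matter what is inside the document itself.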
If I open Python in a Linux terminal and type:
    import docx

    d = docx.Document('NameOfDocument.docx')
    dir(d)
    texts = [p.text for p in d.paragraphs]
    texts

Then my output is written like this:
    ['','','some tekst in the document', '', 'some more text', ........ , '\t\t\tStW some more text', '','','','']

At this point I really have no clue which steps I need to take to go from this to a nice corpus for NLP. I have the feeling that I need to make a list of lists, then make a dictionary from these lists, and finally convert this dictionary into a JSON file somehow, but I'm really stuck here. Maybe there is an easier way, because I'm getting confused about building a list of lists and about how to do this while looping over the whole batch of documents...
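To make the idea more concrete, this is a rough sketch of the direction I imagine, not something I know to be the right approach. It assumes one dictionary per document (with the keys `title` and `text`, which I made up), collected in a plain list instead of a list of lists, and it assumes empty paragraphs and leading tabs can simply be thrown away. The function names are hypothetical as well; only `docx.Document(...)` and `.paragraphs` come from python-docx:

```python
import json
from pathlib import Path


def clean_paragraphs(texts):
    """Strip surrounding whitespace (including tabs) and drop empty paragraphs."""
    return [t.strip() for t in texts if t.strip()]


def make_record(title, texts):
    """One dictionary per document: its title plus the cleaned raw text."""
    return {"title": title, "text": "\n".join(clean_paragraphs(texts))}


def build_corpus(folder):
    """Build a list of records, one for every .docx file in `folder`."""
    import docx  # python-docx; imported here so the helpers above work without it

    corpus = []
    for doc_path in sorted(Path(folder).glob("*.docx")):
        if doc_path.name.startswith("~$"):
            continue  # skip the temporary lock files Word creates for open documents
        paragraphs = [p.text for p in docx.Document(str(doc_path)).paragraphs]
        corpus.append(make_record(doc_path.stem, paragraphs))
    return corpus


def save_corpus(corpus, out_path):
    """Write the list of records to a single JSON file."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(corpus, f, ensure_ascii=False, indent=2)


# Try the helpers on a list like the one above, no Word file needed
sample = ['', '', 'some tekst in the document', '', 'some more text',
          '\t\t\tStW some more text', '', '']
print(json.dumps(make_record('NameOfDocument', sample), ensure_ascii=False, indent=2))
```

With real files I would then expect to call `save_corpus(build_corpus('/path/to/FolderWithDocuments'), 'corpus.json')`, where `corpus.json` is just a placeholder name. Whether joining the paragraphs with newlines is the right choice for NER training is something I'm not sure about.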
I hope I've given enough information and that everything in my explanation is clear.
Thank you very much in advance for taking the time to look at this :-)