Sep-15-2019, 11:47 AM
Hello everyone!
I'm learning Python as part of my thesis, so I'm still inexperienced because I've only just started. (I do have some experience in other programming languages, like C, Java and C#.)
The first things I need to do in this project are these:
I have a bunch of MS Word files (.docx) and I need to extract all the text from them so I can feed it to spaCy for annotation using named entity recognition.
First I tried to figure out how to extract text from one Word document. python-docx seemed like the easiest way to do that, so I did it like this:
```python
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
```
(which I found in this source: link)
But I wanted to test whether this actually works, so I added this to my program:
```python
def main():
    txt = getText('worddocument.docx')
    print(txt)
```
This doesn't print anything and gives me no output at all. I have no idea what I'm actually doing wrong, or whether this is the right way to extract text from a Word document.
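One possible explanation, assuming the script contains nothing beyond the two snippets above: main() is defined but never called, and Python (unlike C, Java or C#) does not run a function named main automatically. A minimal sketch of the usual entry-point guard follows; get_text_stub is a hypothetical stand-in for getText so the example runs without python-docx installed.

```python
def get_text_stub(filename):
    # Hypothetical stand-in for getText, so this sketch runs without python-docx.
    return "text from " + filename

def main():
    txt = get_text_stub('worddocument.docx')
    print(txt)
    return txt

# Without this call, main() is only defined and nothing gets printed.
if __name__ == "__main__":
    main()
```

In the real script the stub would be replaced by getText. If the output is still empty after that, the text may live in tables, headers or text boxes, which doc.paragraphs does not include.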
Also, I still need to figure out the best way to extract text from multiple documents. I've already googled the topic but I have no plan for it yet. I guess I'll need a for-loop that calls this getText function on every document in a certain folder, but that's for later. First things first.
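A rough sketch of what that loop could look like, assuming all the files sit directly in one folder. The names extract_folder and extract are made up for illustration; extract would be any function that takes a path and returns text, such as getText.

```python
from pathlib import Path

def extract_folder(folder, extract):
    """Return {filename: text} for every .docx file directly inside folder.

    extract is any function that takes a path string and returns text.
    """
    texts = {}
    for path in sorted(Path(folder).glob("*.docx")):
        # Word leaves temporary lock files named "~$..." next to open documents; skip them.
        if path.name.startswith("~$"):
            continue
        texts[path.name] = extract(str(path))
    return texts
```

Something like extract_folder('my_documents', getText) would then give one text per file, where 'my_documents' is a placeholder folder name.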
Does anyone have any ideas or recommendations?