Python Forum
Data extraction from (multiple) MS Word file(s) in python
#1
Hello everyone!

I'm learning Python as part of my thesis, so I'm still inexperienced; I've only just started. (I do have some experience in other programming languages such as C, Java and C#.)

The first things I need to do in this project are these:
I have a bunch of MS Word files (.docx) and I need to extract all the text from them so I can feed it to spaCy and annotate it using named entity recognition.

First I tried to figure out how to extract text from one Word document. python-docx seemed the easiest tool for that, so I did it like this:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
(which I found in this source: link)

But I wanted to test whether this actually works, so I added this to my program:

def main():
     txt = getText('worddocument.docx')
     print(txt)
As a result this doesn't print anything or give me any output, and I have no idea what I'm actually doing wrong or whether this is the right way to extract text from a Word document.

Also, I need to figure out the best way to extract text from multiple documents. I've already googled the topic but I have no plan for it yet. I guess I'll need a for loop that calls this getText function on every document in a given folder, but that's for later. First things first.

Does anyone have any ideas or recommendations?
Reply
#2
Is that the whole program? You don't seem to be calling main anywhere. If that isn't the whole program, please don't make us guess what code you have. Either post the whole thing or a minimal yet complete example that demonstrates the problem.
Reply
#3
Hello,

I'm sorry, I probably lack experience in Python programming. Your response made me realize, while watching another YouTube tutorial, that I don't necessarily need a main() in my program at all.

But even if I remove the "def main():", that doesn't really make a difference anyway...

So far, this is all the code I have... I have a bunch of docx files in one folder and the text from all of them needs to be extracted, but I think I should first figure out how to extract from one file, so that printing the text from one file proves the code works.

I believe my next step is to watch more YouTube tutorials on programming in Python. My programming in C, Java and C# was all at least 1.5 years ago anyway...

But about the program above: I didn't just copy and paste without thinking. I know what every line of the code should do, but I guess I'm missing some detail, because I don't get why it doesn't give me the text from the Word document as output :/
Reply
#4
Quote: But even if I remove the "def main():", that doesn't really make a difference anyway...

So far, this is all the code I have... I have a bunch of docx files in one folder and the text from all of them needs to be extracted, but I think I should first figure out how to extract from one file, so that printing the text from one file proves the code works.

I believe my next step is to watch more YouTube tutorials on programming in Python. My programming in C, Java and C# was all at least 1.5 years ago anyway...
I agree that a proof of concept is the right next step, and the tutorials on YouTube and here are good for getting started. I'll put in half a plug: I like the Head First series of books; they fit my sense of humor. I did not like that much of the Head First Python book uses Flask for HTML presentation rather than looking at other GUI options. That said, the chapters before the push to Flask are a good source.

One issue is that you have defined a function but did not call it. Even when you had a main() function, you did not call it. Look at your first post: put the two code blocks together, then add one line at the bottom:
main()
That would actually call main, which would call getText, and you could test your code.
Of course, at that point you would need to save the code and execute the module. Which IDE are you using?
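To make that concrete, here is a minimal sketch of the two blocks from the first post combined into one script, with the call added at the bottom. A few things in it are my own assumptions rather than part of the original code: the `if __name__ == '__main__':` guard (the usual way to call main() only when the file is run directly), the check that the file exists, and the import of python-docx inside getText. The file name is the one from the first post.

```python
import os


def getText(filename):
    """Return the text of all body paragraphs, one paragraph per line."""
    # Imported here (an assumption of this sketch) so the script can still
    # report a missing file when python-docx is not installed; a top-level
    # "import docx" as in the first post works just as well.
    import docx

    doc = docx.Document(filename)
    return '\n'.join(para.text for para in doc.paragraphs)


def main(filename='worddocument.docx'):
    # Guard added for this sketch: print a readable message, not a traceback.
    if not os.path.exists(filename):
        print(f'{filename} not found in {os.getcwd()}')
        return None
    txt = getText(filename)
    print(txt)
    return txt


# Defining main() only creates the function; this call is what runs it.
if __name__ == '__main__':
    main()
```

If the script prints the "not found" message, the .docx file is probably not in the working directory that the IDE uses when it runs the module.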
Reply
#5
Damn, I see that I'm really making a beginner's mistake here.
I tried something new that already gave me some output, but it's still not what it's supposed to be:

The definition of the getText function stays the same, and after it I wrote this:
def main():
    print("the main is executed")
    t = getText('paragraphtest.docx')
    print(t)


main()
This gives me the following output:
Quote: the main is executed
This is paragraph 1.

But paragraphtest.docx is actually a document containing this text:
Quote:This is paragraph 1.
This is paragraph 2.
This is paragraph 3.

(I made it just to test if there isn't something wrong with the word-document I'm using itself)

It looks like there's something wrong with the Word documents I'm supposed to use as input. Originally they are .doc files that I'm supposed to convert to .docx automatically. Figuring out the automatic part was taking me a while, so to save time I converted them manually, but... I guess that is where something went wrong :/
But anyway, I'm wondering why getText only prints one paragraph for me now.

(Using PyCharm, btw.)
Reply
#6
Your function works just fine for me with a simple file containing 3 paragraphs.
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#7
(Sep-16-2019, 11:37 AM)buran Wrote: your function works just fine for me - with simple file with 3 paragraphs

But why doesn't it extract and print the 3 paragraphs instead of only the first one?
Reply
#8
Check your docx file. In my case it prints all 3 paragraphs.
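One way to check the file is to look at what python-docx actually sees in it. The sketch below is only a diagnostic: `summarize_document` and `print_summary` are hypothetical helper names, and they rely on nothing more than the `paragraphs` and `tables` attributes of the object returned by `docx.Document(...)`. Text inside tables does not show up in `doc.paragraphs`, so a low paragraph count together with a nonzero table count would be one possible explanation. Another would be that the file on disk is not the version you think it is (for example, unsaved changes in Word).

```python
def summarize_document(doc):
    """Return (paragraph_texts, table_count) for a python-docx style document."""
    paragraph_texts = [para.text for para in doc.paragraphs]
    return paragraph_texts, len(doc.tables)


def print_summary(doc):
    """Print how many paragraphs and tables were found, then each paragraph."""
    texts, table_count = summarize_document(doc)
    print(f'{len(texts)} paragraph(s), {table_count} table(s)')
    for index, text in enumerate(texts, start=1):
        # repr() makes empty strings, tabs and embedded line breaks visible.
        print(index, repr(text))
```

After `import docx`, calling `print_summary(docx.Document('paragraphtest.docx'))` would show exactly which paragraphs the library found.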
Reply
#9
Hello everybody,

As part of a project I need to build a corpus that will later be used to train an NLP model for named entity recognition etc.
I'm a little bit stuck right now with my basic knowledge of Python programming, or let's say of programming in general.

What I have available: A folder with a bunch of Word documents (.docx)
What I did: a lot of research on working with documents in Python, "walking" through the files in a folder, (nested) for loops, lists, dictionaries, how to access attributes of the documents, what JSON is, etc. Of course, I've also tried to write some working code already:

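# In PyCharm:
import os
import docx

folder = '/path/to/FolderWithDocuments'
for filename in os.listdir(folder):
    if filename.endswith('.docx'):
        # Keep the folder in its own variable: reassigning it inside the loop
        # would corrupt the path for every file after the first.
        file_path = os.path.join(folder, filename)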
This code "walks" over the Word documents in the folder. From these Word documents I'll have to extract the raw text as well as the title (and btw, the title is not the first line in the document; the title is the name of the document itself). I've been told I should figure out how to do this with python-docx, but other tips and ideas are also welcome. Ideally I'd end up with a JSON file that can later be used for NLP.
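For the title part, a rough sketch of one way to do the walk with pathlib from the standard library follows. `find_documents` is a made-up helper name, and skipping Word's "~$" lock files is my own assumption about what the folder may contain. The title is taken as the file name without its extension:

```python
from pathlib import Path


def find_documents(folder):
    """Return a sorted list of (title, path) pairs for the .docx files in folder."""
    documents = []
    for file_path in sorted(Path(folder).iterdir()):
        # Word creates temporary lock files such as "~$report.docx"; skip them.
        if file_path.name.startswith('~$'):
            continue
        if file_path.suffix.lower() == '.docx':
            # The title is the file name without its extension.
            documents.append((file_path.stem, file_path))
    return documents
```

Each path can then be handed to python-docx as `docx.Document(str(path))`.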

If I open Python in a Linux terminal and type:
import docx
d=docx.Document('NameOfDocument.docx')
dir(d)
texts=[p.text for p in d.paragraphs]
texts
Then my output is written like this:
['','','some tekst in the document', '', 'some more text', ........ , '\t\t\tStW some more text', '','','','']
At this moment I really have no clue which steps I need to take to get from this to a nice corpus for NLP. I have the feeling that I need to make a list of lists, then make a dictionary from these lists, and in the end convert this dictionary into a JSON file somehow, but I'm really stuck here. Maybe there is an easier way, because I'm getting confused about making a list of lists and about how to do that while looping over this batch of documents...

I hope I've given enough information and that everything in my explanation is clear.
Thank you very much in advance for taking the time for this :-)
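One possible shape for the corpus, sketched under assumptions: instead of a list of lists, keep one dictionary per document and collect those dictionaries in a single list, which the standard json module can write directly. The helper names and the keys "title", "paragraphs" and "text" are invented for illustration; whatever consumes the corpus later may need a different layout. The cleaning step strips whitespace (including the leading tabs) and drops the empty strings seen in the output above.

```python
import json


def clean_paragraphs(texts):
    """Strip surrounding whitespace (including tabs) and drop empty paragraphs."""
    stripped = (text.strip() for text in texts)
    return [text for text in stripped if text]


def make_record(title, texts):
    """Build one corpus entry: the document title plus its cleaned text."""
    paragraphs = clean_paragraphs(texts)
    return {'title': title, 'paragraphs': paragraphs, 'text': '\n'.join(paragraphs)}


def write_corpus(records, out_path):
    """Write the list of records to a UTF-8 encoded JSON file."""
    with open(out_path, 'w', encoding='utf-8') as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)
```

Inside the loop over the folder, each document would contribute one `make_record(title, [p.text for p in d.paragraphs])` to a list, and `write_corpus` would be called once after the loop.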
Reply

