Python Forum

Full Version: Use module docx to get text from a file with a table
I made a simple function to get text from .docx files. Works OK.

Now I have a file containing a text frame at the top, a little text and a big table.

I can't get the text from the table or the text frame. Any tips on how that might be achieved?

import docx

def getText(filename):
    # Open the document first, then report how many paragraphs it has.
    doc = docx.Document(filename)
    print(len(doc.paragraphs))
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

myfile = input('Enter the full path to the file you want ... ')
text = getText(myfile)
This only gets free-standing text, not text in the frame or the table.
Please supply some sample data (something that causes the error).
I found another module, docx2txt, which says it can get text from tables as well.

Haven't had time to try it yet!

Workaround: save the .docx as text, use readlines(), then get the lines I want. Works!
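A minimal sketch of that workaround, assuming the document was saved as UTF-8 plain text and that the wanted lines can be picked out by a keyword. The function name wanted_lines and the keyword filter are hypothetical; the post does not say how the lines were chosen.

```python
def wanted_lines(path, keyword):
    """Return the lines of a plain-text file that contain keyword."""
    # Assumption: the text export is UTF-8; adjust encoding if Word used another one.
    with open(path, encoding="utf-8") as handle:
        lines = handle.readlines()
    # readlines() keeps the trailing newline, so strip it from each match.
    return [line.rstrip("\n") for line in lines if keyword in line]
```

Calling wanted_lines with the saved file's path and a keyword returns the matching lines as a list of strings.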
This finds tables in a document and converts them to dataframes.
from docx import Document
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
import pandas as pd

def tables(parent):
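    # Find the XML element whose direct children may include tables.
    if isinstance(parent, _Document):
        element = parent.element.body
    elif isinstance(parent, _Cell):
        element = parent._tc
    else:
        raise TypeError("parent must be a Document or a table cell")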

    for child in element.iterchildren():
        if isinstance(child, CT_Tbl):
            table = Table(child, parent)
            data = [[cell.text for cell in row.cells] for row in table.rows]
            yield pd.DataFrame(data[1:], columns=data[0])

for table in tables(Document('data.docx')):
    print(table, "\n")
I made a word document with multiple paragraphs and two tables. The output from running the program accurately shows the two tables.
Output:
   A  B  C  D
0  1  3  5  7
1  2  4  6  8

   A  B  C
0  1  2  3
1  4  5  6
Thanks!

I looked in the docx docs but didn't see that information.

Hope you don't mind if I copy your code!
The docx library doesn't have a lot of documentation, nor do I think it should. If you want to know about the docx file format, read the Microsoft documentation. I copied a lot of my example from a couple of posts on the web. Thinking there must be an easier way to parse through the document info, I wrote a small program to open a document, then used interactive Python to look at all the document parts. Lo and behold, Document has an attribute "tables" that is a list of the top-level tables in the document (tables nested inside a cell are not included)! That sure makes things easy.
from docx import Document
import pandas as pd

for table in Document("test.docx").tables:
    data = [[cell.text for cell in row.cells] for row in table.rows]
    print(pd.DataFrame(data[1:], columns=data[0]), "\n")
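For the original question (plain text rather than dataframes), the same "tables" attribute can feed a text-only function. This is a minimal sketch, assuming doc is a python-docx Document; document_text is a made-up name. Table rows are appended after the paragraphs instead of in document order, and text frames are not covered.

```python
def document_text(doc):
    """Collect paragraph text, then table text, from a python-docx Document."""
    # Free-standing paragraphs, as in the original getText().
    lines = [para.text for para in doc.paragraphs]
    # Each table row becomes one tab-separated line.
    for table in doc.tables:
        for row in table.rows:
            lines.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(lines)

# Usage with python-docx (not run here):
#     import docx
#     print(document_text(docx.Document("test.docx")))
```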
Thanks for the info!

I always use Libre Office, except when I need to interact with Python, then I save things in Excel or Word format.

It is noticeable that Excel or Word documents are smaller than their Libre Office counterparts.

Once, I made a typo while saving and saved a document like this: test.docx_

Do that, then look at the file: it is a zip file. Open it.

The word/document.xml inside contains the text; you just need a parser to get the text out of the <w:t> elements:

<w:t>Hello me.</w:t>

Anything else must be in there too; it's just a matter of parsing the right XML tags.

Python probably has an xml parser!
Pedroski55 Wrote: Python probably has an xml parser!
FYI:

Python does indeed have an XML parser.
Search the Python docs at https://docs.python.org/3/library/index.html to see if there is a built-in package
(you can use the search box at the top of that page, or search by subject matter)
For XML, this will give you: https://docs.python.org/3/library/xml.html

Also, if you're looking for a particular python package by subject, go to https://pypi.org/ and search for that subject

Or do a Google search, for example: "most popular python package for XML",
then search PyPI for the package name (the top result for XML was 'lxml') at https://pypi.org/
example: a direct search URL for lxml: https://pypi.org/search/?q=lxml
or simply search for the subject from the PyPI homepage search box
and use the filters to narrow down the list.

If there is a GitHub page for the chosen package, you can get there by clicking the Homepage icon;
if not, it may take you to the package's homepage (for lxml, the link is: https://lxml.de/ )
Thanks!

I'll look for a suitable xml parser and try to make my own docx miner!