Python Forum
Use module docx to get text from a file with a table
#1
I made a simple function to get text from .docx files. Works OK.

Now I have a file containing a text frame at the top, a little text and a big table.

I can't get the text from the table or the text frame. Any tips on how that might be achieved?

import docx

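def getText(filename):
    # Open the document first; doc must exist before it is used.
    doc = docx.Document(filename)
    print(len(doc.paragraphs))
    fullText = []
    # doc.paragraphs only covers body paragraphs, not table cells or text frames.
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)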

myfile = input('Enter the full path to the file you want ... ')
text = getText(myfile)
This only gets free-standing text, not text in the frame or the table.
.docx   table_Adverbs_ly.docx (Size: 7.32 KB / Downloads: 140)
#2
Please supply some sample data (something that causes the error).
#3
I found another module, docx2txt, which says it can get text from tables as well.

Haven't had time to try it yet!

Workaround: save the .docx as text, use readlines(), then get the lines I want. Works!
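
The workaround above might look something like the sketch below. The file name table_Adverbs_ly.txt, the UTF-8 encoding, and the "ends in ly" filter are assumptions made up for illustration; they are not details from this post.

```python
from pathlib import Path

def read_lines(path, encoding="utf-8"):
    """Return the non-empty lines of a text file exported from the word processor."""
    with open(path, encoding=encoding) as f:
        # readlines() keeps the trailing newline, so strip each line.
        return [line.strip() for line in f.readlines() if line.strip()]

def pick_lines(lines, wanted):
    """Keep only the lines for which wanted(line) is true."""
    return [line for line in lines if wanted(line)]

if __name__ == "__main__":
    txt = Path("table_Adverbs_ly.txt")  # hypothetical text export of the .docx
    if txt.exists():
        lines = read_lines(txt)
        # Hypothetical filter: keep lines containing a word that ends in "ly".
        has_ly_word = lambda s: any(w.endswith("ly") for w in s.split())
        for line in pick_lines(lines, has_ly_word):
            print(line)
```

The export step itself is still manual, so this only helps when the document does not change often.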
#4
This finds tables in a document and converts them to dataframes.
from docx import Document
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
import pandas as pd

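def tables(parent):
    # Yield each table found directly under a document body or a table cell.
    if isinstance(parent, _Document):
        element = parent.element.body
    elif isinstance(parent, _Cell):
        element = parent._tc
    else:
        raise TypeError("parent must be a Document or a _Cell")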

    for child in element.iterchildren():
        if isinstance(child, CT_Tbl):
            table = Table(child, parent)
            data = [[cell.text for cell in row.cells] for row in table.rows]
            yield pd.DataFrame(data[1:], columns=data[0])

for table in tables(Document('data.docx')):
    print(table, "\n")
I made a Word document with multiple paragraphs and two tables. The output from running the program accurately shows the two tables.
Output:
   A  B  C  D
0  1  3  5  7
1  2  4  6  8

   A  B  C
0  1  2  3
1  4  5  6
#5
Thanks!

I looked in the docx docs but didn't see that information.

Hope you don't mind if I copy your code!
#6
The docx library doesn't have a lot of documentation, nor do I think it should. If you want to know about the docx file format, read the Microsoft documentation.

I copied a lot of my example from a couple of posts on the web. Thinking there must be an easier way to parse the document, I wrote a small program that opens a document, then used interactive Python to look at all the document parts. Lo and behold, Document has an attribute "tables" that is a list of all the top-level tables in the document (tables nested inside a cell are not included)! That sure makes things easy.
from docx import Document
import pandas as pd

for table in Document("test.docx").tables:
    data = [[cell.text for cell in row.cells] for row in table.rows]
    print(pd.DataFrame(data[1:], columns=data[0]), "\n")
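
If plain text is wanted rather than dataframes, the same attribute can be folded back into something like the original getText. This is only a sketch: rows_to_text and get_text_with_tables are names invented here, the tab separator is an arbitrary choice, and the result lists all body paragraphs first and the tables afterwards, so the original order of text and tables is not preserved. Text frames are still not covered.

```python
def rows_to_text(rows, sep="\t"):
    """Join a table given as a list of rows (lists of strings) into text lines."""
    return "\n".join(sep.join(cells) for cells in rows)

def get_text_with_tables(filename):
    """Return paragraph text followed by the text of each top-level table."""
    # python-docx is imported here so rows_to_text() can be used without it.
    from docx import Document

    doc = Document(filename)
    parts = [para.text for para in doc.paragraphs]
    for table in doc.tables:
        rows = [[cell.text for cell in row.cells] for row in table.rows]
        parts.append(rows_to_text(rows))
    return "\n".join(parts)
```

Usage would be along the lines of print(get_text_with_tables("test.docx")).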
#7
Thanks for the info!

I always use LibreOffice, except when I need to interact with Python; then I save things in Excel or Word format.

It is noticeable that Excel or Word documents are smaller than their LibreOffice counterparts.

Once, I made a typo while saving and saved a document like this: test.docx_

Do that, then look at the file: it is a zip file. Open it.

Inside, word/document.xml contains the text; you just need a parser to get the text from the <w:t> elements:

<w:t>Hello me.</w:t>

Everything else must be in there too; it is just a matter of parsing the right XML tags.

Python probably has an xml parser!
#8
Pedroski55 Wrote: Python probably has an xml parser!
FYI:

Python does indeed have an XML parser.
Search the Python docs at https://docs.python.org/3/library/index.html to see if there is a built-in package
(you can use the search box at the top of that page, or browse by subject).
For XML, this will give you: https://docs.python.org/3/library/xml.html

Also, if you're looking for a particular Python package by subject, go to https://pypi.org/ and search for that subject.

Or do a Google search, for example: "most popular python package for XML",
then search PyPI for the package name (top for XML was 'lxml') at https://pypi.org/
Example for lxml: https://pypi.org/search/?q=lxml
Or simply search for the subject from the PyPI homepage search box
and use the filters to narrow down the list.

If there is a GitHub page for the chosen package, you can get there by clicking on the Homepage icon;
if not, it may take you to the package homepage (for lxml, the link is https://lxml.de/).
#9
Thanks!

I'll look for a suitable xml parser and try to make my own docx miner!
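
A minimal sketch of such a docx miner, using only the standard library (zipfile and xml.etree.ElementTree), could look like this. The function name docx_text is made up for illustration. The sketch assumes the main text lives in word/document.xml and walks every <w:p> paragraph, including those inside tables and text frames. Because a text frame sits inside an ordinary paragraph, and Word may store a fallback copy of it, text from frames can show up more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace behind the w: prefix in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(path_or_file):
    """Return the text of every <w:p> paragraph in a .docx, one per line."""
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    lines = []
    # iter() walks the whole tree, so paragraphs inside tables and
    # text frames are visited as well as free-standing ones.
    for para in root.iter(W + "p"):
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        if text:
            lines.append(text)
    return "\n".join(lines)
```

Calling print(docx_text("test.docx")) would then dump everything found, which is a starting point for picking out the wanted lines.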