I made a simple function to get text from .docx files. Works OK.
Now I have a file containing a text frame at the top, a little text and a big table.
I can't get the text from the table or the text frame. Any tips on how that might be achieved?
import docx

def getText(filename):
    doc = docx.Document(filename)
    # doc must be opened before its paragraphs can be counted
    print(len(doc.paragraphs))
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

myfile = input('Enter the full path to the file you want ... ')
text = getText(myfile)
This only gets free-standing text, not text in the frame or the table.
Please supply some sample data (something that causes the error).
I found another module, docx2txt, which says it can get text from tables as well.
Haven't had time to try it yet!
Workaround: save the .docx as text, use readlines(), then get the lines I want. Works!
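For anyone who wants to try the same workaround, it can be sketched roughly like this. The function name pick_lines, the UTF-8 encoding and the line range are assumptions made for illustration; Word asks for an encoding when saving as plain text, so adjust to match what was chosen.

```python
def pick_lines(path, start, stop, encoding="utf-8"):
    """Return lines start..stop-1 (0-based) of a text file, without newlines."""
    with open(path, encoding=encoding) as f:
        lines = f.readlines()
    # Slice out the wanted lines and strip the trailing newline from each
    return [line.rstrip("\n") for line in lines[start:stop]]

# Hypothetical usage: print the first ten lines of the exported text file
# for line in pick_lines("report.txt", 0, 10):
#     print(line)
```

Table cells end up as ordinary lines in the exported file, so picking the right range is a matter of inspecting the file once.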
This finds tables in a document and converts them to dataframes.
from docx import Document
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
import pandas as pd

def tables(parent):
    """Yield each table directly under parent as a DataFrame (first row = header)."""
    if isinstance(parent, _Document):
        element = parent.element.body
    elif isinstance(parent, _Cell):
        element = parent._tc
    else:
        raise TypeError("parent must be a Document or a table cell")
    for child in element.iterchildren():
        if isinstance(child, CT_Tbl):
            table = Table(child, parent)
            data = [[cell.text for cell in row.cells] for row in table.rows]
            yield pd.DataFrame(data[1:], columns=data[0])

for table in tables(Document('data.docx')):
    print(table, "\n")
I made a Word document with multiple paragraphs and two tables. The output from running the program accurately shows the two tables.
Output:
   A  B  C  D
0  1  3  5  7
1  2  4  6  8

   A  B  C
0  1  2  3
1  4  5  6
Thanks!
I looked in the docx docs but didn't see that information.
Hope you don't mind if I copy your code!
The docx library doesn't have a lot of documentation, nor do I think it should. If you want to know about the docx file format, read the Microsoft documentation. I copied a lot of my example from a couple of posts on the web. Thinking there must be an easier way to parse through the document info, I wrote a small program to open a document, then used interactive Python to look at all the document parts. Lo and behold, Document has an attribute "tables" that is a list of the tables in the document! (As far as I know it only covers top-level tables, not tables nested inside a cell.) That sure makes things easy.
from docx import Document
import pandas as pd

for table in Document("test.docx").tables:
    data = [[cell.text for cell in row.cells] for row in table.rows]
    print(pd.DataFrame(data[1:], columns=data[0]), "\n")
Thanks for the info!
I always use Libre Office, except when I need to interact with Python, then I save things in Excel or Word format.
It is noticeable that Excel or Word documents are smaller than their Libre Office counterparts.
Once, I made a typo while saving and saved a document like this: test.docx_
Do that, then look at the file: you have a zip file (a .docx is a zip archive under the hood, whatever the extension). Open that.
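The same thing can be checked from Python without renaming anything, since the standard-library zipfile module will open a .docx directly. A minimal sketch; list_docx_parts is just a name made up for this example:

```python
import zipfile

def list_docx_parts(path):
    """Return the names of the files stored inside a .docx (a zip archive)."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

# Hypothetical usage:
# for name in list_docx_parts("test.docx"):
#     print(name)
```

For a document saved by Word, the list normally includes word/document.xml, which holds the main body text.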
The word/document.xml inside it contains the text; you just need a parser to get the text from the <w:t> elements:
<w:t>Hello me.</w:t>
Everything else must be in there too; it's just a matter of parsing the right XML tags.
Python probably has an XML parser!
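As a rough illustration of that idea, here is a sketch using only the standard library (zipfile plus xml.etree.ElementTree). It assumes the usual WordprocessingML namespace for the w: prefix, and docx_paragraphs is a made-up name. Because it walks every <w:p> in the file, it also picks up paragraphs inside table cells and text boxes; text-box content may appear more than once, since Word can store it both in the enclosing paragraph and in a fallback copy.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace behind the "w:" prefix in word/document.xml (assumed standard)
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(path):
    """Return the text of every <w:p> in word/document.xml, in document order."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is split across <w:t> runs; join them back up
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return paragraphs

# Hypothetical usage:
# print("\n".join(docx_paragraphs("test.docx")))
```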
Pedroski55 wrote: Python probably has an xml parser!
FYI:
Python does indeed have an XML parser.
To see whether there is a built-in package for something, search the Python docs:
https://docs.python.org/3/library/index.html
(you can use the search box at the top of that page, or browse by subject matter)
For XML, this will give you:
https://docs.python.org/3/library/xml.html
Also, if you're looking for a particular Python package by subject, go to
https://pypi.org/ and search for that subject.
Or do a Google search, for example "most popular python package for XML",
then search PyPI for the package name (the top result for XML was 'lxml') at
https://pypi.org/
Example: the search URL for lxml:
https://pypi.org/search/?q=lxml
Or simply search for the subject from the PyPI homepage search box
and use the filters to narrow down the list.
If there is a GitHub page for the chosen package, you can get there by clicking on the Homepage icon;
if not, it may take you to the package's own homepage (for lxml, the link is:
https://lxml.de/ )
Thanks!
I'll look for a suitable XML parser and try to make my own docx miner!
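A possible starting point for such a miner, again with only the standard library: the sketch below reads the tables that sit directly in the document body and returns each one as a list of rows of cell text. The name docx_tables is invented for this example, the w: namespace is assumed to be the standard one, and merged cells, nested tables and rows wrapped in content controls are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace behind the "w:" prefix in word/document.xml (assumed standard)
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_tables(path):
    """Return each top-level body table as a list of rows of cell text."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    if body is None:
        return []
    tables = []
    for tbl in body.findall(W + "tbl"):      # direct children of the body only
        rows = []
        for tr in tbl.findall(W + "tr"):
            cells = []
            for tc in tr.findall(W + "tc"):
                # One line per paragraph inside the cell
                paras = ["".join(t.text or "" for t in p.iter(W + "t"))
                         for p in tc.findall(W + "p")]
                cells.append("\n".join(paras))
            rows.append(cells)
        tables.append(rows)
    return tables

# Hypothetical usage:
# for table in docx_tables("test.docx"):
#     for row in table:
#         print(row)
```

Each returned table is a plain list of lists, so it can be passed to pandas.DataFrame the same way as the python-docx examples earlier in the thread.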