Use module docx to get text from a file with a table
Python Forum (https://python-forum.io), General Coding Help, thread /thread-38053.html

Use module docx to get text from a file with a table - Pedroski55 - Aug-28-2022

I made a simple function to get text from .docx files. It works OK. Now I have a file containing a text frame at the top, a little text and a big table. I can't get the text from the table or the text frame. Any tips on how that might be achieved?

    import docx

    def getText(filename):
        # Open the document first, then report how many paragraphs it has.
        doc = docx.Document(filename)
        print(len(doc.paragraphs))
        fullText = []
        # doc.paragraphs only covers body paragraphs, not tables or text frames.
        for para in doc.paragraphs:
            fullText.append(para.text)
        return '\n'.join(fullText)

    myfile = input('Enter the full path to the file you want ... ')
    text = getText(myfile)

This only gets free-standing text, not text in the frame or the table.

RE: Use module docx to get text from a file with a table - Larz60+ - Aug-28-2022

Please supply some sample data (something that causes the error).

RE: Use module docx to get text from a file with a table - Pedroski55 - Aug-28-2022

I found another module, docx2txt, which says it can get text from tables as well. I haven't had time to try it yet!

Workaround: save the .docx as text, use readlines(), then get the lines I want. Works!

RE: Use module docx to get text from a file with a table - deanhystad - Aug-29-2022

This finds tables in a document and converts them to dataframes.

    from docx import Document
    from docx.document import Document as _Document
    from docx.oxml.table import CT_Tbl
    from docx.table import _Cell, Table
    import pandas as pd

    def tables(parent):
        """Yield each table directly inside a document body or a table cell as a DataFrame."""
        if isinstance(parent, _Document):
            element = parent.element.body
        elif isinstance(parent, _Cell):
            element = parent._tc
        else:
            raise TypeError("parent must be a Document or a table cell")
        for child in element.iterchildren():
            if isinstance(child, CT_Tbl):
                table = Table(child, parent)
                data = [[cell.text for cell in row.cells] for row in table.rows]
                # The first row is used as the column headers.
                yield pd.DataFrame(data[1:], columns=data[0])

    for table in tables(Document('data.docx')):
        print(table, "\n")

I made a Word document with multiple paragraphs and two tables. The output from running the program accurately shows the two tables.

RE: Use module docx to get text from a file with a table - Pedroski55 - Aug-29-2022

Thanks! I looked in the docx docs but didn't see that information. Hope you don't mind if I copy your code!

RE: Use module docx to get text from a file with a table - deanhystad - Aug-29-2022

The docx library doesn't have a lot of documentation, nor do I think it should. If you want to know about the docx file format, read the Microsoft documentation.

I copied a lot of my example from a couple of posts on the web. Thinking there must be an easier way to parse through the document info, I wrote a small program to open a document, then used interactive Python to look at all the document parts. Lo and behold, Document has an attribute "tables" that is a list of all tables in the document! That sure makes things easy.

    from docx import Document
    import pandas as pd

    for table in Document("test.docx").tables:
        data = [[cell.text for cell in row.cells] for row in table.rows]
        # The first row is used as the column headers.
        print(pd.DataFrame(data[1:], columns=data[0]), "\n")

RE: Use module docx to get text from a file with a table - Pedroski55 - Aug-30-2022

Thanks for the info! I always use Libre Office, except when I need to interact with Python; then I save things in Excel or Word format. It is noticeable that Excel or Word documents are smaller than their Libre Office counterparts.

Once, I made a typo while saving and saved a document like this: test.docx_

Do that, then look at the file: it is a zip file. Open it. The document.xml inside contains the text; you just need a parser to get the text from the <w:t> elements, for example:

    <w:t>Hello me.</w:t>

Any other stuff must be in there too; it is just a matter of parsing the right XML tags. Python probably has an xml parser!

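The zip observation above can be checked without renaming the file. This is a minimal sketch using only the standard library; `list_docx_parts` is a hypothetical helper name, not part of python-docx.

```python
import zipfile

def list_docx_parts(path):
    """Return the names of the files stored inside a .docx (zip) archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

For a document saved by Word, the returned list would be expected to include word/document.xml, the part that holds the main text.
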
RE: Use module docx to get text from a file with a table - Larz60+ - Aug-30-2022

Pedroski55 Wrote: Python probably has an xml parser!

FYI: Python does indeed have an XML parser.

Search the Python docs at https://docs.python.org/3/library/index.html to see if there is a built-in package (you can use the search box at the top of that page, or search by subject matter). For XML, this will give you: https://docs.python.org/3/library/xml.html

Also, if you're looking for a particular Python package by subject, go to https://pypi.org/ and search for that subject.

Or do a Google search, for example "most popular python package for XML", then search PyPI for the package name (the top result for XML was 'lxml') at https://pypi.org/
Example search for lxml: https://pypi.org/search/?q=lxml
Or simply search for the subject from the PyPI homepage search box and use the filters to narrow down the list.

If there is a GitHub page for the chosen package, you can get there by clicking on the Homepage icon; if not, it may guide you to the package homepage (for lxml, the link is https://lxml.de/).

RE: Use module docx to get text from a file with a table - Pedroski55 - Aug-30-2022

Thanks! I'll look for a suitable xml parser and try to make my own docx miner!
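
A minimal sketch of such a "docx miner", assuming only the standard library (zipfile and xml.etree.ElementTree) and that the main text lives in word/document.xml inside the archive. The function name `docx_text` is hypothetical. Because it walks every `<w:p>` paragraph element wherever it sits, it should also pick up paragraphs inside table cells and text frames, which `doc.paragraphs` skips.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the <w:...> tags in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

def docx_text(path):
    """Return the text of every non-empty paragraph in a .docx, one per line."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    lines = []
    # iter() finds <w:p> at any depth: body, table cells, text frames.
    for para in root.iter(W + "p"):
        # A paragraph's text is split over one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        if text:
            lines.append(text)
    return "\n".join(lines)
```

Usage would be something like `print(docx_text("test.docx"))`. One caveat of this simple approach: a paragraph that anchors a text frame also contains the frame's paragraphs, and Word may store a frame in two alternative forms, so frame text can show up more than once in the output.
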