Oct-09-2019, 11:57 AM
Someone on a forum I visit daily asked for help extracting some table data from a docx file. Nothing big. There were a couple of pages of bash scripting, but I was looking for a Python solution. At first we thought it was just a text file; after a little more discussion we were told that the file is actually a docx document. So I tried the python-docx module and was disappointed. It was easy to use, but it managed to get less than a third of the data. The rest came back as empty strings. I don't know why. Perhaps encoding? After a few attempts, I gave up. The document is in my own language, of course.
So I did some quick research, and it turned out that the docx format is just a zip file with a bunch of XML files in it.
I imported just the xml and zipfile modules, and the job was done in 30 lines of code.
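For anyone who wants to try the same approach, here is a rough sketch of the idea, not my exact script. It assumes the tables live in word/document.xml inside the archive and use the standard WordprocessingML namespace; the function name read_docx_tables is just something made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_tables(docx_file):
    """Return every table as a list of rows, each row a list of cell strings.

    docx_file may be a path or a binary file-like object.
    Assumption: nested tables are not treated specially, so their rows
    also show up inside the rows of the table that contains them.
    """
    # A docx is a zip archive; the main body is stored as word/document.xml.
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W_NS + "tbl"):          # <w:tbl> = table
        rows = []
        for tr in tbl.iter(W_NS + "tr"):         # <w:tr> = table row
            cells = []
            for tc in tr.iter(W_NS + "tc"):      # <w:tc> = table cell
                # The text of a cell is split over several <w:t> runs;
                # join them (paragraphs are joined without a separator).
                text = "".join(t.text or "" for t in tc.iter(W_NS + "t"))
                cells.append(text)
            rows.append(cells)
        tables.append(rows)
    return tables
```

Calling read_docx_tables("report.docx") (the file name is only a placeholder) gives a list of tables, so the rows of the first table are in element [0] of the result.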
I asked myself why I even bothered with python-docx in the first place. I lost two hours looking for a solution.
Yes, it would have been easier if it had worked. But it didn't. Because of all that, I now know how to use the zipfile module and how to search for data in an XML file. I know what a docx file is and how to deal with it. I am satisfied. Not because of the python-docx module.