Python Forum

After exporting a PDF to Word, the file contains many unnecessary artifacts in the text:
- meaningless fragments of images
- empty single-cell tables

I need to implement the following solution with Python and Google Colab:
1. I upload a Word file using the “upload” button.
2. The interface shows a thumbnail of every image and table (one copy of each) with a checkbox next to it.
3. I uncheck the images and tables I don't need.
4. I confirm.
5. The unchecked images and tables are deleted from the Word file.
6. The Word file is downloaded to my PC automatically.

Script for removing images and tables from Word
https://colab.research.google.com/drive/...oKxazRKrbj
Sample file
https://disk.yandex.ru/i/VQkZzn7LQflE1Q

Tables are displayed in the interface, but images and figures from the Word file are not. Please help me find the error. What should I fix?

I have opened access to the file in Google Colab.

I have never used Colab, so I have no idea what that does.

You can parse XML files using BeautifulSoup. A Word .docx document is a zip archive of XML files.

Take your Word document and change the extension from .docx to .zip, so mydocument.docx becomes mydocument.zip. Now your OS should recognise it as a zip file.

Unpack the zip file to a folder like temp. Inside temp you will see a folder called mydocument, and inside mydocument a folder called word.

Look in word and you will find a file document.xml, which contains the text, the tables, the references to the images and most of the inline formatting of mydocument.docx. If you double click on document.xml, it should open in your browser. Have a look at it.
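The manual unpacking is only needed for browsing by hand. From Python, the standard zipfile module should be able to open the .docx directly. A minimal sketch: the function names read_document_xml and list_parts are made up for illustration, and it assumes document.xml is UTF-8 encoded, which is what Word normally writes.

```python
import zipfile


def read_document_xml(docx_path):
    """Return the contents of word/document.xml as a string."""
    # A .docx file is a zip archive, so zipfile can open it as-is.
    with zipfile.ZipFile(docx_path) as archive:
        # Assumption: the part is UTF-8 encoded.
        return archive.read("word/document.xml").decode("utf-8")


def list_parts(docx_path):
    """Return the names of all files stored inside the .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()
```

Calling list_parts on your file should show entries such as word/document.xml and anything under word/media/.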

Images are stored in word/media/
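This may be related to the images not showing up. As far as I can tell, document.xml does not hold the pictures themselves, only a relationship id (the r:embed attribute of an a:blip element), and word/_rels/document.xml.rels maps that id to a file under word/media/. A rough sketch of that lookup with the standard library only; the function name embedded_images is hypothetical, and it only covers DrawingML pictures, so older VML images or drawn shapes in your sample would not be found this way.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used inside a .docx package.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def embedded_images(docx_path):
    """Return (relationship id, path inside the archive) per picture, in document order."""
    with zipfile.ZipFile(docx_path) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        doc = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship Id -> Target, e.g. "rId5" -> "media/image1.png".
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{REL_NS}}}Relationship")
    }

    found = []
    for blip in doc.iter(f"{{{A_NS}}}blip"):
        rid = blip.get(f"{{{R_NS}}}embed")
        if rid not in targets:
            continue  # linked or broken picture, nothing to resolve
        target = targets[rid]
        if target.startswith("/"):
            # Absolute target: already relative to the archive root.
            part = target.lstrip("/")
        else:
            # Relative target: resolved against the word/ folder.
            part = posixpath.normpath(posixpath.join("word", target))
        found.append((rid, part))
    return found
```

The bytes of each picture can then be read from the archive with ZipFile.read(part) and handed to whatever displays the thumbnails.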

It won't be simple at first, but you can learn to edit xml, find what you want and change or remove it.
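Once document.xml has been changed, it has to go back into a .docx. One way that should work is to copy every entry of the original archive into a new file and swap in the edited XML. A minimal sketch with the standard library; the function name write_docx_with_new_document is hypothetical, and it assumes the edited XML is still valid WordprocessingML.

```python
import zipfile


def write_docx_with_new_document(src_path, dst_path, new_document_xml):
    """Copy src_path to dst_path, replacing the contents of word/document.xml.

    new_document_xml may be bytes or str (str is written as UTF-8).
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = new_document_xml
            # Reusing the ZipInfo keeps the entry name and compression type.
            dst.writestr(item, data)
```

Note that this leaves the files in word/media/ untouched, so the pictures you removed from document.xml still take up space in the archive unless you also drop them and their relationships.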

There are other Python tools for editing xml.

from bs4 import BeautifulSoup

path2xml = 'docx/docxFiles/temp/testme/word/document.xml'

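# read the xml (Word normally writes document.xml as UTF-8)
with open(path2xml, 'r', encoding='utf-8') as f:
    data = f.read()

# parse it with bs; the "xml" parser needs lxml to be installed
bs_data = BeautifulSoup(data, "xml")

# find all drawing elements (inline and floating pictures)
images = bs_data.find_all('drawing')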
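The empty one-cell tables from the question could be located in a similar way. The sketch below uses xml.etree.ElementTree from the standard library instead of BeautifulSoup, purely so that it runs without extra packages. The function name empty_single_cell_tables and the definition of "empty" (exactly one cell, no text, no drawing) are my assumptions, so adjust them to what your file actually contains.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def empty_single_cell_tables(document_xml):
    """Return the indices (in document order) of empty one-cell tables."""
    root = ET.fromstring(document_xml)
    hits = []
    for index, tbl in enumerate(root.iter(f"{{{W_NS}}}tbl")):
        # Count every cell below this table, nested tables included.
        cells = list(tbl.iter(f"{{{W_NS}}}tc"))
        # Any non-blank text run means the table is not empty.
        has_text = any((t.text or "").strip() for t in tbl.iter(f"{{{W_NS}}}t"))
        # A picture inside the cell also counts as content.
        has_drawing = next(tbl.iter(f"{{{W_NS}}}drawing"), None) is not None
        if len(cells) == 1 and not has_text and not has_drawing:
            hits.append(index)
    return hits
```

This only reads the XML. I would be careful about writing the tree back with ElementTree, because it can rename namespace prefixes when serialising, and Word may then refuse to open the file.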

rownong

Pedroski55