Python Forum
How to remove unwanted images and tables from a Word file using Python?
#1
After exporting from PDF to Word (via), the file contains many unnecessary artifacts in the text:
- meaningless parts of images
- empty tables of 1 cell

I need to implement the following workflow in Python and Google Colab:
1. I upload a Word file using the “upload” button.
2. The interface displays a thumbnail of every image and table (one copy each) with a checkbox next to it.
3. I uncheck the images and tables I do not need.
4. I confirm.
5. The images and tables I unchecked are deleted from the Word file.
6. The Word file is downloaded automatically to my PC.

Script for removing images and tables from Word
https://colab.research.google.com/drive/...oKxazRKrbj
Sample file
https://disk.yandex.ru/i/VQkZzn7LQflE1Q

Tables are displayed in the interface, but images and figures from the Word file are not.
Please help me find the error. What should I fix?
Larz60+ wrote Feb-02-2025, 10:00 AM:
Rather than pasting links,
please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
#2
I have opened access to the file in Google Colab.
#3
I have never used Colab, so I have no idea what that does.

You can parse XML files using BeautifulSoup. A Word .docx document is a zip archive of XML files.

Take a copy of your Word document and change the extension from .docx to .zip, so that mydocument.docx becomes mydocument.zip. Now your OS should recognise it as a zip file.

Unpack the zip file to a folder like temp. Look in temp. You will see a folder called mydocument. Look in mydocument and you will see a folder called word.

Look in word and you will find a file document.xml, which contains the text, the tables and nearly all settings for your file mydocument.docx, plus references to the images. If you double-click document.xml, it should open in your browser. Have a look at it.

Images are stored in word/media/
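The same layout can also be inspected without renaming or unpacking anything, because Python's standard zipfile module opens a .docx directly. The following is only a minimal sketch under that assumption: summarize_docx is a hypothetical helper name, and it simply counts the w:tbl and w:drawing elements in word/document.xml and lists the entries under word/media/.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by elements such as w:tbl and w:drawing
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def summarize_docx(docx_path):
    """Return the media file names plus table and drawing counts of a .docx."""
    with zipfile.ZipFile(docx_path) as archive:
        # Image files live under word/media/ inside the archive
        media = sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/")
        )
        # The main document part is always word/document.xml
        root = ET.fromstring(archive.read("word/document.xml"))
    tables = root.findall(f".//{{{W_NS}}}tbl")
    drawings = root.findall(f".//{{{W_NS}}}drawing")
    return {"media": media, "tables": len(tables), "drawings": len(drawings)}
```

Calling summarize_docx("mydocument.docx") returns a dictionary, which gives a quick check of whether the images you expect are really present as w:drawing elements before you try to display them.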

It won't be simple at first, but you can learn to edit xml, find what you want and change or remove it.

There are other Python tools for editing xml.

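from bs4 import BeautifulSoup

# path to document.xml inside the unpacked archive
path2xml = 'docx/docxFiles/temp/testme/word/document.xml'

# read the xml (document.xml is UTF-8 encoded)
with open(path2xml, 'r', encoding='utf-8') as f:
    data = f.read()

# parse the data with bs; the "xml" parser needs lxml installed
bs_data = BeautifulSoup(data, "xml")

# find all images; they sit in w:drawing elements
images = bs_data.find_all('w:drawing')

# tables are w:tbl elements
tables = bs_data.find_all('w:tbl')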
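For the empty one-cell tables mentioned in the first post, the removal step could look roughly like the sketch below. It uses only the standard library (zipfile and xml.etree.ElementTree) instead of BeautifulSoup, and the helper names is_empty_single_cell_table, strip_empty_tables and remove_empty_tables are hypothetical. It assumes that "empty" means a table with exactly one cell that holds neither text nor a drawing. One caveat: ElementTree only re-declares namespaces that are actually used in tag or attribute names, so a real Word export may need its original root declarations restored (or a library such as lxml or python-docx) before Word accepts the result.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Clark-notation prefix for the WordprocessingML main namespace
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def is_empty_single_cell_table(tbl):
    """True when the table has exactly one cell with no text and no drawing."""
    cells = tbl.findall(f".//{W}tc")
    if len(cells) != 1:
        return False
    cell = cells[0]
    has_text = any((t.text or "").strip() for t in cell.iter(f"{W}t"))
    has_drawing = cell.find(f".//{W}drawing") is not None
    return not has_text and not has_drawing


def strip_empty_tables(xml_bytes):
    """Return (new_xml_bytes, removed_count) for a document.xml payload."""
    # Register the original prefixes (w:, wp:, ...) so they survive serializing
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",)):
        ET.register_namespace(prefix, uri)
    root = ET.fromstring(xml_bytes)
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == f"{W}tbl" and is_empty_single_cell_table(child):
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="utf-8", xml_declaration=True), removed


def remove_empty_tables(src_path, dst_path):
    """Copy src_path to dst_path, dropping empty one-cell tables on the way."""
    removed = 0
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data, removed = strip_empty_tables(data)
            dst.writestr(item, data)  # every other part is copied unchanged
    return removed
```

Writing to a new file rather than overwriting the original keeps the source document intact if the result turns out not to open. Selected w:drawing elements could be removed with the same parent.remove pattern once you know which ones to drop.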