Feb-04-2025, 08:30 AM
I have never used Colab, so I have no idea what that does.
You can parse xml files using BeautifulSoup. A Word .docx file is really a zip archive containing several xml files.
Take your Word document and change the extension from .docx to .zip, so mydocument.docx becomes mydocument.zip. Now your OS should recognise it as a zip file.
Unpack the zip file to a folder like temp. Look in temp. You will see a folder called mydocument. Look in mydocument and you will see a folder called word.
Look in word and you will find a file document.xml, which contains all the text and tables, the references to the images, and most of the formatting for your file mydocument.docx. If you double click on document.xml, it should open in your browser. Have a look at it.
Images are stored in word/media/
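If you would rather not rename and unpack by hand, the same look inside the file can be done from Python with the standard zipfile module. This is only a rough sketch: the function names are my own, and it assumes the usual layout described above (word/document.xml and word/media/).

```python
import zipfile


def list_docx_parts(docx_path):
    """Return the names of all files stored inside the .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()


def read_document_xml(docx_path):
    """Return the contents of word/document.xml as text."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def list_media(docx_path):
    """Return the names of the files stored under word/media/ (the images)."""
    return [name for name in list_docx_parts(docx_path)
            if name.startswith("word/media/")]
```

Calling list_docx_parts('mydocument.docx') should show you the same files you see after unpacking by hand, and no renaming is needed because zipfile does not care about the extension.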
It won't be simple at first, but you can learn to edit xml, find what you want and change or remove it.
There are other Python tools for editing xml.
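One of those other tools is xml.etree.ElementTree from the standard library. Below is a minimal sketch, assuming you already have the text of document.xml in a string. The function names are made up for illustration, and paragraphs nested inside text boxes may be counted more than once, so treat it as a starting point.

```python
import xml.etree.ElementTree as ET

# Namespace used by the w: prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_texts(xml_text):
    """Return the text of each w:p paragraph, joining its w:t runs."""
    root = ET.fromstring(xml_text)
    texts = []
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        texts.append("".join(runs))
    return texts


def count_drawings(xml_text):
    """Count the w:drawing elements, which is where images are referenced."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//w:drawing", NS))
```

The namespace has to be spelled out because every tag in document.xml carries the w: prefix, and ElementTree matches on the full namespace rather than the prefix.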
from bs4 import BeautifulSoup

path2xml = 'docx/docxFiles/temp/testme/word/document.xml'
# read the xml
with open(path2xml, 'r') as f:
    data = f.read()
# read data with bs
bs_data = BeautifulSoup(data, "xml")
# find all images
images = bs_data.find_all('drawing')