Python Forum

Scraping the page without distorting content
I need to scrape a web page so that, when displayed offline in a browser, it matches the original page as closely as possible. The content must not be changed or distorted, apart from adjusting links. Executing JS with Selenium and/or browser add-ons therefore does not help, because saving the rendered page distorts the content. Is there a Python library that can help with this? Example: the page's JS computes a CSS address programmatically and then loads it into inline CSS.
I have not tried it, but there is this package: https://pypi.org/project/pywebcopy/
which promises to give you a complete copy (caveat: it adheres to robots.txt).
Thanks a lot, I'll try it now.
Unfortunately, it only works in the simplest cases.
If you don't have a lot of web pages to save, Firefox's 'File -> Save Page As' will save the entire page along with the images and other resources needed to reproduce it.
I have a large list of pages and need an automated approach. I can drive 'File -> Save Page As' or the WebScrapBook browser add-on with Selenium, but both methods unfortunately distort the content, since they execute the JS and save the rendered page rather than setting up the JS for local execution.
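For what it's worth, here is a rough sketch of the "don't execute JS, only touch the links" idea using just the standard library. It fetches nothing by itself: `rewrite_links` takes the raw HTML, finds statically referenced resources, and swaps only those quoted attribute values for local paths, leaving every other character as it was. The names `RESOURCE_ATTRS`, `ResourceCollector`, `local_name`, `rewrite_links`, `download_all` and the `assets/` layout are all made up for illustration. It will not catch a CSS address that JS computes at runtime, and it assumes attribute values are quoted and contain no HTML entities.

```python
import os
import posixpath
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# (tag, attribute) pairs treated as static resources; an assumed, incomplete list.
RESOURCE_ATTRS = {("link", "href"), ("script", "src"), ("img", "src")}


class ResourceCollector(HTMLParser):
    """Collect attribute values that point to static resources."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in RESOURCE_ATTRS:
                self.found.append(value)


def local_name(abs_url, index):
    """Hypothetical naming scheme: assets/<index>_<basename>."""
    name = posixpath.basename(urlparse(abs_url).path) or "resource"
    return f"assets/{index}_{name}"


def rewrite_links(html_text, base_url):
    """Return (rewritten_html, {absolute_url: local_path}).

    Only the quoted attribute values are replaced; the rest of the
    markup, including inline scripts, is left exactly as received.
    """
    collector = ResourceCollector()
    collector.feed(html_text)
    mapping = {}
    for raw in collector.found:
        abs_url = urljoin(base_url, raw)
        if urlparse(abs_url).scheme not in ("http", "https"):
            continue  # skip data: URIs and similar
        if abs_url not in mapping:
            mapping[abs_url] = local_name(abs_url, len(mapping))
        local = mapping[abs_url]
        for quote in ('"', "'"):
            html_text = html_text.replace(quote + raw + quote,
                                          quote + local + quote)
    return html_text, mapping


def download_all(mapping, out_dir):
    """Fetch every collected resource into out_dir (needs network access)."""
    for abs_url, local in mapping.items():
        target = os.path.join(out_dir, local)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        urllib.request.urlretrieve(abs_url, target)
```

For a list of pages, you would read each page's raw bytes with `urllib.request.urlopen`, decode them, call `rewrite_links`, write the result next to the `assets/` folder, and then call `download_all`. Because the replacement is a plain string substitution, the same quoted URL inside an inline script would be rewritten too, which may or may not be what you want.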