Scraping the page without distorting content

Scraping the page without distorting content - oleglpts - Dec-15-2021

I need to scrape a website page so that its offline display in a browser matches the original page as closely as possible. At the same time, the content must not be changed or distorted (apart from adjusting links). Executing the JS with Selenium and/or browser add-ons therefore does not help, because what gets saved is the rendered page, which is already distorted. Is there a Python library that can help solve this problem? Example: the page's JS computes a CSS address programmatically and then loads that CSS into inline CSS.

RE: Scraping the page without distorting content - Larz60+ - Dec-15-2021

I have not tried it, but there is this package: https://pypi.org/project/pywebcopy/
which promises to give you a complete copy (caveat: adheres to robots.txt)

RE: Scraping the page without distorting content - oleglpts - Dec-16-2021

Thanks a lot, I'll try it now.

RE: Scraping the page without distorting content - oleglpts - Dec-16-2021

Unfortunately, it only works in the simplest cases.

RE: Scraping the page without distorting content - Larz60+ - Dec-16-2021

If you don't have a lot of webpages to save, Firefox 'File->Save Page As' will save the entire page with all the images, etc. needed to reproduce it.

RE: Scraping the page without distorting content - oleglpts - Dec-16-2021

I have a large list of pages and need an automatic mode. I can use 'File->Save Page As' or the WebScrapBook browser add-on with Selenium, but both methods, unfortunately, distort the content: they execute the JS and save the rendered page, and they don't configure the JS for local execution.
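
For reference, here is a minimal sketch of the "change nothing except links" idea discussed in this thread. It is not a library recommendation and not a complete solution. It assumes the raw HTML was fetched with a plain HTTP client (no browser, so no JS is executed) and that the static resources were already downloaded. The function name rewrite_links and the local_paths mapping are hypothetical names made up for this illustration. Only the src/href values listed in the mapping are replaced; every other character, including the scripts, is left as it was. It does not handle the case from the first post, where the JS computes a resource address at run time.

```python
import re
from typing import Mapping
from urllib.parse import urljoin

# Matches src="..." / href='...' attribute values. The lookbehind keeps
# attributes such as data-src from being treated as src.
# Caveat: a plain regex cannot tell markup from text inside <script>, so a
# literal like href="..." inside a script string would also be matched.
_ATTR_RE = re.compile(
    r'''(?P<prefix>(?<![\w-])(?:src|href)\s*=\s*)(?P<quote>["'])(?P<url>.*?)(?P=quote)''',
    re.IGNORECASE,
)


def rewrite_links(html: str, base_url: str, local_paths: Mapping[str, str]) -> str:
    """Point known resource URLs at local files; leave all other text untouched.

    html        -- raw page source exactly as received from the server
    base_url    -- URL the page was fetched from, used to resolve relative links
    local_paths -- hypothetical mapping: absolute URL -> path of the saved copy
    """
    def _swap(match):
        absolute = urljoin(base_url, match.group("url"))
        local = local_paths.get(absolute)
        if local is None:
            # Unknown resource: return the original attribute unchanged.
            return match.group(0)
        quote = match.group("quote")
        return f'{match.group("prefix")}{quote}{local}{quote}'

    return _ATTR_RE.sub(_swap, html)
```

Because the page's scripts are kept verbatim, any JS that builds resource URLs at run time will still try to reach the original server when the saved page is opened offline. Handling that would mean analysing or patching the JS itself, which this sketch does not attempt.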