Python Forum
Scraping the page without distorting content
#1
I need to scrape a web page so that it displays in an offline browser as closely to the original as possible. At the same time, the content must not be changed or distorted (apart from adjusting links). Executing the JS with Selenium and/or browser add-ons therefore does not help, because saving the rendered page distorts the content. Is there a Python library that can help with this? Example: the page's JS computes a CSS address programmatically and then loads it into inline CSS.
#2
I have not tried it, but there is this package: https://pypi.org/project/pywebcopy/
which promises to give you a complete copy (caveat: adheres to robots.txt)
#3
Thanks a lot, I'll try it now.
#4
Unfortunately, it only works in the simplest cases.
#5
If you don't have a lot of web pages to save, Firefox's 'File -> Save Page As' will save the entire page with all the images, etc. needed to reproduce it.
#6
I have a large list of pages and need an automated approach. I could use 'File -> Save Page As' or the WebScrapBook browser add-on with Selenium, but both methods unfortunately distort the content: they execute the JS and save the rendered page, rather than configuring the JS for local execution.
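
For what it's worth, the "no JS execution, only links adjusted" part can be sketched with the standard library alone. This is only a rough sketch under assumptions, not a tested solution: the names `rewrite_asset_links`, `local_name`, `save_page`, the `assets/` folder and the extension list are all made up for illustration, it only handles quoted `src`/`href` values, and it does nothing about URLs that the page's JS computes at run time (the CSS example from the first post), so that case would still need separate handling.

```python
import hashlib
import posixpath
import re
from pathlib import Path
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen

# Quoted src="..." / href="..." values only (an assumption of this sketch).
# Note: this pattern also matches assignments like `el.href = "x.css"` in scripts.
ATTR_RE = re.compile(
    r'''(?P<prefix>\b(?:src|href)\s*=\s*(?P<q>["']))(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)

# Hypothetical list of extensions treated as static assets.
ASSET_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
                    ".svg", ".ico", ".woff", ".woff2"}


def local_name(absolute_url):
    """Build a stable local file name from the absolute URL."""
    ext = posixpath.splitext(urlsplit(absolute_url).path)[1].lower()
    digest = hashlib.sha1(absolute_url.encode("utf-8")).hexdigest()[:12]
    return f"assets/{digest}{ext}"


def rewrite_asset_links(html, base_url):
    """Return (html with asset links made local, {absolute_url: local_path}).

    Only the URL inside the quotes is replaced; every other character of the
    markup is left exactly as downloaded, and no JS is executed.
    """
    mapping = {}

    def replace(match):
        absolute = urljoin(base_url, match.group("url"))
        ext = posixpath.splitext(urlsplit(absolute).path)[1].lower()
        if ext not in ASSET_EXTENSIONS:
            return match.group(0)  # navigation links etc. stay untouched
        local = mapping.setdefault(absolute, local_name(absolute))
        return f'{match.group("prefix")}{local}{match.group("q")}'

    return ATTR_RE.sub(replace, html), mapping


def save_page(url, out_dir):
    """Download the raw page and its static assets (needs network access)."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset)
    new_html, mapping = rewrite_asset_links(html, url)
    out = Path(out_dir)
    (out / "assets").mkdir(parents=True, exist_ok=True)
    (out / "index.html").write_text(new_html, encoding=charset)
    for absolute, local in mapping.items():
        with urlopen(absolute) as resp:
            (out / local).write_bytes(resp.read())
```

Calling `save_page(url, folder)` in a loop over the page list would give the automatic mode; `rewrite_asset_links` can be tried on a string first to check that nothing except the asset URLs changes.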