Python Forum
web scraping for new additions/modified website?
#1
Hi,

Is it possible to scrape a website and see the new additions, with links?

I'm talking about cases where someone changes an image or adds a new page to their website.

If so, how?
Reply
#2
Yes, you can e.g. save the state of what you want to check to disk.
from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>The img element</h1>
  <img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600">
</body>'''

soup = BeautifulSoup(html, 'lxml')
# Save the current state (here the alt text) to disk for later comparison
img_tag = soup.select_one('img')['alt']  # Girl in a jacket
with open('img_tag.txt', 'w') as f_out:
    f_out.write(img_tag)
Then check like this.
from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>The img element</h1>
  <img src="img_girl.jpg" alt="Apple in snow" width="500" height="600">
</body>'''

soup = BeautifulSoup(html, 'lxml')
img_tag = soup.select_one('img')['alt']  # Apple in snow
# Compare the current state against the one saved on disk
with open('img_tag.txt') as f:
    old_tag = f.read()
    if old_tag == img_tag:
        print('No update')
    else:
        print(f'New image update: <{img_tag}>')
Output:
New image update: <Apple in snow>
You can run this manually or automate it on a schedule, e.g. with the schedule library (Python job scheduling for humans).
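A minimal sketch of such a schedule, assuming the comparison code above is wrapped in a check_site() function; the 30-minute interval is just an example.
import time

import schedule

def check_site():
    # Place the comparison code from above here
    print('Checking site for changes...')

# Run the check every 30 minutes
schedule.every(30).minutes.do(check_site)

while True:
    schedule.run_pending()
    time.sleep(1)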
Reply
#3
Thank you for the quick reply.

What if my task is the following:

1) Scrape/scan the code of https://www.ditur.dk/herre/sale/herreure for links with 'linktoken' in the address, e.g.

<button type="button" data-linktoken="xFXOjSAR3FTGeEVN" data-goto="https://www.ditur.dk/dissing-mk9-black-f...AR3FTGeEVN" class="jatak-campaign jatak-campaign--easteregg btn-link">
<img src="https://www.ditur.dk/skin/frontend/ditur/default/jataktilbud/campaigns/easteregg/easter-egg.svg?mt=1617104498">
</button>

2) Go to this link and scan the code for the price "799", e.g.

<div class="jataktilbud--price">799&nbsp;kr</div>

3) If the price is 799, open the website in a browser or just show the link in the console

4) Keep steps 1-3 looping


A lot of links will show up from step 1), but only one will show up from step 2), at a random time during the day.

Any ideas? Point me in the direction of functions/methods which could solve the above...
- I have some experience with Java, but not with Python, yet...

Thank you in advance for your input
Reply
#4
(Apr-14-2022, 09:06 AM)kingoman123 Wrote: - I have some experience with Java, but not with Python, yet...
If you want to do stuff like this, you must learn web scraping in Python; e.g. look at Web-Scraping part-1.
kingoman123 Wrote: What if my task is the following:
The price is generated by JavaScript, so you must use another tool like Selenium.
To give an example: the site is not easy, as you also must click a couple of buttons before it generates the price.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)
#--| Parse or automation
browser.get("https://www.ditur.dk/dissing-mk9-black-friday-limited-edition-d1456?linktoken=xFXOjSAR3FTGeEVN")
# Accept the cookie-consent banner
accept_button = browser.find_element(By.CSS_SELECTOR, 'button.coi-banner__accept')
accept_button.click()
# Close the popup overlay
close_button = browser.find_element(By.CSS_SELECTOR, 'div.fancybox-overlay.fancybox-overlay-fixed.ajaxcart-modal--overlay > div > div > a')
close_button.click()
# Give the JavaScript time to render the price
time.sleep(5)
price = browser.find_element(By.CSS_SELECTOR, 'div.jataktilbud--price')
print(price.text)
Output:
799 kr
Look at this Thread for Selenium v4 setup.
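Step 1 of your task, finding the campaign buttons, can be done without Selenium if the buttons are in the static HTML; a minimal sketch, where the button[data-linktoken] selector is an assumption based on the markup you posted.
import requests
from bs4 import BeautifulSoup

url = 'https://www.ditur.dk/herre/sale/herreure'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Every button carrying a data-linktoken attribute, per the markup above
for button in soup.select('button[data-linktoken]'):
    print(button.get('data-linktoken'), button.get('data-goto'))
If the buttons are injected by JavaScript, feed browser.page_source from Selenium to BeautifulSoup instead.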
Reply
#5
Here is another way, which catches the JSON response of an Ajax call.
This is more advanced stuff; you have to look at what the site sends over the network (e.g. in the browser dev tools' Network tab).
import requests

# Session cookies copied from a browser visit to the site
cookies = {
    'frontend': 'nb67okc1ltt5bvdrp9r3sm96fo',
    'frontend_cid': 'HUhHAlYWk0Ydp7rT',
}
params = {'product_id': '17227'}
response = requests.get('https://www.ditur.dk/jataktilbud/ajax/getPriceForProductAndToken', params=params, cookies=cookies)
print(response.json()['html'])
Output:
<div class="jataktilbud jataktilbud-type-easteregg"> <div class="jataktilbud--price">799 kr</div> <div class="jataktilbud--title">Tillykke!! Du har fundet et påskeæg. Du sparer 70%! <span class="jatak-campaign jatak-campaign--easteregg"> <img src="https://www.ditur.dk/skin/frontend/ditur/default/jataktilbud/campaigns/easteregg/easter-egg.svg?mt=1617104498" /> </span> </div> </div>
Also, to do this you need the product_id, which can be pulled from the original product page.
import requests
from bs4 import BeautifulSoup

url = 'https://www.ditur.dk/dissing-mk9-black-friday-limited-edition-d1456?linktoken=xFXOjSAR3FTGeEVN'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Match on the id prefix so the product id need not be known in advance
prod_id = soup.select_one('[id^=product-collection-image]')
print(prod_id.attrs.get('id').split('-')[-1])
Output:
17227
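Tying steps 1-4 together, a rough sketch only: the cookies, the Ajax endpoint and the id^= selector come from the posts above, while the button[data-linktoken] selector and the 10-minute pause are assumptions.
import time

import requests
from bs4 import BeautifulSoup

# Session cookies copied from a browser visit, as in the post above
cookies = {
    'frontend': 'nb67okc1ltt5bvdrp9r3sm96fo',
    'frontend_cid': 'HUhHAlYWk0Ydp7rT',
}

def find_campaign_links():
    # Step 1: collect (linktoken, target-url) pairs from the sale page
    response = requests.get('https://www.ditur.dk/herre/sale/herreure')
    soup = BeautifulSoup(response.content, 'lxml')
    return [(b.get('data-linktoken'), b.get('data-goto'))
            for b in soup.select('button[data-linktoken]')]

def product_id_from(url):
    # Pull the product_id out of the product page, as shown above
    soup = BeautifulSoup(requests.get(url).content, 'lxml')
    tag = soup.select_one('[id^=product-collection-image]')
    return tag.attrs['id'].split('-')[-1] if tag else None

def price_html(product_id):
    # Step 2: ask the Ajax endpoint for the campaign price html
    response = requests.get(
        'https://www.ditur.dk/jataktilbud/ajax/getPriceForProductAndToken',
        params={'product_id': product_id}, cookies=cookies)
    return response.json().get('html', '')

# Step 4: keep looping; sleep 10 minutes between rounds (an assumption)
while True:
    for token, url in find_campaign_links():
        prod_id = product_id_from(url)
        if prod_id and '799' in price_html(prod_id):
            print(f'Found it: {url}')  # Step 3: show the link in the console
    time.sleep(600)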
Reply

