Python Forum

Hi,

Is it possible to scrape a website and see what has been newly added, with links?

I mean, for example, if someone changes an image or adds a new page to their website.

If so, how?

Yes, you can e.g. save the state of what you want to check to disk.

from bs4 import BeautifulSoup

html = '''\
<html>
<body>
  <h1>The img element</h1>
  <img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600">
</body>
</html>'''

soup = BeautifulSoup(html, 'lxml')
img_tag = soup.select_one('img')['alt'] # Girl in a jacket
with open('img_tag.txt', 'w') as f_out:
    f_out.write(img_tag)

Then, on a later run, check the page against the saved value like this.

from bs4 import BeautifulSoup

html = '''\
<html>
<body>
  <h1>The img element</h1>
  <img src="img_girl.jpg" alt="Apple in snow" width="500" height="600">
</body>
</html>'''

soup = BeautifulSoup(html, 'lxml')
img_tag = soup.select_one('img')['alt']
with open('img_tag.txt') as f:
    old_tag = f.read()
    if old_tag == img_tag:
        print('No update')
    else:
        print(f'New image update: <{img_tag}>')

Output:
New image update: <Apple in snow>

You can run this manually or automate it on a schedule, e.g. with schedule (Python job scheduling for humans).
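
The save step and the check step above can also be folded into one function, so the same script works on every run. This is only a minimal sketch using the standard library; the names check_for_update and state_file are made up for illustration, and it assumes the value to watch (for example the alt text) has already been extracted.

```python
from pathlib import Path


def check_for_update(current_value, state_file):
    """Return True if current_value differs from the value saved in state_file.

    The new value is always written back, so the next call compares
    against the most recent state.
    """
    path = Path(state_file)
    # First run: nothing saved yet, so there is nothing to compare against.
    old_value = path.read_text(encoding='utf-8') if path.exists() else None
    path.write_text(current_value, encoding='utf-8')
    return old_value is not None and old_value != current_value
```

Calling check_for_update(img_tag, 'img_tag.txt') from a loop with time.sleep, or from a scheduler, would then report True only on the run where the value changed.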

Thank you for the quick reply.

What if my task is as follows:

1) Scrape/scan the source of https://www.ditur.dk/herre/sale/herreure# for links with "linktoken" in the address, e.g.

<button type="button" data-linktoken="xFXOjSAR3FTGeEVN" data-goto="https://www.ditur.dk/dissing-mk9-black-f...AR3FTGeEVN" class="jatak-campaign jatak-campaign--easteregg btn-link">
<img src="https://www.ditur.dk/skin/frontend/ditur/default/jataktilbud/campaigns/easteregg/easter-egg.svg?mt=1617104498">
</button>

2) Go to this link and scan the code for the price "799", e.g.

<div class="jataktilbud--price">799 kr</div>

3) If the price is 799, open the website in a browser or just show the link in the console

4) Keep looping over 1-3

Step 1) will turn up a lot of links, but only one of them will pass step 2), at a random time of day.

Any ideas, or pointers to functions/methods that could solve the above?
- I have some experience with Java, but not with Python yet...

Thank you in advance for your input

kingoman123 wrote (Apr-14-2022, 09:06 AM): Have some experience with java, but not with python, yet...

If you want to do stuff like this, you must learn web-scraping in Python first, e.g. look at Web-Scraping part-1.

kingoman123 wrote: What if my task is as following:

The price is generated by JavaScript, so you must use another tool such as Selenium.
Here is an example; the site is not easy, as you also have to click a couple of buttons before it generates the price.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)
#--| Parse or automation
browser.get("https://www.ditur.dk/dissing-mk9-black-friday-limited-edition-d1456?linktoken=xFXOjSAR3FTGeEVN")
accept_button = browser.find_element(By.CSS_SELECTOR, 'button.coi-banner__accept')
accept_button.click()
close_button = browser.find_element(By.CSS_SELECTOR, 'div.fancybox-overlay.fancybox-overlay-fixed.ajaxcart-modal--overlay > div > div > a')
close_button.click()
time.sleep(5)
price = browser.find_element(By.CSS_SELECTOR, 'div.jataktilbud--price')
print(price.text)

Output:
799 kr

Look at this Thread for setup in Selenium v4.

Here is another way, which catches the JSON response of an Ajax call.
This is more advanced, and you have to look at what the website sends over the network.

import requests

cookies = {
    'frontend': 'nb67okc1ltt5bvdrp9r3sm96fo',
    'frontend_cid': 'HUhHAlYWk0Ydp7rT',
}
params = {'product_id': '17227'}
response = requests.get('https://www.ditur.dk/jataktilbud/ajax/getPriceForProductAndToken', params=params, cookies=cookies)
print(response.json()['html'])

Output:
<div class="jataktilbud jataktilbud-type-easteregg">
  <div class="jataktilbud--price">799 kr</div>
  <div class="jataktilbud--title">Tillykke!! Du har fundet et påskeæg. Du sparer 70%! <span class="jatak-campaign jatak-campaign--easteregg">
      <img src="https://www.ditur.dk/skin/frontend/ditur/default/jataktilbud/campaigns/easteregg/easter-egg.svg?mt=1617104498" />
    </span>
  </div>
</div>
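
To compare against 799, the number still has to be pulled out of that HTML fragment. A minimal sketch with a regular expression; parse_price is a made-up helper name, and treating "." as a thousands separator is an assumption about how the site writes larger prices.

```python
import re


def parse_price(html_fragment):
    """Return the price in kr as an int, or None if no price div is found."""
    match = re.search(
        r'class="jataktilbud--price"[^>]*>\s*([\d.]+)\s*kr', html_fragment
    )
    if match is None:
        return None
    # Assumption: a '.' in the number is a thousands separator ('1.299 kr').
    return int(match.group(1).replace('.', ''))


fragment = '<div class="jataktilbud--price">799 kr</div>'
print(parse_price(fragment) == 799)  # True
```

With the requests example above, the call would be parse_price(response.json()['html']).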

Also, to do this starting from the original URL, you need the product_id.

import requests
from bs4 import BeautifulSoup

url = 'https://www.ditur.dk/dissing-mk9-black-friday-limited-edition-d1456?linktoken=xFXOjSAR3FTGeEVN'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Match on the id prefix so the product id does not have to be known in advance
prod_tag = soup.select_one('[id^="product-collection-image-"]')
print(prod_tag.attrs.get('id').split('-')[-1])

Output:
17227
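
Step 1 of the task (collecting the linktoken links from the listing page) was not shown above, so here is a rough sketch of it together with the price filter. It uses only the standard library's html.parser instead of BeautifulSoup, so it runs without extra installs. LinkTokenParser, find_token_links and find_bargains are made-up names, and get_price is a placeholder for whichever price lookup is used (Selenium or the Ajax call); it also assumes the listing HTML contains the buttons at all, which may not hold if they are added by JavaScript.

```python
from html.parser import HTMLParser


class LinkTokenParser(HTMLParser):
    """Collect data-goto URLs from tags that carry a data-linktoken attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if 'data-linktoken' in attr_map and attr_map.get('data-goto'):
            self.links.append(attr_map['data-goto'])


def find_token_links(html):
    """Return every data-goto URL found on a linktoken element."""
    parser = LinkTokenParser()
    parser.feed(html)
    return parser.links


def find_bargains(listing_html, get_price, wanted_price=799):
    """Return the token links whose price equals wanted_price.

    get_price is a caller-supplied function mapping a URL to a price
    (int) or None; it stands in for the Selenium or requests lookup.
    """
    return [url for url in find_token_links(listing_html)
            if get_price(url) == wanted_price]
```

For steps 3 and 4, the result of find_bargains could be printed or passed to webbrowser.open(url), with the whole thing wrapped in a while True loop that sleeps between rounds so the site is not hammered.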
