Hi,
Is it possible to scrape a website and detect new additions, including their links?
I mean cases where someone changes an image or adds a new page to their website.
If so, how?

Yes. You can, for example, save the state of whatever you want to check to disk.
from bs4 import BeautifulSoup
html = '''\
<html>
<body>
<h1>The img element</h1>
<img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600">
</body>
</html>'''
soup = BeautifulSoup(html, 'lxml')
img_tag = soup.select_one('img')['alt'] # Girl in a jacket
with open('img_tag.txt', 'w') as f_out:
    f_out.write(img_tag)
Then, on a later run, compare the current value against the saved one like this.
from bs4 import BeautifulSoup
html = '''\
<html>
<body>
<h1>The img element</h1>
<img src="img_girl.jpg" alt="Apple in snow" width="500" height="600">
</body>
</html>'''
soup = BeautifulSoup(html, 'lxml')
img_tag = soup.select_one('img')['alt']
with open('img_tag.txt') as f:
    old_tag = f.read()
if old_tag == img_tag:
    print('No update')
else:
    print(f'New image update: <{img_tag}>')
Output:
New image update: <Apple in snow>
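The same save-and-compare idea also covers the "new page" part of the question: keep the set of links seen so far and report only the ones that were not there before. A minimal sketch using just the standard library; the function name find_new_links and the JSON state file are assumptions made for this example, not something the site requires.

```python
import json
from pathlib import Path


def find_new_links(current_links, state_file):
    """Return links not seen on earlier runs, then remember all links seen so far."""
    path = Path(state_file)
    # Load the links saved by the previous run (empty set on the first run).
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    new_links = sorted(set(current_links) - seen)
    # Save the union so that each link is only reported once.
    path.write_text(json.dumps(sorted(seen | set(current_links))))
    return new_links
```

The list of current links could come from BeautifulSoup, e.g. [a['href'] for a in soup.select('a[href]')]. On the first run every link counts as new, because nothing has been saved yet.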
You can run the check manually or automate it on a schedule, e.g. with schedule (Python job scheduling for humans).
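A scheduler library is not strictly needed; a plain loop with time.sleep can do the polling too. A minimal sketch, where run_periodically and its parameters are names made up for this example:

```python
import time


def run_periodically(check, interval_seconds, max_runs=None):
    """Call check() every interval_seconds; stop after max_runs calls if given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        check()
        runs += 1
        # Only sleep if another run is going to follow.
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    return runs
```

For example, run_periodically(my_check, 600) would call a hypothetical my_check function every ten minutes until the script is stopped.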
Thank you for the quick reply.
What if my task is as follows:
1) Scrape the source of https://www.ditur.dk/herre/sale/herreure# for links with "linktoken" in the address, e.g.
<button type="button" data-linktoken="xFXOjSAR3FTGeEVN" data-goto="https://www.ditur.dk/dissing-mk9-black-f...AR3FTGeEVN" class="jatak-campaign jatak-campaign--easteregg btn-link">
<img src="https://www.ditur.dk/skin/frontend/ditur/default/jataktilbud/campaigns/easteregg/easter-egg.svg?mt=1617104498">
</button>
2) Go to that link and scan its source for the price "799", e.g.
<div class="jataktilbud--price">799 kr</div>
3) If the price is 799, open the page in a browser or just show the link in the console
4) Keep looping over steps 1-3
Step 1 will turn up a lot of links, but only one of them will pass step 2, at a random time of day.
Any ideas, or pointers to functions/methods that could solve the above?
- Have some experience with java, but not with python, yet...
Thank you in advance for your input
(Apr-14-2022, 09:06 AM) kingoman123 wrote: - Have some experience with java, but not with python, yet...
If you want to do stuff like this, you must learn web scraping in Python; for example, look at
Web-Scraping part-1.
kingoman123 wrote: What if my task is as following:
The price is generated by JavaScript, so you must use another tool such as Selenium.
Here is an example. The site is not an easy one, because you also have to click a couple of buttons before it generates the price.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
#--| Setup
options = Options()
#options.add_argument("--headless")
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)
#--| Parse or automation
browser.get("https://www.ditur.dk/dissing-mk9-black-friday-limited-edition-d1456?linktoken=xFXOjSAR3FTGeEVN")
accept_button = browser.find_element(By.CSS_SELECTOR, 'button.coi-banner__accept')
accept_button.click()
close_button = browser.find_element(By.CSS_SELECTOR, 'div.fancybox-overlay.fancybox-overlay-fixed.ajaxcart-modal--overlay > div > div > a')
close_button.click()
time.sleep(5)
price = browser.find_element(By.CSS_SELECTOR, 'div.jataktilbud--price')
print(price.text)
Output:
799 kr
Look at this thread for setup in Selenium v4.
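Step 1 of the task (collecting the linktoken links from the listing page) could be sketched like this. It assumes the buttons are present in the HTML you have in hand, which I have not verified for this site; if they are added by JavaScript, feed it browser.page_source from Selenium instead. The sketch uses the standard-library HTMLParser rather than BeautifulSoup, and the names LinkTokenParser and find_linktoken_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkTokenParser(HTMLParser):
    """Collect the data-goto URL of every <button> that has a data-linktoken attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == 'button' and 'data-linktoken' in attr_map:
            # data-goto may be missing or padded with whitespace.
            url = (attr_map.get('data-goto') or '').strip()
            if url:
                self.links.append(url)


def find_linktoken_links(html):
    """Return all data-goto URLs found on linktoken buttons in the given HTML."""
    parser = LinkTokenParser()
    parser.feed(html)
    return parser.links
```

With BeautifulSoup the same buttons can be selected with soup.select('button[data-linktoken]') and the URL read from each tag's 'data-goto' attribute.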
Here is another way, which catches the JSON response of an Ajax call.
This is more advanced stuff; you have to look at what the site sends over the network (e.g. in the browser's developer tools).
import requests
cookies = {
'frontend': 'nb67okc1ltt5bvdrp9r3sm96fo',
'frontend_cid': 'HUhHAlYWk0Ydp7rT',
}
params = {'product_id': '17227'}
response = requests.get('https://www.ditur.dk/jataktilbud/ajax/getPriceForProductAndToken', params=params, cookies=cookies)
print(response.json()['html'])
Output:
<div class="jataktilbud jataktilbud-type-easteregg">
<div class="jataktilbud--price">799 kr</div>
<div class="jataktilbud--title">Tillykke!! Du har fundet et påskeæg. Du sparer 70%! <span class="jatak-campaign jatak-campaign--easteregg">
<img src="https://www.ditur.dk/skin/frontend/ditur/default/jataktilbud/campaigns/easteregg/easter-egg.svg?mt=1617104498" />
</span>
</div>
</div>
To do this you also need the product_id, which can be read from the page at the original URL.
import requests
from bs4 import BeautifulSoup
url = 'https://www.ditur.dk/dissing-mk9-black-friday-limited-edition-d1456?linktoken=xFXOjSAR3FTGeEVN'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
prod_id = soup.select_one('#product-collection-image-17227')
print(prod_id.attrs.get('id').split('-')[-1])
Output:
17227
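Steps 2 and 3 of the task then come down to pulling the number out of the jataktilbud--price div and comparing it with 799. A minimal sketch with a regular expression; the names extract_price and is_target_price are made up for this example, and treating a dot as a thousands separator is an assumption about how the site formats larger prices.

```python
import re

# Matches e.g. <div class="jataktilbud--price">799 kr</div>
PRICE_RE = re.compile(r'class="jataktilbud--price"[^>]*>\s*(\d[\d.]*)\s*kr', re.IGNORECASE)


def extract_price(html_fragment):
    """Return the price in kr as an int, or None if no price div is found."""
    match = PRICE_RE.search(html_fragment)
    if match is None:
        return None
    # Assumption: a dot, if present, is a thousands separator (e.g. "1.299 kr").
    return int(match.group(1).replace('.', ''))


def is_target_price(html_fragment, target=799):
    """True if the fragment contains a price div equal to target."""
    return extract_price(html_fragment) == target
```

With the Ajax response from above this would be is_target_price(response.json()['html']). On a match you can print the URL or open it with webbrowser.open(url) from the standard library.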