Python Forum
web scraping for new additions/modified website?
#1
Hi,

Is it possible to scrape a website and see the new additions, with links?

I'm talking about cases where someone changes an image or adds a new page to their website.

If so, how?
Reply
#2
Yes, you can e.g. save the state of what you want to check to disk.
from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>The img element</h1>
  <img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600">
</body>'''

soup = BeautifulSoup(html, 'lxml')
# Save the current state (here the alt text) to disk for later comparison
img_tag = soup.select_one('img')['alt']  # Girl in a jacket
with open('img_tag.txt', 'w') as f_out:
    f_out.write(img_tag)
Then check like this.
from bs4 import BeautifulSoup

html = '''\
<body>
  <h1>The img element</h1>
  <img src="img_girl.jpg" alt="Apple in snow" width="500" height="600">
</body>'''

soup = BeautifulSoup(html, 'lxml')
img_tag = soup.select_one('img')['alt']  # Apple in snow
# Compare the current state against the one saved on disk
with open('img_tag.txt') as f:
    old_tag = f.read()
    if old_tag == img_tag:
        print('No update')
    else:
        print(f'New image update: <{img_tag}>')
Output:
New image update: <Apple in snow>
You can run this manually or automate it on a schedule, e.g. with the schedule library (Python job scheduling for humans).
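A minimal sketch of such a schedule, assuming the comparison code above is wrapped in a check_site() function; the 30-minute interval is just an example.
import time

import schedule

def check_site():
    # Place the comparison code from above here
    print('Checking site for changes...')

# Run the check every 30 minutes
schedule.every(30).minutes.do(check_site)

while True:
    schedule.run_pending()
    time.sleep(1)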
Reply
#3
Thank you for the quick reply.

What if my task is the following:

1) Scrape/scan the code of https://www.ditur.dk/herre/sale/herreure for links with 'linktoken' in the address, e.g.

<button type="button" data-linktoken="xFXOjSAR3FTGeEVN" data-goto="https://www.ditur.dk/dissing-mk9-black-f...AR3FTGeEVN" class="jatak-campaign jatak-campaign--easteregg btn-link">
<img src="https://www.ditur.dk/skin/frontend/ditur/default/jataktilbud/campaigns/easteregg/easter-egg.svg?mt=1617104498">
</button>

2) Go to this link and scan the code for the price "799", e.g.

<div class="jataktilbud--price">799&nbsp;kr</div>

3) If the price is 799, open the website in a browser or just show the link in the console

4) Keep steps 1-3 looping


A lot of links will show up from step 1), but only one will show up from step 2), at a random time during the day.

Any ideas? Point me in the direction of functions/methods which could solve the above...
- I have some experience with Java, but not with Python, yet...

Thank you in advance for your input
Reply
#4
(Apr-14-2022, 09:06 AM)kingoman123 Wrote: - I have some experience with Java, but not with Python, yet...
If you want to do stuff like this, you must learn web scraping in Python; e.g. look at Web-Scraping part-1.
kingoman123 Wrote: What if my task is the following:
The price is generated by JavaScript, so you must use another tool like Selenium.
To give an example: the site is not easy, as you also must click a couple of buttons before it generates the price.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)
#--| Parse or automation
browser.get("https://www.ditur.dk/dissing-mk9-black-friday-limited-edition-d1456?linktoken=xFXOjSAR3FTGeEVN")
# Accept the cookie-consent banner
accept_button = browser.find_element(By.CSS_SELECTOR, 'button.coi-banner__accept')
accept_button.click()
# Close the popup overlay
close_button = browser.find_element(By.CSS_SELECTOR, 'div.fancybox-overlay.fancybox-overlay-fixed.ajaxcart-modal--overlay > div > div > a')
close_button.click()
# Give the JavaScript time to render the price
time.sleep(5)
price = browser.find_element(By.CSS_SELECTOR, 'div.jataktilbud--price')
print(price.text)
Output:
799 kr
Look at this Thread for Selenium v4 setup.
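Step 1 of your task, finding the campaign buttons, can be done without Selenium if the buttons are in the static HTML; a minimal sketch, where the button[data-linktoken] selector is an assumption based on the markup you posted.
import requests
from bs4 import BeautifulSoup

url = 'https://www.ditur.dk/herre/sale/herreure'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Every button carrying a data-linktoken attribute, per the markup above
for button in soup.select('button[data-linktoken]'):
    print(button.get('data-linktoken'), button.get('data-goto'))
If the buttons are injected by JavaScript, feed browser.page_source from Selenium to BeautifulSoup instead.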
Reply
#5
Here is another way, which catches the JSON response of an Ajax call.
This is more advanced stuff; you have to look at what the site sends over the network (e.g. in the browser dev tools' Network tab).
import requests

# Session cookies copied from a browser visit to the site
cookies = {
    'frontend': 'nb67okc1ltt5bvdrp9r3sm96fo',
    'frontend_cid': 'HUhHAlYWk0Ydp7rT',
}
params = {'product_id': '17227'}
response = requests.get('https://www.ditur.dk/jataktilbud/ajax/getPriceForProductAndToken', params=params, cookies=cookies)
print(response.json()['html'])
Output:
<div class="jataktilbud jataktilbud-type-easteregg"> <div class="jataktilbud--price">799 kr</div> <div class="jataktilbud--title">Tillykke!! Du har fundet et påskeæg. Du sparer 70%! <span class="jatak-campaign jatak-campaign--easteregg"> <img src="https://www.ditur.dk/skin/frontend/ditur/default/jataktilbud/campaigns/easteregg/easter-egg.svg?mt=1617104498" /> </span> </div> </div>
Also, to do this you need the product_id, which can be pulled from the original product page.
import requests
from bs4 import BeautifulSoup

url = 'https://www.ditur.dk/dissing-mk9-black-friday-limited-edition-d1456?linktoken=xFXOjSAR3FTGeEVN'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Match on the id prefix so the product id need not be known in advance
prod_id = soup.select_one('[id^=product-collection-image]')
print(prod_id.attrs.get('id').split('-')[-1])
Output:
17227
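Tying steps 1-4 together, a rough sketch only: the cookies, the Ajax endpoint and the id^= selector come from the posts above, while the button[data-linktoken] selector and the 10-minute pause are assumptions.
import time

import requests
from bs4 import BeautifulSoup

# Session cookies copied from a browser visit, as in the post above
cookies = {
    'frontend': 'nb67okc1ltt5bvdrp9r3sm96fo',
    'frontend_cid': 'HUhHAlYWk0Ydp7rT',
}

def find_campaign_links():
    # Step 1: collect (linktoken, target-url) pairs from the sale page
    response = requests.get('https://www.ditur.dk/herre/sale/herreure')
    soup = BeautifulSoup(response.content, 'lxml')
    return [(b.get('data-linktoken'), b.get('data-goto'))
            for b in soup.select('button[data-linktoken]')]

def product_id_from(url):
    # Pull the product_id out of the product page, as shown above
    soup = BeautifulSoup(requests.get(url).content, 'lxml')
    tag = soup.select_one('[id^=product-collection-image]')
    return tag.attrs['id'].split('-')[-1] if tag else None

def price_html(product_id):
    # Step 2: ask the Ajax endpoint for the campaign price html
    response = requests.get(
        'https://www.ditur.dk/jataktilbud/ajax/getPriceForProductAndToken',
        params={'product_id': product_id}, cookies=cookies)
    return response.json().get('html', '')

# Step 4: keep looping; sleep 10 minutes between rounds (an assumption)
while True:
    for token, url in find_campaign_links():
        prod_id = product_id_from(url)
        if prod_id and '799' in price_html(prod_id):
            print(f'Found it: {url}')  # Step 3: show the link in the console
    time.sleep(600)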
Reply

