Python Forum

How to extract links from grid located on webpage
Hello,

How can I extract the links from the grid on this page:
Une sélection de concerts électroniques et électrisants (A selection of electronic and electrifying concerts)

I tried several class names, CSS selectors and IDs with Selenium, but nothing works.
Thanks in advance.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

url = 'https://www.arte.tv/fr/videos/RC-019798/electro-chillout/'
pattern = '.css-zqu0w1'
pattern_class = 'css-1tqpy7w'
pattern_class = 'css-1wbmdb2'
pattern_css = 'div.css-1tqpy7w:nth-child(1) > a:nth-child(1)'
pattern_class1 = 'css-1wbmdb2 [herf]'
#div.css-1tqpy7w:nth-child(2) > a:nth-child(1)
pattern_id = 'teaserItemLink'

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get(url)
grid = driver.find_elements(By.CLASS_NAME, pattern_class)
for item in grid:
    print(item.text)

aaa = driver.find_elements(By.CLASS_NAME, pattern_class1)
print(aaa)

bbb = driver.find_elements(By.ID, pattern_id)
print(bbb)
Like this: copy the CSS selector from the browser's developer tools to get the correct selector.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

#--| Setup
options = Options()
options.add_argument("--headless")
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)
#--| Parse or automation
url = 'https://www.arte.tv/fr/videos/RC-019798/electro-chillout/'
browser.get(url)
time.sleep(2)
link_1 = browser.find_element(By.CSS_SELECTOR, '#__next > div > main > div.css-pb7yb6 > div > div:nth-child(1)')
>>> print(link_1.text)
Regarder Superpoze ARTE Concert Festival 2022 52 min
52 min
Superpoze
ARTE Concert Festival 2022
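The second link uses the same selector, ending in div:nth-child(2):
link_2 = browser.find_element(By.CSS_SELECTOR, '#__next > div > main > div.css-pb7yb6 > div > div:nth-child(2)')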
>>> print(link_2.text)
Regarder La Fine Equipe Fête de l’Humanité 2020 60 min
60 min
La Fine Equipe
Fête de l’Humanité 2020
What I'm looking for are links ... not text:
[Image: arte-links-in-grid.jpg]
Use get_attribute() to get the href attribute.
link = browser.find_element(By.CSS_SELECTOR, '#__next > div > main > div.css-pb7yb6 > div > div:nth-child(1) > a')
>>> link.get_attribute('href')
'https://www.arte.tv/fr/videos/110984-006-A/superpoze/'
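To get every link in the grid rather than one nth-child at a time, a rough sketch along these lines might work. It swaps time.sleep(2) for an explicit wait, using the WebDriverWait and expected_conditions imports already present in the snippet above. The helper names unique_hrefs and grid_links are made up for illustration, and the selector in the closing comment is only an assumption (the selector above with :nth-child(1) dropped and "> a" appended); it has not been checked against the live page.

```python
def unique_hrefs(elements):
    """Return the non-empty href of each element, in page order, without duplicates."""
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


def grid_links(url, selector, timeout=10):
    """Open url in headless Chrome and return the hrefs of all elements matching selector."""
    # Selenium is imported here so unique_hrefs() stays usable without it installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        # Wait until at least one match is present instead of sleeping a fixed time.
        # Raises TimeoutException if nothing matches within `timeout` seconds.
        elements = WebDriverWait(browser, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
        )
        return unique_hrefs(elements)
    finally:
        # Always close the browser, even if the wait times out.
        browser.quit()


# Possible usage (the selector is an assumption, see above):
# for href in grid_links(
#     'https://www.arte.tv/fr/videos/RC-019798/electro-chillout/',
#     '#__next > div > main > div.css-pb7yb6 > div > div > a',
# ):
#     print(href)
```

If the wait times out, the selector probably no longer matches the page and needs to be copied from the browser again.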
Thanks!
It looks like you are trying to extract links from the given URL using Selenium and the Chrome WebDriver, but there are some issues with how you locate the elements.

Use By.CSS_SELECTOR to locate elements by CSS selector. The class 'css-1wbmdb2' seems to be the one that contains the links. Then find all anchor elements (links) within this class and read their 'href' attribute to get the URLs.

Here is your code, updated:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

url = 'https://www.arte.tv/fr/videos/RC-019798/electro-chillout/'

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get(url)

# Find all anchor elements within the specific class 'css-1wbmdb2'
links = driver.find_elements(By.CSS_SELECTOR, '.css-1wbmdb2 a')

# Extract and print the href attribute of each link
for link in links:
    href = link.get_attribute('href')
    print(href)

driver.quit()
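Class names such as 'css-1wbmdb2' look auto-generated and may well change when the site is rebuilt, so one alternative is to collect the href of every anchor on the page and filter by URL shape instead. The sketch below is only an illustration of that idea: the function names is_video_link and filter_video_links are hypothetical, and the '/fr/videos/' prefix is an assumption based on the one link shown earlier in this thread.

```python
from urllib.parse import urlparse


def is_video_link(href, prefix="/fr/videos/"):
    """Return True if href is an arte.tv URL whose path starts with prefix."""
    if not href:
        return False
    parts = urlparse(href)
    host = parts.hostname or ""
    on_arte = host == "arte.tv" or host.endswith(".arte.tv")
    return on_arte and parts.path.startswith(prefix)


def filter_video_links(hrefs, prefix="/fr/videos/"):
    """Keep only the hrefs that look like video pages, preserving order."""
    return [href for href in hrefs if is_video_link(href, prefix)]


# Example with hand-written values; in the script above the list could come from
# [a.get_attribute('href') for a in driver.find_elements(By.CSS_SELECTOR, 'a[href]')]
# collected before driver.quit() is called.
sample = [
    "https://www.arte.tv/fr/videos/110984-006-A/superpoze/",
    "https://www.arte.tv/fr/",
    "https://example.com/fr/videos/other/",
]
for href in filter_video_links(sample):
    print(href)
```

Note that the collection page's own URL also starts with '/fr/videos/', so it may need to be excluded separately.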