Python Forum

Full Version: Web scraper
Hi everyone, this is my first time ever using a Python script. I would like to create a web scraper to scrape prices, names, dates, and similar data from multiple websites, and my goal is to record the information in Excel sheets. Is it possible? I would also like to ask for your help and, if possible, for you to teach me how to code it.
Welcome to the forum @tomenzo123 ,
(Feb-03-2022, 12:03 PM)tomenzo123 Wrote: Is it possible?
Yes, Python is a good choice for this.
(Feb-03-2022, 12:03 PM)tomenzo123 Wrote: Also I would like to ask your help everyone and if possible teach me how to code it.
Well, that is a bit difficult. You say you have no experience with Python. Do you have experience with other languages? It will take a few weeks to reach the desired level.
First you must start learning Python, so begin with the tutorial. Ask questions on this forum if you get stuck.
Next you must learn how to scrape information from web pages. You can read about it in the "Beautiful Soup Documentation".
Then write the data to Excel; there are several ways to do that. Tell us when you have finished with the previous documents.
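To give an idea of what the Beautiful Soup step looks like, here is a minimal sketch; the HTML snippet and the class names in it are made-up examples, not taken from a real site.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page
html = """
<div class="offer">
  <span class="name">Hotel Alfa</span>
  <span class="price">499.00</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
name = soup.find("span", class_="name").text           # -> "Hotel Alfa"
price = float(soup.find("span", class_="price").text)  # -> 499.0
print(name, price)
```

With a real page you would pass the downloaded HTML (e.g. from requests) to BeautifulSoup instead of a literal string.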
Okay, I will check those out and try to code it.
(Feb-03-2022, 12:03 PM)tomenzo123 Wrote: I would like to create an web scraper, to scrap prices, names, dates, and stuff like that from multiple websites, and my goal is to book the information into excel sheets
Here is a post where I show a demo of this task.
Use common tools for scraping; look at Web-Scraping part-1.
Then Pandas is great for all kinds of data formats and can, for example, export directly to Excel with df.to_excel.
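As a sketch of that export step (the column names and values here are invented for illustration):

```python
import pandas as pd

# Hypothetical scraped records; in practice these come from the parsing step
records = [
    {"name": "Hotel Alfa", "price": 499.0, "date": "2022-02-09"},
    {"name": "Hotel Beta", "price": 725.5, "date": "2022-02-16"},
]
df = pd.DataFrame(records)
# to_excel needs an Excel writer engine such as openpyxl installed
df.to_excel("hotels.xlsx", index=False)
```

One DataFrame per sheet keeps things simple; pd.ExcelWriter lets you write several sheets into one workbook.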
Quote: Hi everyone, so this is my first time ever using a Python script
You should spend some time getting basic knowledge of Python; the link from ibreeden covers that topic.
I'm kind of stuck now. The code opens the website but doesn't record any data or create an Excel file. What should I do?
Use code tags when you post code.
Your URL has an anchor-tag part, so that will never work:
driver.get = ("<a href=https://www.flipkart.com/laptops-store?otracker=nmenu_sub_Electronics_0_Laptops")
Your loop does not work, as a will be an empty list; you have to test stuff like this, and a simple print(a) would show it.
Then of course you have to fix this first, and not write more code inside the loop.
for a in soup.findAll('a', href=True, attrs={'class' : '_3YgSsQ'}):
    print(a)
Output:
[]
I can write a test example to get some output.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
url = "https://www.flipkart.com/laptops-store?otracker=nmenu_sub_Electronics_0_Laptops"
browser.get(url)
time.sleep(3)
soup = BeautifulSoup(browser.page_source, 'lxml')
title = soup.find('title')
design_pc = soup.find('div', class_="_3ZYowz _2CfYpZ")
print(title.text)
design_pc_1 = design_pc.find(class_="_4ddWXP _3BCh3_")
print(design_pc_1.text)
Output:
Laptops - Biggest Deals on Laptops Online at Best Price in India | Flipkart.com Lenovo Ideapad Flex 5 Core i3 11th Gen - (8 GB/256 GB SSD/Windows...4.4(294)₹52,990₹75,49029% off
So I changed the code a bit, and I'm not sure what's wrong with it. I tried searching for the issue but nothing happens. Would you mind looking at what I've done so far?

# Only the imports actually used below; the others were unused or duplicated
import re
import requests
import xlsxwriter
from bs4 import BeautifulSoup

url = 'https://www.tui.lt/?departureCityId=389090&arrivalCountryId=18498&arrivalRegionIds=&arrivalCityIds=&hotelIds=&minStartDate=2022-02-09&maxStartDate=2022-02-22&minNightsCount=7&maxNightsCount=14&adults=2&children=&searchLevel=&isGeoInfoRequired=false&type=country'

r = requests.get(url, headers={"User-Agent": "Chrome"})
soup = BeautifulSoup(r.text, "html.parser")
xurl = soup.find_all("a", {"class": "paginator__page paginator__page--active"})


workbook = xlsxwriter.Workbook('Hoteliai.xlsx')
workpro = workbook.add_worksheet('HotelName')
workimg = workbook.add_worksheet('location')
workspec = workbook.add_worksheet('Features')
workrew = workbook.add_worksheet('Price')
pro_list = ["product_id", "name(en-gb)", "location", "adults", "children", "rate", "date", "price"]
workpro.write_row(0, 0, pro_list)
img_list = ["product_id", "image", "sort_order"]
workimg.write_row(0, 0, img_list)
atr_list = ["product_id", "attribute_group", "attribute", "text(en-gb)"]
row = 1
imgrow = 1
ocd = 1
rows = 1

category = 88
proid = 200
atrribute_group = "Tui deals"
location = "Egypt"
adults = 2
children = 0
rate = 4
date = "2022-02-22"  # note: 2022/2/22 would be integer division, not a date
price = "1000.00"

for ocd in range(1, 48):
  url = "https://www.tui.lt" + xurl[0]["href"] + "&page=" + str(ocd)
  m = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
  soup = BeautifulSoup(m.text, "html.parser")
  link = soup.find_all("a", {"class": "catalog-taxons-product__image-anchor"})
  nextlink = "https://www.tui.lt" + link[0]["href"]
  print("\n" "--------change page---------")
  print(url)
  print("--------change page---------")
  for number in range(48):
    nextlink = "https://www.tui.lt" + link[number]["href"]
    r = requests.get(nextlink, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, "html.parser")
    productcode = soup.findAll("div",{"class":"product-righter"})
    sku = re.sub(r'\D', '', productcode[0].p.text.strip())
    pavad = soup.find("h1").text.strip()
    print("\n" "--------++++++---------")
    print(nextlink)
    print("--------++++++---------")
    workpro.write(row, 0, proid)
    workpro.write(row, 1, category)
    workpro.write(row, 2, location)
    workpro.write(row, 3, adults)
    workpro.write(row, 4, children)
    workpro.write(row, 5, rate)
    workpro.write(row, 6, date)
    price = soup.findAll("span", {'class': 'price'})
    if price != []:
        pricenr = (price[0].text.strip()
                   .replace('€', '').replace('vnt.', '')
                   .replace('\n', '').replace('/ ', '')
                   .replace(',', '.'))
    else:
        print("no price")
    containers = soup.findAll("div", {"class": "site-block inner-content"})
    table = soup.findAll("td")
    find = soup.find_all("div", {'class': 'ck-info-tooltip-wrap'})
    for div in find:
        div.decompose()
    syntaxes = ["Number of people", "date", "location", "price"]
    nodata = ""
workbook.close()
print("ALL DONE !")
Hope you all don't mind, but I'm following this thread because I'm a newbie to Python and want to learn. The advice on here is great, thank you.
If you are a beginner in Python, you should first get the basic concepts of Python web development clear. After that you can move on to web scraping.

Then you can follow these steps to create a web scraper in Python that extracts prices, names, dates, and other information from multiple websites and saves the data into Excel sheets:

1.> Choose a Web Scraping Library: Popular libraries for web scraping include Beautiful Soup and Scrapy. Beautiful Soup is more suitable for simple scraping tasks, while Scrapy is a more powerful framework for larger projects.

2.> Analyze the Website Structure: Inspect the HTML structure of the websites you want to scrape using your browser's developer tools. This will help you identify the HTML elements that contain the data you need.

3.> Write the Scraper Code: Use the chosen library to write Python code that fetches the HTML content of the web pages, parses the data, and extracts the desired information.

4.> Store Data in Excel: After extracting the data, you can use Python libraries like pandas or openpyxl to create and manipulate Excel files. You can organize the data into pandas dataframes and then export them to Excel format.
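Putting the steps above together, a minimal end-to-end sketch might look like this. The HTML sample and class names are invented; in a real scraper the page content would come from requests.get(url).text instead of an inline string.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Inline sample standing in for a fetched page (real code: requests.get(url).text)
html = """
<div class="offer"><span class="name">Hotel Alfa</span><span class="price">499.00</span><span class="date">2022-02-09</span></div>
<div class="offer"><span class="name">Hotel Beta</span><span class="price">725.50</span><span class="date">2022-02-16</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 3: parse each offer block and extract the fields
rows = []
for offer in soup.find_all("div", class_="offer"):
    rows.append({
        "name": offer.find("span", class_="name").text,
        "price": float(offer.find("span", class_="price").text),
        "date": offer.find("span", class_="date").text,
    })

# Step 4: organize into a DataFrame and export to Excel
df = pd.DataFrame(rows)
df.to_excel("offers.xlsx", index=False)
```

For multiple websites, you would repeat the fetch-and-parse step per site (each with its own selectors) and append the rows to the same list before exporting.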