Python Forum
Scrape for html based on url string and output into csv
#11
So far you have helped me put together the code below - thank you for that.

import csv
import requests
import datetime
import time

from requests import get
from bs4 import BeautifulSoup


with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader)

    count = 0
    
    for row in reader:
        
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        url = f'https://www.somedomain.com/result?country=en&q={row[1]}'
        
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
        content = get(url, headers=headers,  cookies=cookies).content
        soup = BeautifulSoup(content, "lxml")

        table_info = soup.select_one('.table-info')

        mail = table_info.select_one('.col-2 a[href^=mailto]')
        mail = mail.get('href')
        mail_clean = mail.split(':')[1]

        website = soup.find(text='Website:')
        website = table_info.select_one('.col-2 a[target^=_blank]')
        website = website.get('href') 
        
        collected_data = row[1], mail_clean, website, timestamp

        data_list = [["Regcode", "Email", "Website", "Timestamp"],collected_data]
        with open('extracted.csv', 'w', newline='') as file:
            writer = csv.writer(file, delimiter=';')
            writer.writerows(data_list)

        print(row[1], "|", mail_clean,"|", website,"|", timestamp)
        #print("Waiting 3 seconds...")
        #time.sleep(3)
        count+=1

        
The code sort of works, but there are problems with the CSV-writing part.
I am not sure how to write a new line to the CSV without overwriting the same line over and over again. I messed up the loop and I am not sure how to fix it.
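
For reference, the overwriting most likely comes from opening extracted.csv in 'w' mode inside the loop, which truncates the file on every iteration. A minimal sketch of one way around it is to open the file once, write the header once, and then write one line per result. The helper name write_results and the sample rows are made up for illustration; in the real script the rows would come from the scraping loop.

```python
import csv

HEADER = ["Regcode", "Email", "Website", "Timestamp"]


def write_results(rows, out_path="extracted.csv"):
    """Open the output file once, write the header, then one line per row.

    `rows` is any iterable of (regcode, email, website, timestamp) tuples.
    write_results is a hypothetical helper, not part of the original script.
    """
    with open(out_path, "w", newline="", encoding="utf8") as file:
        writer = csv.writer(file, delimiter=";")
        writer.writerow(HEADER)
        for row in rows:
            writer.writerow(row)


if __name__ == "__main__":
    # Placeholder rows standing in for the scraped data.
    sample = [
        ("1001", "info@example.com", "https://example.com", "2021-01-12 20:11:00"),
        ("1002", "mail@example.org", "https://example.org", "2021-01-12 20:11:03"),
    ]
    write_results(sample)
```

Because the file is opened only once, each writer.writerow call adds a new line instead of replacing the previous one.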

Second, in the future it needs to crawl only new entries (every week), so I think it also needs to check extracted.csv each time and skip duplicates before it adds a new line to extracted.csv.
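
A rough sketch of what that check could look like, assuming the regcode in the first column is what identifies a duplicate: read the regcodes already stored in extracted.csv into a set, then open the file in append mode and write only rows whose regcode is not in the set. The helper names load_seen_codes and append_new_rows are hypothetical.

```python
import csv
import os

HEADER = ["Regcode", "Email", "Website", "Timestamp"]


def load_seen_codes(out_path="extracted.csv"):
    """Return the set of regcodes already stored (empty set if no file yet)."""
    if not os.path.exists(out_path):
        return set()
    with open(out_path, newline="", encoding="utf8") as file:
        reader = csv.reader(file, delimiter=";")
        next(reader, None)  # skip the header line, if there is one
        return {row[0] for row in reader if row}


def append_new_rows(rows, out_path="extracted.csv"):
    """Append only rows whose regcode is not stored yet; return how many were added."""
    seen = load_seen_codes(out_path)
    # The header is needed only when the file is missing or empty.
    need_header = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
    added = 0
    with open(out_path, "a", newline="", encoding="utf8") as file:
        writer = csv.writer(file, delimiter=";")
        if need_header:
            writer.writerow(HEADER)
        for row in rows:
            if row[0] in seen:
                continue  # already extracted in an earlier run
            writer.writerow(row)
            seen.add(row[0])
            added += 1
    return added
```

In the scraping loop itself, the same set could be checked before the request is sent (for example `if row[1] in seen: continue`), so that pages for already-known regcodes are not downloaded again.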

I hope you can give me a hint.
Thank you, buddy.


Messages In This Thread
RE: Scrape for html based on url string and output into csv - by dana - Jan-12-2021, 08:11 PM