How to exclude certain links based on keywords while webscraping

I am trying to webscrape Google search results. My job is to find and scrape Google search results based on keywords that have been provided to me in a CSV file.

My code searches Google for those keywords and gets the first three links, but it also scrapes some links which are not required.

I have another CSV file, a negative list, which contains certain keywords. If, while scraping, the code finds a Google search result whose text also appears on the negative list, it should skip that result and not add it to the database.

For example, if the negative list contains justdial and the code finds it in a result, that result should not be added to the database.
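The check described above can be sketched on its own, independent of Selenium. This is not the poster's code, just a minimal illustration of the idea: load the negative keywords from the CSV (the helper and file layout here are assumptions, one keyword per row in the first column), then reject any link text that contains one of them.

```python
import csv

def load_negative_list(path):
    """Read negative keywords from the first column of a CSV file."""
    with open(path, newline="") as f:
        return [row[0].strip().lower() for row in csv.reader(f) if row]

def is_allowed(link_text, negatives):
    """Return True only if no negative keyword appears in the link text."""
    text = link_text.lower()
    return not any(neg in text for neg in negatives)

# demo with an inline list instead of a file
negatives = ["justdial", "quikr"]
print(is_allowed("www.justdial.com/jobs", negatives))  # False: contains "justdial"
print(is_allowed("www.naukri.com/jobs", negatives))    # True
```

Only results for which `is_allowed` returns True would then be inserted into the database.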

So far I have had no success.

Below is my code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
import csv
import time
from itertools import groupby, chain
from operator import itemgetter
import sqlite3

final_data = []

def getresults():
    global final_data
    conn = sqlite3.connect("Jobs_data.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS naukri(id INTEGER PRIMARY KEY, KEYWORD text, LINK text,
                            CONSTRAINT number_unique UNIQUE (KEYWORD, LINK))""")
    cur = conn.cursor()
    #chrome_options = Options()
    #chrome_options.binary_location = '/Applications/Google Chrome Chrome Canary'
    driver = webdriver.Chrome("./chromedriver")
    negatives = negativelist("junk.csv")  # keywords that disqualify a result
    with open("./terms12.csv", "r") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            keywords = row[0]
            try:
                # the search URL was stripped from the post; a plain Google query is assumed here
                driver.get("https://www.google.com/search?q=" + keywords)
                links = driver.find_elements_by_class_name("g")[:3]
                for link in links:
                    for element in link.find_elements_by_class_name("iUh30"):
                        text = element.text
                        # skip any result whose text contains a negative-list keyword
                        if any(neg in text for neg in negatives):
                            continue
                        final_data.append([keywords, text])
                        cur.execute("INSERT OR IGNORE INTO naukri VALUES (NULL,?,?)", (keywords, text))
                conn.commit()
            except Exception as e:
                print("Error while scraping:", e)
    driver.quit()
    return final_data

def negativelist(file):
    # read the negative keywords (first column, one per row) into a list
    sublist = []
    with open("./" + file, "r") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            sublist.append(row[0])
    return sublist

def readfile(alldata, filename):
    with open("./" + filename, "w", encoding="utf-8", newline="") as csvfile:
        writer = csv.writer(csvfile, delimiter=",")
        for row in alldata:
            writer.writerow(row)
def main():
    getresults()
    # one output row per keyword: [keyword, link1, link2, ...]
    readfile([[k, *chain.from_iterable(r for _, *r in g)]
              for k, g in groupby(final_data, key=itemgetter(0))], "Naukri.csv")

if __name__ == "__main__":
    main()
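As an aside, the groupby/chain comprehension passed to readfile() merges per-link rows into one row per keyword. A standalone sketch with made-up data (the rows here are hypothetical, in the [keyword, link] shape the scraper produces; note groupby requires the rows to already be ordered by keyword):

```python
from itertools import groupby, chain
from operator import itemgetter

# example rows as the scraper would produce them: [keyword, link]
rows = [
    ["python jobs", "site-a.com"],
    ["python jobs", "site-b.com"],
    ["sql jobs", "site-c.com"],
]

# group consecutive rows by keyword, then flatten each group's links into one row
merged = [[k, *chain.from_iterable(r for _, *r in g)]
          for k, g in groupby(rows, key=itemgetter(0))]

print(merged)  # [['python jobs', 'site-a.com', 'site-b.com'], ['sql jobs', 'site-c.com']]
```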
