Web scraping links and then entering those links to scrape data
#1
Hi, I am having difficulty scraping links and then visiting those links to scrape data.

The first pass, which collects the links, is done, but entering each of those links and collecting data from inside them is another difficulty.

I have attached my first-tier code, but I can't figure out the second-tier code.

I'd appreciate any help.

Thanks.

from bs4 import BeautifulSoup
import requests
import re


class Google:
    @classmethod
    def search1(cls, search):
        url_list = []  # store every extracted URL

        for start in range(0, 10):
            # Google paginates results ten per page; the &start= parameter
            # sets the result offset for each page
            page = requests.get('https://www.google.com/search?q=' + search
                                + '&start=' + str(start * 10), timeout=5)
            soup = BeautifulSoup(page.content, "lxml")

            # result links are wrapped as /url?q=<real url>&<tracking params>
            for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
                a = re.split(r":(?=http)", link["href"].replace("/url?q=", ""))
                a = a[0].split("&")[0]  # strip Google's tracking parameters
                url_list.append(a)

        return url_list
#2
You would loop over your url_list and create a request for each URL in there, something like the sketch below.
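A minimal second-tier sketch, assuming search1 returns a list of plain HTML pages. "VP" is just an example query, and the h1/p tags are placeholders; swap in whatever elements your target pages actually use:

# Second tier: visit every URL the first tier collected and pull data
# out of each page.
results = []
for url in Google.search1("VP"):
    try:
        page = requests.get(url, timeout=5)
        page.raise_for_status()  # skip 4xx/5xx responses
    except requests.RequestException:
        continue  # don't let one dead link stop the whole run
    soup = BeautifulSoup(page.content, "lxml")
    title = soup.find("h1")
    results.append({
        "url": url,
        "title": title.get_text(strip=True) if title else "",
        "text": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
    })

The try/except keeps one unreachable link from aborting the whole crawl, which matters once url_list grows past a handful of entries.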
#3
Hi metulburr, thanks for your quick reply. I'll work on it, thanks a lot.
