 Web scraping links and then entering those links to scrape data
#1
Hi, I am having difficulty scraping links and then entering those links to scrape data.

Scraping the links themselves is done, but entering each of the links collected in that first pass and then gathering data from the pages behind them is where I am stuck.

I have attached my first-tier code, but I can't figure out the second-tier code.

I'd appreciate any help.

Thanks.

from bs4 import BeautifulSoup
import requests
import re


class Google:
    @classmethod
    def search1(cls, search):
        url_list = []          # store all the extracted URLs in a list
        title_list = []        # store all the extracted titles in a list (not yet filled in)
        description_list = []  # store all the extracted descriptions in a list (not yet filled in)

        for start in range(0, 10):
            # Google paginates results with the start parameter, ten per page;
            # letting requests build the query string also URL-encodes the search term
            page = requests.get('http://www.google.com/search',
                                params={'q': search, 'start': start * 10},
                                timeout=5)
            soup = BeautifulSoup(page.content, "lxml")

            # result links are wrapped as /url?q=<real URL>&<tracking params>
            for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
                a = re.split(r":(?=http)", link["href"].replace("/url?q=", ""))
                a = a[0].split("&")[0]  # strip the trailing tracking parameters
                url_list.append(a)

        return url_list
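
With the return added at the end, the first tier can be driven like this (the query string is only an example):

urls = Google.search1("python web scraping")
print(len(urls), "links collected")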
#2
You would loop over your url_list and create a request for each URL in there.
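
A minimal sketch of that idea, assuming search1 returns the url_list built above; the search term, the <title> extraction, and the CSV filename are placeholders, since what you pull out of each page depends on the target site:

import csv

import requests
from bs4 import BeautifulSoup

# first tier: collect the result links
urls = Google.search1("python forum")

# second tier: visit each collected link and scrape data from it
rows = []
for url in urls:
    try:
        page = requests.get(url, timeout=5)
        page.raise_for_status()
    except requests.RequestException:
        continue  # skip links that fail to load

    soup = BeautifulSoup(page.content, "lxml")
    # the <title> tag is just a placeholder; extract whatever fields you need
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    rows.append([url, title])

# write the scraped rows out
with open("second_tier.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title"])
    writer.writerows(rows)

Wrapping each request in a try/except keeps one dead or slow link from stopping the whole run.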
#3
Hi metulburr, thanks for your quick reply. I'll work on it, thanks a lot.
