Python Forum
[Intermediate] Key Word Scrapper with Python and Selenium
Hey, everyone. I've been writing these tutorials, mainly for myself to retain knowledge. I thought they might help someone else out. Please let me know if there are any problems, errors, etc.

Happy Holidays


TOPIC MI PYTHON : KW KEY WORD SEARCH BOT WITH PYTHON AND SELENIUM

Hey everybody, I've been having fun working with Selenium in Python. Now let's start putting it all together. Unfortunately, some websites don't want your bots doing weird stuff to them. So let's incorporate our web scraper into the anon redirect function we created earlier. That way at least our real IP shouldn't be banned.

This is a useful tool for SEO.

First, let's import everything we need.

import time
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
Now, instead of using standalone functions, let's make a class as a container to organize all our functions into a more coherent structure.

Below we define our class. Strictly speaking, Python's __init__ is an initializer rather than a constructor: the object already exists by the time __init__ runs, and __init__ just sets it up. Most people still call it the constructor, so for ease of my own understanding I'm going to do the same.

class KW_POST_BOT(object):
	#   INITIALIZER (PYTHON'S "CONSTRUCTOR")
	def __init__(self,browser,search_engine_url,kw_list,package):
		#   WEBDRIVER INSTANCE THAT DRIVES THE BROWSER
		self.browser = browser
		#   TEXT OF SCRAPE PAGE   UNUSED  UPON  INITIALIZATION
		self.package = package
		#   INSTANCE VAR FOR SEARCH ENGINE TO USE   UPON INITIALIZATION
		self.search_engine_url = search_engine_url
		#   INSTANCE VAR FOR KEY WORD LIST   UPON INITIALIZATION
		self.kw_list = kw_list
I called the class KW_POST_BOT. You can call it whatever you want, as long as you use the same name when you instantiate it later.
1.) The __init__ method sets up each new instance of the class.
2.) "browser" is an instance variable that holds the webdriver.
3.) "package" is a list that starts out empty. It is not used in this tutorial.
4.) "search_engine_url" is the search engine we will be using to get our data.
5.) "kw_list" is our list of keywords.
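On the constructor question: in Python the object is created by __new__ and then handed to __init__ to be filled in. Here is a minimal sketch of that order, using a toy Demo class that is only for illustration and is not part of the bot:

```python
class Demo(object):
    """Toy class that records the order in which Python calls its hooks."""

    calls = []  # class attribute, shared by every instance

    def __new__(cls, *args, **kwargs):
        # __new__ creates the (still empty) instance.
        Demo.calls.append("__new__")
        return super().__new__(cls)

    def __init__(self, name):
        # __init__ receives the already-created instance and fills it in.
        Demo.calls.append("__init__")
        self.name = name  # instance attribute, one per object


first = Demo("first")
second = Demo("second")
print(Demo.calls)
print(first.name, second.name)
```

Each Demo gets its own name, which is the same reason browser, kw_list and friends are per-instance values rather than class-wide ones.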

Now we define the main method for our class.
	def main(self):

		################    REDIRECT THROUGH KPROXY   #####################
		print("OPEN BROWSER")
		self.browser.get("http://www.kproxy.com")

		#   FIND ELEMENT ONCE, THEN REUSE IT (clear() AND send_keys() RETURN None)
		print("FIND ELEMENT BY ID MAINTEXTFIELD AND SEND SEARCH ENGINE URL")
		elem = self.browser.find_element_by_id("maintextfield")
		elem.clear()
		elem.send_keys(self.search_engine_url)
		elem.send_keys(Keys.ENTER)
		#  WAIT 2 SECONDS
		time.sleep(2)
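The above code defines the main method of our class. The object is going to redirect through KProxy and enter the URL of our preferred search engine.
1.) Define the main method. "self" is a reference to the object itself.
2.) Open the KProxy page in the browser (the webdriver.Chrome() instance we pass in later).
3.) Find and clear the element "maintextfield".
4.) Type the search engine URL into the field.
5.) Hit the Enter key.
6.) Wait 2 seconds.
Note: the find_element_by_* methods are the Selenium 3 API. Selenium 4.3 and later removed them in favor of find_element(By.ID, "maintextfield"), with By imported from selenium.webdriver.common.by.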
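Selenium's clear() and send_keys() both return None, so there is nothing useful to store from them. It is simpler to look the element up once and reuse that reference. The sketch below shows the pattern with FakeBrowser and FakeElement, which are hypothetical stand-ins so the example runs without a real browser. The fill_field helper is also made up for illustration.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (no browser needed)."""

    def __init__(self):
        self.value = ""

    def clear(self):
        self.value = ""  # returns None, like WebElement.clear()

    def send_keys(self, text):
        self.value += text  # returns None, like WebElement.send_keys()


class FakeBrowser:
    """Hypothetical stand-in for webdriver.Chrome()."""

    def __init__(self):
        self.lookups = 0
        self.field = FakeElement()

    def find_element_by_id(self, element_id):
        self.lookups += 1  # count how often the page is searched
        return self.field


def fill_field(browser, element_id, text):
    # Look the element up once, then reuse the same reference.
    field = browser.find_element_by_id(element_id)
    field.clear()
    field.send_keys(text)
    return field


browser = FakeBrowser()
field = fill_field(browser, "maintextfield", "http://www.bing.com")
print(field.value, browser.lookups)
```

With a real webdriver the same helper shape should apply, but that is an assumption this sketch does not test.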

Next we are going to define our main loop to loop through our keyword list.
		###   MAIN LOOP
		#   INJECT KEY WORD LIST

		#   LOOPS THROUGH KW_LIST AS KW
		for kw in self.kw_list:
			print("*" * 30)
			print(kw)
			print("*" * 30)
			print("INJECTING KEYWORD PAYLOAD")
			print("-" * 30)
			#   WAIT 3 SECONDS
			time.sleep(3)
			#   LOCATE SEARCH BOX ONCE AND CLEAR IT
			elem = self.browser.find_element_by_name("q")
			elem.clear()
			#   SEND KEYS :  KW VARIABLE
			elem.send_keys(kw)
			#   PRESS ENTER
			elem.send_keys(Keys.ENTER)
			#   WAIT 3 SECONDS
			time.sleep(3)
For each keyword (kw) in the keyword list (kw_list), we send that keyword to the search engine through the anonymizer.

1.) Make the for loop.
2.) Print the keyword currently being posted in the terminal.
3.) Wait 3 seconds.
4.) Locate element "q" and clear the field.
5.) Send the keyword from the keyword list.
6.) Press Enter.
7.) Wait 3 seconds.
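Fixed sleeps like the ones above either waste time or turn out too short on a slow proxy. One alternative is to poll for a condition until it holds. Selenium ships WebDriverWait for this. The helper below is a plain-Python sketch of the same idea, and wait_until and ready are hypothetical names that do not come from Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Toy condition that only succeeds on the third check.
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(ready, timeout=2.0, interval=0.01)
print(len(attempts))
```

In the bot, the condition could be something like "the search box named q is present", instead of always sleeping for 3 seconds.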

Next we are going to print the results of the search in the terminal.
			#   PRINT IN TERMINAL
			#   SEARCH RESULTS OF KEY WORD HITS
			hit_count = self.browser.find_element_by_class_name("sb_count")
			#   SEARCH RESULTS OF SNIPPET
			#snippet = self.browser.find_element_by_id("b_results")
			#   SEARCH RESULTS OF B_ALGO
			b_algo = self.browser.find_element_by_class_name("b_algo")
			#   TURN THE WEB OBJECT INTO TEXT AND ENCODE IN UTF-8
			b_algo_text = b_algo.text.encode("utf-8")
			print(b_algo_text)
			print(hit_count.text)
1.) hit_count is the element that holds the hit count.
2.) b_algo is the first element with the class "b_algo", which holds the description of the first result.
3.) b_algo_text is that element's text encoded in UTF-8.
4.) Print both the description and the hit count as text in the terminal.
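The hit count comes back as display text, not a number. If you want to sort or compare keywords by hits, you could convert it to an integer. This sketch assumes the text looks something like "About 1,230,000 results" with commas as thousands separators, which may not hold for every locale or page layout. parse_hit_count is a hypothetical helper name.

```python
import re


def parse_hit_count(text):
    """Return the first number in text as an int, or None if there is none."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(match.group(0).replace(",", ""))


print(parse_hit_count("About 1,230,000 results"))
print(parse_hit_count("no numbers here"))
```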

Now let's save those same results to our local hard drive.
			#   PRINT TO LOCAL FILE
			#   CREATE (IF NEEDED) AND OPEN LOCAL FILE IN APPEND MODE
			local_file = open("key_word_results.txt", "a")
			#   WRITE TO LOCAL FILE KW VARIABLE
			local_file.write(",\n " + kw)
			#   WRITE TO LOCAL FILE HIT COUNT
			local_file.write(",\n " + hit_count.text)
			#   WRITE TO LOCAL FILE DESCRIPTION AS ENCODED STRING
			local_file.write(",\n " + str(b_algo.text.encode("utf-8")))
			local_file.write("\n " + "*" * 30)
			local_file.close()
1.) First we open the local file "key_word_results.txt" in append mode ("a"), which creates it if it doesn't exist yet.
2.) Write the current keyword with a line break.
3.) Write the hit count with a line break.
4.) Write the description as a string of text encoded in UTF-8. I had problems writing Chinese characters and other non-ASCII text. I found that this method works for this instance.
5.) Close the file until the loop comes around again.
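On Python 3, calling str() on encoded bytes writes the b'...' representation into the file rather than the characters themselves. A possible alternative, assuming Python 3, is to open the file with encoding="utf-8" and write the text directly. append_result is a hypothetical helper, and the record layout is simplified compared to the one above.

```python
import os
import tempfile


def append_result(path, kw, hit_count, description):
    """Append one keyword record to path as UTF-8 text."""
    # encoding="utf-8" lets us write the text directly, without .encode().
    # The with block closes the file even if a write fails.
    with open(path, "a", encoding="utf-8") as local_file:
        local_file.write(kw + "\n")
        local_file.write(hit_count + "\n")
        local_file.write(description + "\n")
        local_file.write("*" * 30 + "\n")


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "key_word_results.txt")
    append_result(demo_path, "search 1", "1,230 results", "示例 description")
    with open(demo_path, encoding="utf-8") as check:
        saved = check.read()

print(saved)
```

In the bot you would pass kw, hit_count.text and b_algo.text as the three values.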

OK, we're almost done with our first keyword scraping bot, script, daemon, or whatever you feel like calling it. Let's define our keyword list as a list variable. Next we need to instantiate our class.
#   DEFINE KEY_WORDS LIST TO INPUT
key_words = ["search 1","search 2","search 3"]
#   INSTANTIATE KW_POST_BOT AS bot
bot = KW_POST_BOT(webdriver.Chrome(),"http://www.bing.com",key_words,[])
Now the list variable "key_words" has our search words in it. Put the keywords you are interested in into the list, separated by commas.
We instantiate the class as the object "bot" with four arguments: Chrome as the web driver, the search engine URL, the key_words list, and an empty package list that is unused in this tutorial.

Call the main method of the bot object:
bot.main()
Time to run our program and see what happens. You should see a Chrome browser window open up and go to the anonymizer URL. Then the keywords should be injected and the results both printed in the terminal and saved locally. The complete script is below.

print("*" * 30)
print("KW SEARCH BOT MIPython")
print("http://www.mipython.com")
print("*" * 30)

import time
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

#   DEFINE CLASS INHERIT FROM OBJECT 
class KW_POST_BOT(object):
	#   CONSTRUCTOR(?)
	def __init__(self,browser,search_engine_url,kw_list,package):
		self.browser = browser
		#   TEXT OF SCRAPE PAGE   UNUSED  UPON  INITIALIZATION
		self.package = package
		#   CLASS WIDE VAR FOR SEARCH ENGINE TO USE   UPON INITIALIZATION
		self.search_engine_url = search_engine_url
		#   CLASS WIDE VAR FOR KEY WORD LIST   UPON INITIALIZATION
		self.kw_list = kw_list
		
	def main(self):
	
		################    REDIRECT THROUGH KPROXY   #####################
		print("OPEN BROWSER")
		self.browser.get("http://www.kproxy.com")
		
		#   FIND ELEMENT
		print("FIND ELEMENT BY ID MAINTEXTFIELD AND SEND SEARCH ENGINE URL")
		elem = self.browser.find_element_by_id("maintextfield").clear()
		elem = self.browser.find_element_by_id("maintextfield").send_keys(self.search_engine_url)
		elem = self.browser.find_element_by_id("maintextfield").send_keys(Keys.ENTER)
		#  WAIT 2 SECONDS
		time.sleep(2)
		###   MAIN LOOP
		#   INJECT KEY WORD LIST
		
		#   LOOPS THROUGH KW_LIST AS KW
		for kw in self.kw_list:
			print("*" * 30)
			print(kw)
			print("*" * 30)
			print("INJECTING KEYWORD PAYLOAD")
			print("-" * 30)
			#   WAIT 3 SECONDS
			time.sleep(3)
			#   LOCATE ELEMENT AND CLEAR
			elem = self.browser.find_element_by_name("q").clear()
			#   LOCATE ELEMENT AND SEND KEYS :  KW VARIABLE
			elem = self.browser.find_element_by_name("q").send_keys(kw)
			#   PRESS ENTER
			elem = self.browser.find_element_by_name("q").send_keys(Keys.ENTER)
			#   WAIT 3 SECONDS
			time.sleep(3)
			
			#   PRINT IN TERMINAL
			#   SEARCH RESULTS OF KEY WORD HITS
			hit_count = self.browser.find_element_by_class_name("sb_count")
			#   SEARCH RESULTS OF SNIPPET
			#snippet = self.browser.find_element_by_id("b_results")
			#   SEARCH RESULTS OF B_ALGO
			b_algo = self.browser.find_element_by_class_name("b_algo")
			#   TURN THE WEB OBJECT INTO TEXT AND ENCODE IN UTF-8
			b_algo_text = b_algo.text.encode("utf-8")
			print(b_algo_text)
			print(hit_count.text)
			
			
			
			#   PRINT TO LOCAL FILE
			#   CREATE AND OPEN LOCAL FILE
			local_file = open( "date" + "_key_word_results.txt" , "a")
			#   WRITE TO LOCAL FILE KW VARIABLE
			local_file.write(",\n " + kw)
			#   WRITE TO LOCAL FILE HIT COUNT
			local_file.write(",\n " + hit_count.text)
			#   WRITE TO LOCAL FILE DESCRIPTION AS ENCODED STRING
			local_file.write(",\n " + str(b_algo.text.encode("utf-8")))
			local_file.write("\n " + "*" * 30)
			local_file.close()
			
			
#   DEFINE KEY_WORDS LIST TO INPUT
key_words = ["sample1","sample2","sample3"]
#   INSTANTIATE KW_POST_BOT AS bot
bot = KW_POST_BOT(webdriver.Chrome(),"http://www.bing.com",key_words,[])



bot.main()
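One thing to watch in the complete listing: "date" + "_key_word_results.txt" uses the literal word date, so the file is always named date_key_word_results.txt. If the intent is one results file per day (that is an assumption), a sketch like this could build the name. dated_filename is a hypothetical helper.

```python
import datetime


def dated_filename(suffix="_key_word_results.txt", day=None):
    """Build a file name such as 2020-08-01_key_word_results.txt."""
    if day is None:
        day = datetime.date.today()
    # isoformat() gives YYYY-MM-DD, which also sorts correctly by name.
    return day.isoformat() + suffix


print(dated_filename(day=datetime.date(2020, 8, 1)))
```

Calling dated_filename() with no arguments uses today's date.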