Python Forum
How to use BeautifulSoup to parse google search results
#1
I am trying to parse the first page of Google search results, specifically the title and the small summary that is provided for each result. Here is what I have so far:

import urllib.parse
from bs4 import BeautifulSoup
import requests

# Default Google search address start
address = 'https://google.com/#q='

# Open the text document that contains the question
file = open("OCR.txt", "rt")
word = file.read()
file.close()

# The question is on multiple lines, so this joins them together with proper spacing
myList = word.split('\n')
newString = ' '.join(myList)

print(newString)

# Encode the query string
qstr = urllib.parse.quote_plus(newString)

# Combine the base address and the encoded query
newWord = address + qstr

print(newWord)

source = requests.get(newWord)

soup = BeautifulSoup(source.text, 'lxml')
The part I am stuck on now is going down the HTML path to parse the specific data that I want. Everything I have tried so far has either thrown an error saying the object has no such attribute, or just given back "[]".

I am new to Python and BeautifulSoup, so I am not sure of the syntax to get to where I want. I have found that these are the individual search results on the page:

https://ibb.co/jfRakR

Any help on what to add to parse the Title and Summary of each search result would be MASSIVELY appreciated.

Thank you!
#2
The URL is like this: https://google.com/search?q=python+hello+world+tutorial

You may add some other options

See this
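
For example, a minimal sketch of building that kind of URL with urllib.parse.urlencode (the query string here is just an example):

from urllib.parse import urlencode

# Build a search URL of the form shown above; other options can go in the dict.
base = 'https://google.com/search'
params = {'q': 'python hello world tutorial'}
url = base + '?' + urlencode(params)
print(url)  # https://google.com/search?q=python+hello+world+tutorial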
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#3
(Dec-21-2017, 05:04 PM)wavic Wrote: The URL is like this: https://google.com/search?q=python+hello+world+tutorial

You may add some other options

See this

Hello, the issue is not with making the URL; so far I have that working fine. The issue is with using BeautifulSoup to parse data from said URL. I do not know the proper syntax of how to use .read() or .read_all() to get to the data that I want (the title and summary).
#4
Does this help?
soup.find_all("div.g")
#5
(Dec-21-2017, 06:36 PM)nilamo Wrote: Does this help?
soup.find_all("div.g")

Hello, thanks for the reply! When I do this:

source = requests.get(newWord)
soup = BeautifulSoup(source.text, 'lxml')
results = soup.find_all("div.g")
print(results)
All it prints is "None". That was the problem I was having as well.
#6
Quote:soup.find_all("div.g")
I'm pretty sure find_all gives no significance to a period, so it is actually searching for a <div.g> tag.

I'm assuming you mean this?

soup.find_all('div', {'class':'g'})
or the CSS selector soup.select('.g'), I think. I haven't checked it for verification.
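
For what it's worth, here is a minimal sketch of both forms on a made-up snippet (the real Google markup is different and changes often):

from bs4 import BeautifulSoup

# A made-up result block; Google's actual class names will differ.
html = """
<div class="g">
  <h3>Example title</h3>
  <span class="st">Example summary text.</span>
</div>
"""

soup = BeautifulSoup(html, 'lxml')

# Both of these match the div by its class, unlike find_all("div.g"):
print(soup.find_all('div', {'class': 'g'}))
print(soup.select('div.g'))

for div in soup.select('div.g'):
    print(div.h3.get_text())                          # Example title
    print(div.find('span', class_='st').get_text())   # Example summary text.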
#7
You are really making it difficult for yourself; Google uses a lot of JavaScript.
JavaScript is rendered in the browser, so when you see div class='g' in the browser it does not mean it will be in the downloaded source (Requests cannot render JavaScript).
You can try Selenium/PhantomJS; I did a quick test, and even with those tools it is difficult to parse the mess you get back.

So I would try to avoid parsing results from a Google search, and start training with something simpler.
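
If you do go the Selenium route, here is a rough sketch of the idea, assuming chromedriver is installed and that results still sit in div class='g' (which may not hold, given how often Google changes its markup):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # needs chromedriver on PATH
driver.get('https://google.com/search?q=python+hello+world+tutorial')

# Selenium hands back the page after the browser has run the JavaScript.
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

for result in soup.select('div.g'):
    title = result.find('h3')
    if title:
        print(title.get_text())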
#8
(Dec-21-2017, 07:07 PM)metulburr Wrote:
Quote:soup.find_all("div.g")
I'm pretty sure find_all gives no significance to a period, so it is actually searching for a <div.g> tag.

I'm assuming you mean this?

soup.find_all('div', {'class':'g'})
or the CSS selector soup.select('.g'), I think. I haven't checked it for verification.

Hey! Thank you for your reply! When I try this, all it returns is:

"[]"

Thanks for the tip though, this has been really racking my brain.
#9
Quote:When I try this, all it returns is:

"[]"
Then it probably is using JavaScript, and you are only left with Selenium as an option.

I didn't know the results might be JavaScript, though.
#10
(Dec-21-2017, 07:33 PM)metulburr Wrote:
Quote:When I try this, all it returns is:

"[]"
Then it probably is using JavaScript, and you are only left with Selenium as an option.

I didn't know the results might be JavaScript, though.

Do you mind telling me how I would implement Selenium in my current code, or at least pointing me to a tutorial where someone uses it to scrape the titles and summaries? Thank you!

