How to use BeautifulSoup to parse google search results
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html), Thread: /thread-7117.html
How to use BeautifulSoup to parse google search results - DevinGP - Dec-21-2017

I am trying to parse the first page of google search results. Specifically, the Title and the small Summary that is provided. Here is what I have so far:

    from urllib.request import urlretrieve
    import urllib.parse
    from urllib.parse import urlencode, urlparse, parse_qs
    import webbrowser
    from bs4 import BeautifulSoup
    import requests

    address = 'https://google.com/#q='  # Default Google search address start
    file = open("OCR.txt", "rt")  # Open text document that contains the question
    word = file.read()
    file.close()

    myList = [item for item in word.split('\n')]
    newString = ' '.join(myList)  # The question is on multiple lines so this joins them together with proper spacing
    print(newString)

    qstr = urllib.parse.quote_plus(newString)  # Encode the string
    newWord = address + qstr  # Combine the base and the encoded query
    print(newWord)

    source = requests.get(newWord)
    soup = BeautifulSoup(source.text, 'lxml')

The part I am stuck on now is going down the HTML path to parse the specific data that I want. Everything I have tried so far has just thrown an error saying that it has no attribute or it just gives back "[]". I am new to Python and BeautifulSoup so I am not sure of the syntax for getting to where I want.

I have found that these are the individual search results in the page: https://ibb.co/jfRakR

Any help on what to add to parse the Title and Summary of each search result would be MASSIVELY appreciated. Thank you!

RE: How to use BeautifulSoup to parse google search results - wavic - Dec-21-2017

The URL is like this: https://google.com/search?q=python+hello+world+tutorial

You may add some other options. See this

RE: How to use BeautifulSoup to parse google search results - DevinGP - Dec-21-2017

(Dec-21-2017, 05:04 PM)wavic Wrote:
The URL is like this: https://google.com/search?q=python+hello+world+tutorial

Hello, the issue is not with making the URL, so far I have that working fine.
The issue is with using BeautifulSoup to parse data from said URL. I do not know the proper syntax on how to use .read() or .read_all() to get to the data that I want (the title and summary).

RE: How to use BeautifulSoup to parse google search results - nilamo - Dec-21-2017

Does this help?

    soup.find_all("div.g")

RE: How to use BeautifulSoup to parse google search results - DevinGP - Dec-21-2017

(Dec-21-2017, 06:36 PM)nilamo Wrote:
Does this help? soup.find_all("div.g")

Hello, thanks for the reply! When I do this:

    source = requests.get(newWord)
    soup = BeautifulSoup(source.text, 'lxml')
    results = soup.find_all("div.g")
    print(results)

All it prints is "None". That was the problem I was having as well.

RE: How to use BeautifulSoup to parse google search results - metulburr - Dec-21-2017

Quote:
soup.find_all("div.g")

I'm pretty sure that a period has no special meaning in find_all, so it is actually searching for a <div.g> tag. I'm assuming you mean this?

    soup.find_all('div', {'class':'g'})

or the CSS selector

    soup.select('.g')

I think. I haven't checked it for verification.

RE: How to use BeautifulSoup to parse google search results - snippsat - Dec-21-2017

You are really making it difficult for yourself; Google uses a lot of JavaScript. JavaScript is rendered in the browser, so when you see div class='g' in the browser, it does not mean that it will be in the downloaded source (Requests cannot render JavaScript). You can try Selenium/PhantomJS; I did a quick test, and even with those tools it is difficult to parse the mess you get back. So I would avoid parsing results from a Google search and start training with something simpler.

RE: How to use BeautifulSoup to parse google search results - DevinGP - Dec-21-2017

(Dec-21-2017, 07:07 PM)metulburr Wrote:
Quote:
soup.find_all("div.g")
I'm pretty sure that a period has no special meaning in find_all, so it is actually searching for a <div.g> tag.

Hey! Thank you for your reply!
When I try this, all it returns is: "[]". Thanks for the tip though, this has been really racking my brain.

RE: How to use BeautifulSoup to parse google search results - metulburr - Dec-21-2017

Quote:
When I try this all it returns is:

Then it is probably using JavaScript and you are only left with Selenium as an option. I didn't know the results might be JavaScript though.

RE: How to use BeautifulSoup to parse google search results - DevinGP - Dec-21-2017

(Dec-21-2017, 07:33 PM)metulburr Wrote:
Quote:
When I try this all it returns is:
Then it is probably using JavaScript and you are only left with Selenium as an option.

Do you mind telling me how I would implement Selenium into my current code, or at least pointing me to a tutorial on someone using it to scrape the titles and summaries? Thank you!
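
For reference, two of the points raised in this thread can be checked offline without contacting Google. The sketch below is only illustrative: SAMPLE_HTML, the h3 and span class="st" tags, and extract_results are hypothetical names and markup invented for the example, not Google's actual page structure. It shows that text after "#" is a URL fragment rather than a query string (fragments are not sent to the server), and that find_all("div.g") looks for a tag literally named div.g, while find_all("div", class_="g") and select("div.g") match on the class.

```python
from urllib.parse import urlencode, urlsplit

from bs4 import BeautifulSoup

# Everything after '#' is a fragment. HTTP clients do not send it to the
# server, so a request to this URL carries no search terms at all.
fragment_url = "https://google.com/#q=python+hello+world+tutorial"
print(repr(urlsplit(fragment_url).query))     # empty query string
print(repr(urlsplit(fragment_url).fragment))  # the search terms live here

# The form wavic posted puts the terms in the query string instead.
search_url = "https://google.com/search?" + urlencode(
    {"q": "python hello world tutorial"}
)
print(search_url)

# Hypothetical markup for illustration only; real result pages differ
# and change over time.
SAMPLE_HTML = """
<div class="g"><h3>First title</h3><span class="st">First summary</span></div>
<div class="g"><h3>Second title</h3><span class="st">Second summary</span></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() takes a tag name, so this searches for a <div.g> element.
print(soup.find_all("div.g"))  # empty result

# Either of these matches <div class="g">.
by_attrs = soup.find_all("div", class_="g")
by_css = soup.select("div.g")


def extract_results(parsed):
    """Return (title, summary) pairs from the hypothetical markup above."""
    pairs = []
    for block in parsed.select("div.g"):
        title = block.find("h3")
        summary = block.find("span", class_="st")
        if title is not None and summary is not None:
            pairs.append((title.get_text(strip=True), summary.get_text(strip=True)))
    return pairs


results = extract_results(soup)
for title, summary in results:
    print(title, "|", summary)
```

Whether the same selectors find anything in a live response depends on the HTML the server actually returns, so printing source.text and inspecting it is a reasonable first step before deciding that Selenium is required.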