web scraping with python regular expression
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development, Thread: /thread-5265.html
web scraping with python regular expression - dbpython2017 - Sep-25-2017

Given a website, I want to use Python regular expressions to get data out of the web page. Although I could use other packages, my requirement is to use only regular expressions to get the required data from the page.

If I want to find the Mexican restaurants in Dallas, I go to this link:
https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1

From there I want to find (from the first page) each restaurant's name, its ranking, its number of reviews, etc. This information will be stored in a list of dictionaries.

My program looks as follows, but I'm stuck on how to loop and how to find the required data.

    import urllib.request
    from re import findall
    import re

    url = "https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1"
    response = urllib.request.urlopen(url)
    html = response.read()
    htmlStr = html.decode()

RE: web scraping with python regular expression - snippsat - Sep-25-2017

Added code tag, look at BBcode help.

(Sep-25-2017, 06:58 PM) dbpython2017 Wrote: my requirement is only to use regular expression to get the required data from webpage.

That's a bad requirement; regex alone is the wrong tool for HTML/XML. This funny post is a good read, and I have a tutorial here.

RE: web scraping with python regular expression - metulburr - Sep-25-2017

Quote: Although I can use other packages, my requirement is only to use regular expression to get the required data from webpage.

Who set this goal?

RE: web scraping with python regular expression - dbpython2017 - Sep-25-2017

(Sep-25-2017, 08:03 PM) snippsat Wrote: Added code tag, look at BBcode help.

I know that is a bad requirement, but that is what the homework instructions are. So please let me know if you have an example of scraping the data with regular expressions.

We learned regular expressions this week.
And therefore, for the homework, we need to use only regular expressions.

RE: web scraping with python regular expression - metulburr - Sep-25-2017

Just as long as you are aware that it is the wrong tool for the job. Here is how to get the name of each result from the page with Beautiful Soup, to show that the regex you will write is unnecessary, complex, and sometimes unreadable.

    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> r = requests.get('https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1')
    >>> soup = BeautifulSoup(r.text, 'html.parser')
    >>> lis = soup.find_all('li', {'class':'regular-search-result'})
    >>> for li in lis:
    ...     li.find('a', {'class':'biz-name js-analytics-click'}).text
    ...
    'Meso Maya Comida Y Copas'
    'Gabriela & Sofia’s Tex-Mex'
    'E Bar Tex-Mex'
    'Mi Camino Restaurante'
    'Desperados Mexican Restaurant'
    'Campuzano Mexican Food'
    'Pepe’s & Mito’s Mexican Café'
    'Mia’s Tex-Mex Restaurant'
    'Avilas Mexican Restaurant'
    'Mesero - Henderson'
    >>>

In that case I would research how to use regex:
https://www.summet.com/dmsi/html/readingTheWeb.html
You are going to inspect a lot from your browser.

RE: web scraping with python regular expression - snippsat - Sep-25-2017

Example to get reviews with regex.

    import urllib.request
    import re
    from pprint import pprint

    url = "https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1"
    response = urllib.request.urlopen(url)
    html = response.read()
    htmlStr = html.decode()

    # Match every "<number> reviews" string in the page source
    r = re.findall(r'\d+\s\breviews\b', htmlStr)
    pprint(r)
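The first post asks for a list of dictionaries, while the `\d+\s\breviews\b` pattern above only returns the review strings. A minimal sketch of one way to get there with regex alone might look like the following. It runs on a small hand-written `SAMPLE_HTML` string rather than the live page. The `regular-search-result` and `biz-name js-analytics-click` class names are taken from the Beautiful Soup example in this thread, but the `title="... star rating"` attribute and the rest of the markup are assumptions, as are the names `RESULT_RE` and `parse_results`. The real page source has to be inspected in the browser and the pattern adjusted to match it.

```python
import re
from pprint import pprint

# Hypothetical stand-in for htmlStr. Class names follow the Beautiful Soup
# example in this thread; the rating markup is an assumption.
SAMPLE_HTML = """
<li class="regular-search-result">
  <a class="biz-name js-analytics-click" href="/biz/a"><span>Meso Maya Comida Y Copas</span></a>
  <div class="i-stars" title="4.5 star rating"></div>
  <span class="review-count rating-qualifier">1234 reviews</span>
</li>
<li class="regular-search-result">
  <a class="biz-name js-analytics-click" href="/biz/b"><span>E Bar Tex-Mex</span></a>
  <div class="i-stars" title="4.0 star rating"></div>
  <span class="review-count rating-qualifier">567 reviews</span>
</li>
"""

# One pattern per search result, with a named group for each field.
# re.DOTALL lets ".*?" skip over the newlines between the tags.
RESULT_RE = re.compile(
    r'class="biz-name[^"]*"[^>]*>\s*(?:<span>)?(?P<name>[^<]+)'  # name
    r'.*?title="(?P<rating>\d+(?:\.\d+)?) star rating"'          # rating
    r'.*?(?P<reviews>\d+)\s+reviews?',                           # review count
    re.DOTALL,
)


def parse_results(html):
    """Return one dictionary per search result matched in html."""
    return [
        {
            "name": m.group("name").strip(),
            "rating": float(m.group("rating")),
            "reviews": int(m.group("reviews")),
        }
        for m in RESULT_RE.finditer(html)
    ]


restaurants = parse_results(SAMPLE_HTML)
pprint(restaurants)
```

With the real page, `parse_results(htmlStr)` would be called instead. `finditer` does the looping: each match covers one result block, so no explicit loop over `<li>` elements is needed. The pattern is brittle by design, which is the point made earlier in the thread about regex and HTML.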
RE: web scraping with python regular expression - dbpython2017 - Sep-26-2017

Thanks very much for your help. Now I know how to use regular expressions in this case. Very much appreciated.