web scraping with python regular expression

web scraping with python regular expression - dbpython2017 - Sep-25-2017

Given a website, I want to use Python regular expressions to extract data from the web page.
Although I could use other packages, my requirement is to use only regular expressions to get the required data from the page.

If I want to find the Mexican restaurants in Dallas, I go to this link:
https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1

From the first page of results I want to get each restaurant's name, its ranking, its number of reviews, etc.
This information will be stored in a list of dictionaries.
My program so far is as follows, but I'm stuck on how to loop over the page and find the required data.

import re
import urllib.request

url = "https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1"
# Download the page and decode the raw bytes to text (UTF-8 by default).
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = html.decode()

RE: web scraping with python regular expression - snippsat - Sep-25-2017

Added code tags; look at the BBCode help.

(Sep-25-2017, 06:58 PM)dbpython2017 Wrote: my requirement is only to use regular expression to get the required data from webpage.

That's a bad requirement; regex alone is the wrong tool for HTML/XML.
This funny post is a good read, and I have a tutorial here.

RE: web scraping with python regular expression - metulburr - Sep-25-2017

Quote: Although I can use other packages, my requirement is only to use regular expression to get the required data from webpage.

Who set this goal?

RE: web scraping with python regular expression - dbpython2017 - Sep-25-2017

(Sep-25-2017, 08:03 PM)snippsat Wrote: Added code tags; look at the BBCode help.

(Sep-25-2017, 06:58 PM)dbpython2017 Wrote: my requirement is only to use regular expression to get the required data from webpage.
That's a bad requirement; regex alone is the wrong tool for HTML/XML.
This funny post is a good read, and I have a tutorial here.

I know that is a bad requirement, but that is what the homework instructions say.
So please let me know if you have an example of scraping the data with regular expressions.

We learned regular expressions this week,
so for the homework we need to use only regular expressions.

RE: web scraping with python regular expression - metulburr - Sep-25-2017

Just as long as you are aware that it is the wrong tool for the job. Here is how to get the restaurant names from the page with Beautiful Soup, to show that the regex you end up writing will be unnecessarily complex and sometimes unreadable.

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1')
>>> soup = BeautifulSoup(r.text, 'html.parser')
>>> lis = soup.find_all('li', {'class':'regular-search-result'})
>>> for li in lis:
...     li.find('a', {'class':'biz-name js-analytics-click'}).text
... 
'Meso Maya Comida Y Copas'
'Gabriela & Sofia’s Tex-Mex'
'E Bar Tex-Mex'
'Mi Camino Restaurante'
'Desperados Mexican Restaurant'
'Campuzano Mexican Food'
'Pepe’s & Mito’s Mexican Café'
'Mia’s Tex-Mex Restaurant'
'Avilas Mexican Restaurant'
'Mesero - Henderson'
>>>

In that case I would research how to use regex:
https://www.summet.com/dmsi/html/readingTheWeb.html

You are also going to be inspecting the page source in your browser a lot.
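
For comparison, here is a rough regex-only sketch of the same name lookup. It is not a tested scraper for the live page: `sample_html` is a hypothetical fragment modeled on the class names used in the Beautiful Soup session above, and the `href` values, the nested `<span>`, and the assumption that `class` is the first attribute of the `<a>` tag are all illustrative guesses. The real markup should be checked in the browser first.

```python
import re
from html import unescape

# Hypothetical sample markup (an assumption, not copied from the live page).
# Only the class names come from the Beautiful Soup session above.
sample_html = """
<li class="regular-search-result">
  <a class="biz-name js-analytics-click" href="/biz/example-1"><span>Mesero - Henderson</span></a>
</li>
<li class="regular-search-result">
  <a class="biz-name js-analytics-click" href="/biz/example-2"><span>E Bar Tex-Mex</span></a>
</li>
<li class="regular-search-result">
  <a class="biz-name js-analytics-click" href="/biz/example-3"><span>Gabriela &amp; Sofia&#8217;s Tex-Mex</span></a>
</li>
"""

# Match an <a> tag whose class attribute starts with "biz-name" (assumed to be
# the first attribute), then capture everything up to the closing </a>.
name_pattern = re.compile(r'<a\s+class="biz-name[^"]*"[^>]*>(.*?)</a>', re.DOTALL)

def extract_names(html_text):
    """Return the visible text of every biz-name link found in html_text."""
    names = []
    for inner in name_pattern.findall(html_text):
        text = re.sub(r'<[^>]+>', '', inner)   # drop nested tags such as <span>
        names.append(unescape(text).strip())   # turn &amp; etc. back into characters
    return names

names = extract_names(sample_html)
for name in names:
    print(name)
```

Even for this one field the pattern has to deal with attribute order, nested tags, and HTML entities by hand, which is the extra complexity the Beautiful Soup version avoids.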

RE: web scraping with python regular expression - snippsat - Sep-25-2017

Example to get reviews with regex.

import re
import urllib.request
from pprint import pprint

url = "https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1"
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = html.decode()

# Find every run of digits followed by one whitespace character and the whole word "reviews".
r = re.findall(r'\d+\s\breviews\b', htmlStr)
pprint(r)

Output:
['251 reviews',
 '1209 reviews',
 '295 reviews',
 '389 reviews',
 '143 reviews',
 '351 reviews',
 '364 reviews',
 '394 reviews',
 '598 reviews',
 '341 reviews',
 '214 reviews']
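
The output above has 11 entries, while the Beautiful Soup session earlier in the thread lists 10 names, so a page-wide `findall` per field may not line up one-to-one with the restaurants. One way to keep fields together, sketched below, is to first cut the page into one block per result and then search inside each block, which also gives the list of dictionaries asked for in the first post. Everything about the markup here is an assumption: `sample_html`, the restaurant names, the `title="... star rating"` attribute, and the helper name `parse_results` are hypothetical and need to be adapted to what the browser inspector actually shows.

```python
import re

# Hypothetical sample markup and made-up restaurants (assumptions for
# illustration only; inspect the real page before reusing these patterns).
sample_html = """
<li class="regular-search-result">
  <a class="biz-name js-analytics-click" href="/biz/a"><span>Example Cantina</span></a>
  <div class="i-stars" title="4.5 star rating"></div>
  <span class="review-count">1209 reviews</span>
</li>
<li class="regular-search-result">
  <a class="biz-name js-analytics-click" href="/biz/b"><span>Sample Taqueria</span></a>
  <div class="i-stars" title="4 star rating"></div>
  <span class="review-count">1 review</span>
</li>
<li class="regular-search-result">
  <a class="biz-name js-analytics-click" href="/biz/c"><span>Placeholder Grill</span></a>
  <div class="i-stars" title="3.5 star rating"></div>
</li>
"""

# One block per search result. The non-greedy match stops at the first </li>,
# so this breaks if a result contains nested <li> elements.
block_pattern = re.compile(r'<li class="regular-search-result">(.*?)</li>', re.DOTALL)
name_pattern = re.compile(r'<a\s+class="biz-name[^"]*"[^>]*>\s*(?:<span>)?([^<]+)')
rating_pattern = re.compile(r'title="(\d+(?:\.\d+)?) star rating"')
reviews_pattern = re.compile(r'(\d+)\s+reviews?\b')

def parse_results(html_text):
    """Return a list of dicts with name, rating, and review count per result."""
    results = []
    for block in block_pattern.findall(html_text):
        name = name_pattern.search(block)
        if name is None:
            continue  # skip blocks that do not look like a restaurant entry
        rating = rating_pattern.search(block)
        reviews = reviews_pattern.search(block)
        results.append({
            "name": name.group(1).strip(),
            "rating": float(rating.group(1)) if rating else None,
            "reviews": int(reviews.group(1)) if reviews else None,
        })
    return results

results = parse_results(sample_html)
for row in results:
    print(row)
```

Missing fields come back as `None` instead of shifting the later values onto the wrong restaurant.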

RE: web scraping with python regular expression - dbpython2017 - Sep-26-2017

Thanks very much for your help.
Now I know how to use regular expressions in this case.

Very much appreciated.