Python Forum
web scraping with python regular expression
#1
Given a website, I want to use Python regular expressions to extract data from the web page.
Although other packages exist for this, my requirement is to use only regular expressions to get the data.

To find the Mexican restaurants in Dallas, I go to this link:
https://www.yelp.com/search?find_desc=Re...2C+TX&ns=1

From the first page of results I want to get each restaurant's name, its ranking, its number of reviews, etc.
This information will be stored in a list of dictionaries.
My program so far is below, but I am stuck on how to loop over the results and find the required data.
import re
import urllib.request

url = "https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1"
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = html.decode()
#2
Added code tags; look at the BBCode help.
(Sep-25-2017, 06:58 PM)dbpython2017 Wrote: my requirement is only to use regular expression to get the required data from webpage.
That's a bad requirement; regex alone is the wrong tool for HTML/XML.
This funny post is a good read, and I have a tutorial here.
#3
Quote: Although I can use other packages, my requirement is only to use regular expression to get the required data from webpage.
Who set this goal?
#4
(Sep-25-2017, 08:03 PM)snippsat Wrote: Added code tags; look at the BBCode help.
(Sep-25-2017, 06:58 PM)dbpython2017 Wrote: my requirement is only to use regular expression to get the required data from webpage.
That's a bad requirement; regex alone is the wrong tool for HTML/XML.
This funny post is a good read, and I have a tutorial here.

I know it is a bad requirement, but that is what the homework instructions say.
So please let me know if you have an example of scraping the data with regular expressions.

We learned regular expressions this week,
so for the homework we need to use only regular expressions.
#5
Just as long as you are aware that regex is the wrong tool for the job. Here is how to get the restaurant names from the page with Beautiful Soup; by comparison, the regex you end up writing will be unnecessarily complex and sometimes unreadable.
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1')
>>> soup = BeautifulSoup(r.text, 'html.parser')
>>> lis = soup.find_all('li', {'class':'regular-search-result'})
>>> for li in lis:
...     li.find('a', {'class':'biz-name js-analytics-click'}).text
... 
'Meso Maya Comida Y Copas'
'Gabriela & Sofia’s Tex-Mex'
'E Bar Tex-Mex'
'Mi Camino Restaurante'
'Desperados Mexican Restaurant'
'Campuzano Mexican Food'
'Pepe’s & Mito’s Mexican Café'
'Mia’s Tex-Mex Restaurant'
'Avilas Mexican Restaurant'
'Mesero - Henderson'
>>> 
In that case I would research how to use regex on web pages:
https://www.summet.com/dmsi/html/readingTheWeb.html

You are also going to have to inspect the page source a lot in your browser.
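
As a rough sketch of what that research leads to, a capture group can pull out the same names the Beautiful Soup loop above returns. The HTML fragment here is an assumption made up for illustration (only the class names come from the session above); the real page source will differ, so adjust the pattern to what the browser inspector shows.

```python
import re

# Hypothetical fragment shaped like one search result.
# Only the class names are taken from the Beautiful Soup example; the rest is assumed.
sample = (
    '<li class="regular-search-result">'
    '<a class="biz-name js-analytics-click" href="/biz/example">'
    '<span>E Bar Tex-Mex</span></a>'
    '</li>'
)

# [^>]* skips the remaining attributes of the <a> tag.
# (.*?) is a non-greedy capture group: it stops at the first </span>.
name_pattern = re.compile(
    r'<a class="biz-name js-analytics-click"[^>]*><span>(.*?)</span>'
)

# With exactly one capture group, findall returns a list of the captured strings.
names = name_pattern.findall(sample)
print(names)
```

The pattern breaks as soon as the attribute order or nesting changes, which is exactly why a parser is the better tool outside of homework.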
#6
Here is an example that gets the review counts with regex.
import re
import urllib.request
from pprint import pprint

url = "https://www.yelp.com/search?find_desc=Restaurants+Mexican&find_loc=Dallas%2C+TX&ns=1"
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = html.decode()

# One or more digits, a whitespace character, then the whole word "reviews"
r = re.findall(r'\d+\s\breviews\b', htmlStr)
pprint(r)
Output:
['251 reviews',
 '1209 reviews',
 '295 reviews',
 '389 reviews',
 '143 reviews',
 '351 reviews',
 '364 reviews',
 '394 reviews',
 '598 reviews',
 '341 reviews',
 '214 reviews']
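
To get from separate matches to the list of dictionaries asked for in the first post, one option is a single pattern with named groups applied with finditer. This is only a sketch: the sample markup, the restaurant names, the counts, and the review-count class are all made up for illustration, and the pattern has to be adapted to the real page source.

```python
import re
from pprint import pprint

# Hypothetical markup for two results; names, counts, and the
# "review-count" class are assumptions, not taken from the real page.
sample_html = """
<li class="regular-search-result">
  <a class="biz-name js-analytics-click" href="/biz/a"><span>Taqueria Uno</span></a>
  <span class="review-count">12 reviews</span>
</li>
<li class="regular-search-result">
  <a class="biz-name js-analytics-click" href="/biz/b"><span>Taqueria Dos</span></a>
  <span class="review-count">345 reviews</span>
</li>
"""

# Named groups capture the name and the review count of one result.
# re.DOTALL lets the lazy .*? in the middle run across line breaks.
result_pattern = re.compile(
    r'class="biz-name js-analytics-click"[^>]*><span>(?P<name>.*?)</span>'
    r'.*?'
    r'(?P<reviews>\d+)\s+reviews',
    re.DOTALL,
)

# finditer yields one match object per result; build one dictionary from each.
restaurants = [
    {"name": m.group("name"), "reviews": int(m.group("reviews"))}
    for m in result_pattern.finditer(sample_html)
]
pprint(restaurants)
```

Matching name and count in one pattern keeps each pair together; running two separate findall calls and zipping the lists would silently misalign if one result lacked a review count.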
#7
Thanks very much for your help.
Now I know how to use regular expressions in this case.

Very much appreciated.