Having issue with regular expressions

azulu · (This post was last modified: May-19-2020, 04:48 AM by azulu.)

For this homework assignment I have been tasked with web scraping using regex (I know it isn't best practice). I believe the rest of my code is correct and that the issue lies with the regular expression itself or with how I have used it in the code.

import re
from urllib.request import urlopen

def web_scrape():
    # Download the bestsellers page and decode the bytes to text
    html = urlopen("https://www.amazon.com.au/gp/bestsellers/videogames/")
    html_contents = html.read().decode()
    # Raw strings pass the backslashes through to the regex engine unchanged
    htmlProducts = re.compile(r'<span\s(?:class=.p13n-sc-truncated.)>(.*)<\/span\>')
    htmlPrices = re.compile(r'<span\s(?:class=.p13n-sc-price.)>(.*)<\/span\>')
    products = re.findall(htmlProducts, html_contents)
    prices = re.findall(htmlPrices, html_contents)
    print(prices)
    print(products)

I tried the regular expressions in a provided regex tester (without the '' and () around them) and they worked, but for some reason I am having issues actually using them in my program.

Any help would be wonderful. Thank you.

Just to add: it's not raising any errors. When I print the results, both lists are simply empty ([]).

drcodie · (This post was last modified: May-20-2020, 11:55 AM by Yoriz.)

The p13n-sc-truncated class is on div tags, not span tags, so the list of products comes back empty, even if you change the pattern from span. For the prices, the following works:

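import re
import urllib.request as u

def web_scrape():
    html = u.urlopen("https://www.amazon.com.au/gp/bestsellers/videogames/")
    html_contents = html.read().decode()
    # Match from the sc-price class to the end of its opening tag,
    # then lazily capture the text up to the closing </span>
    htmlPrices = re.compile(r'sc-price[^>]*>(.*?)</span>')
    prices = re.findall(htmlPrices, html_contents)
    print(prices)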

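Don't forget the ? in .*? (it makes the match non-greedy). With a plain .* a single match runs on to the last </span> on the line and you will end up with everything!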
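
To make the difference concrete, here is a minimal sketch comparing the greedy and non-greedy versions of the price pattern. The HTML fragment is made up for illustration and is an assumption, not real Amazon markup; only the class name p13n-sc-price comes from this thread.

```python
import re

# Hypothetical HTML fragment, invented for this example (not real page markup).
sample = (
    '<span class="p13n-sc-price">$69.00</span>'
    '<span class="p13n-sc-price">$24.95</span>'
)

# Greedy: .* grabs as much as it can, so it stops at the LAST </span>.
greedy = re.compile(r'sc-price[^>]*>(.*)</span>')
# Non-greedy: .*? grabs as little as it can, so it stops at the FIRST </span>.
lazy = re.compile(r'sc-price[^>]*>(.*?)</span>')

greedy_matches = greedy.findall(sample)
lazy_matches = lazy.findall(sample)

print(greedy_matches)  # one long match that swallows both spans
print(lazy_matches)    # ['$69.00', '$24.95']
```

The greedy pattern returns a single string that contains the first price, the markup in between, and the second price, while the non-greedy one returns each price separately.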
If you like this then please like the findall video at https://youtu.be/YCSPhLOMPFM

pyzyx3qwerty · (This post was last modified: May-20-2020, 10:43 AM by pyzyx3qwerty.)

(May-20-2020, 10:39 AM)drcodie Wrote: truncated is in div tags therefore the list of products is empty even if you change it from span. For price the following works:
def web_scrape ():
    html = u.urlopen("https://www.amazon.com.au/gp/bestsellers/videogames/")
    html_contents = html.read().decode()
    htmlPrices = re.compile('sc-price[^\>]*\>(.*?)\<\/span\>')
    prices = re.findall(htmlPrices, html_contents)
    print(prices)

Don't forget the .*? or you will end up with everything!
If you like this then please like the findall video at https://youtu.be/YCSPhLOMPFM

Please use proper code tags when posting.
Also, preferably don't post answers; this is the homework section.

Yoriz · May-20-2020, 12:00 PM

(May-20-2020, 10:43 AM)pyzyx3qwerty Wrote: And also preferably don't post answers; this is homework section

It's fine in this instance

Quote: You are free to post in any situation:
if the author requests improvements on their code in which you are not spoiling anything for them

DeaD_EyE · May-20-2020, 12:51 PM

You should avoid using regex to parse HTML.
Use an HTML parser for this task.
BeautifulSoup is a good candidate.
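
As a rough sketch of the parser approach, the example below uses Python's built-in html.parser module instead of BeautifulSoup, purely so that it runs without installing anything; BeautifulSoup offers a higher-level API for the same job. The PriceParser class and the sample markup are hypothetical and made up for illustration; only the class names come from this thread.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of <span> tags whose class list contains a target class.

    Hypothetical helper for illustration. It assumes the target spans do not
    contain nested <span> tags.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.in_target = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and self.target_class in classes:
            self.in_target = True
            self.prices.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.prices[-1] += data


# Made-up markup (an assumption, not the real page): extra classes and
# line breaks inside the tag are handled without changing the code.
sample = """
<div class="p13n-sc-truncated">Some Game</div>
<span class="a-size-base p13n-sc-price">$69.00</span>
<span class="p13n-sc-price">
  $24.95
</span>
"""

parser = PriceParser("p13n-sc-price")
parser.feed(sample)
prices = [p.strip() for p in parser.prices]
print(prices)  # ['$69.00', '$24.95']
```

Unlike a regex, the parser does not care about attribute order, additional classes, or whitespace and newlines inside the element.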

https://blog.codinghorror.com/parsing-ht...hulhu-way/
https://stackoverflow.com/questions/6751...tion-in-la

drcodie · May-21-2020, 08:15 AM

apologies, thank you pyzyx3qwerty
