Python Forum

Full Version: Having issue with regular expressions
For this homework assignment I have been tasked with web scraping using regex (I know it isn't best practice). I believe the rest of my code is correct, and that the issue lies with the regular expression itself or with how I have used it in the code.

import re
from urllib.request import urlopen

def web_scrape():
    html = urlopen("https://www.amazon.com.au/gp/bestsellers/videogames/")
    html_contents = html.read().decode()
    # Raw strings keep the backslashes in the patterns intact
    htmlProducts = re.compile(r'<span\s(?:class=.p13n-sc-truncated.)>(.*)<\/span\>')
    htmlPrices = re.compile(r'<span\s(?:class=.p13n-sc-price.)>(.*)<\/span\>')
    products = re.findall(htmlProducts, html_contents)
    prices = re.findall(htmlPrices, html_contents)
    print(prices)
    print(products)
I tried the regular expressions in a provided regex tester (without the '' and () around them) and they worked, but for some reason I am having issues actually using them in my program.

Any help would be wonderful. Thank you.

And just to add: it isn't raising any errors. When I print the results, I only get [].
The p13n-sc-truncated class is on div tags, not span tags, so the list of products comes back empty, even if you change the pattern from span. For the prices, the following works:
import re
import urllib.request as u

def web_scrape():
    html = u.urlopen("https://www.amazon.com.au/gp/bestsellers/videogames/")
    html_contents = html.read().decode()
    # [^>]* skips the rest of the opening tag; .*? stops at the first </span>
    htmlPrices = re.compile(r'sc-price[^>]*>(.*?)</span>')
    prices = re.findall(htmlPrices, html_contents)
    print(prices)
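Don't forget the ? in .*? or you will end up with everything! Without it the match is greedy and runs on to the last </span> on the line instead of stopping at the first one.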
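To make the greedy versus non-greedy point concrete, here is a small sketch that runs both patterns against a made-up string. The sample HTML and the prices in it are assumptions for illustration only, not what the Amazon page actually returns.

```python
import re

# Hypothetical snippet standing in for one line of the downloaded page.
sample = (
    '<span class="p13n-sc-price">$10.00</span>'
    '<span class="p13n-sc-price">$25.50</span>'
)

# Greedy: (.*) grabs as much as it can, then backs up to the LAST </span>.
greedy = re.compile(r'sc-price[^>]*>(.*)</span>')
# Non-greedy: (.*?) grabs as little as it can, stopping at the FIRST </span>.
lazy = re.compile(r'sc-price[^>]*>(.*?)</span>')

greedy_matches = greedy.findall(sample)
lazy_matches = lazy.findall(sample)

print(greedy_matches)  # one long match that swallows the markup in between
print(lazy_matches)    # ['$10.00', '$25.50']
```

The greedy pattern returns a single match that contains both prices plus the tags between them, which is the "everything" mentioned above.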
If you like this then please like the findall video at https://youtu.be/YCSPhLOMPFM
(May-20-2020, 10:39 AM)drcodie Wrote: [ -> ]truncated is in div tags therefore the list of products is empty even if you change it from span. For price the following works:

Please use proper code tags while posting.
And also, preferably don't post answers; this is the homework section.
(May-20-2020, 10:43 AM)pyzyx3qwerty Wrote: [ -> ]And also, preferably don't post answers; this is the homework section.

It's fine in this instance

Quote:You are free to post in any situation:
  • if the author requests improvements on their code in which you are not spoiling anything for them
You should avoid using regex to parse HTML.
Use an HTML parser for this task.
BeautifulSoup is a good candidate.

https://blog.codinghorror.com/parsing-ht...hulhu-way/
https://stackoverflow.com/questions/6751...tion-in-la
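As a rough illustration of the parser approach, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup, purely so it runs without installing anything. The ClassTextCollector class, the texts_with_class helper, and the sample HTML are all hypothetical; only the two class names come from this thread. With BeautifulSoup installed, soup.find_all(class_="p13n-sc-price") would be the shorter route.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element that carries a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.buffer = []    # text pieces of the element being collected
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested tag inside a match
        elif self.wanted_class in classes:
            self.depth = 1              # entering a matching element
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:             # matching element just closed
            self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def texts_with_class(html, wanted_class):
    parser = ClassTextCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up HTML for illustration; the real page will look different.
sample = """
<div class="p13n-sc-truncated">Example Game One</div>
<span class="p13n-sc-price">$10.00</span>
<div class="p13n-sc-truncated">Example Game Two</div>
<span class="p13n-sc-price">$25.50</span>
"""

products = texts_with_class(sample, "p13n-sc-truncated")
prices = texts_with_class(sample, "p13n-sc-price")
print(list(zip(products, prices)))
```

Because the parser looks at the class attribute rather than the tag name, it does not matter whether the product title sits in a div or a span.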
Apologies, and thank you, pyzyx3qwerty.