Python Forum
Having issue with regular expressions
#1
For this homework assignment I have been tasked with web scraping using regex (I know it isn't best practice). I believe the rest of my code is correct, and that the issue lies in the regular expression itself or in how I have used it in the code.

import re
from urllib.request import urlopen

def web_scrape():
    html = urlopen("https://www.amazon.com.au/gp/bestsellers/videogames/")
    html_contents = html.read().decode()
    # Raw strings (r'...') keep the regex backslashes intact
    htmlProducts = re.compile(r'<span\s(?:class=.p13n-sc-truncated.)>(.*)</span>')
    htmlPrices = re.compile(r'<span\s(?:class=.p13n-sc-price.)>(.*)</span>')
    products = re.findall(htmlProducts, html_contents)
    prices = re.findall(htmlPrices, html_contents)
    print(prices)
    print(products)
I tried the regular expressions in a provided regex tester (without the quotes and parentheses around them) and they worked, but for some reason I am having issues using them in my program.

Any help would be wonderful. Thank you.

Just to add: it isn't raising any errors; when I print, it only returns [].
#2
The p13n-sc-truncated class is on div tags, not span tags, so the list of products is empty, even if you change it from span. For the prices, the following works:
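import re
import urllib.request as u  # assumed alias; the original post used u without showing the import

def web_scrape():
    html = u.urlopen("https://www.amazon.com.au/gp/bestsellers/videogames/")
    html_contents = html.read().decode()
    # Non-greedy (.*?) stops at the first closing </span> after each price
    htmlPrices = re.compile(r'sc-price[^>]*>(.*?)</span>')
    prices = re.findall(htmlPrices, html_contents)
    print(prices)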
Don't forget the ? in .*? (it makes the match non-greedy), or you will end up with everything!
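To see why the ? matters, here is a minimal sketch comparing the greedy and non-greedy forms of the price pattern. The sample markup is made up for illustration and is not the real Amazon page.

```python
import re

# Hypothetical sample markup (an assumption, not the real page source).
sample = (
    '<span class="p13n-sc-price">$10.00</span>'
    '<span class="p13n-sc-price">$25.50</span>'
)

# Greedy: .* runs to the LAST </span> in the text.
greedy = re.compile(r'sc-price[^>]*>(.*)</span>')
# Non-greedy: .*? stops at the FIRST </span> after each match start.
lazy = re.compile(r'sc-price[^>]*>(.*?)</span>')

greedy_matches = greedy.findall(sample)
lazy_matches = lazy.findall(sample)

print(greedy_matches)  # one long match that swallows both spans
print(lazy_matches)    # ['$10.00', '$25.50']
```

The greedy version returns a single item containing everything between the first opening tag and the final closing tag, which is the "everything" mentioned above.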
If you like this then please like the findall video at https://youtu.be/YCSPhLOMPFM
#3
(May-20-2020, 10:39 AM)drcodie Wrote: truncated is in div tags therefore the list of products is empty even if you change it from span. For price the following works:
def web_scrape ():
    html = u.urlopen("https://www.amazon.com.au/gp/bestsellers/videogames/")
    html_contents = html.read().decode()
    htmlPrices = re.compile('sc-price[^\>]*\>(.*?)\<\/span\>')
    prices = re.findall(htmlPrices, html_contents)
    print(prices)

Don't forget the .*? or you will end up with everything!
If you like this then please like the findall video at https://youtu.be/YCSPhLOMPFM

Please use proper code tags when posting.
Also, preferably don't post answers; this is the homework section.
pyzyx3qwerty
#4
(May-20-2020, 10:43 AM)pyzyx3qwerty Wrote: And also preferably don't post answers; this is homework section

It's fine in this instance

Quote:You are free to post in any situation:
  • if the author requests improvements on their code in which you are not spoiling anything for them
#5
You should avoid using regex to parse HTML.
Use an HTML parser for this task.
BeautifulSoup is a good candidate.
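As a rough illustration of the parser approach, here is a sketch that uses the standard library's html.parser instead of BeautifulSoup, so it runs without installing anything. The PriceParser class and the sample markup are hypothetical; only the p13n-sc-price class name comes from this thread.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text inside elements whose class list includes target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.tag = None      # tag name of the element currently being captured
        self.depth = 0       # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth == 0 and self.target_class in classes:
            self.tag = tag
            self.depth = 1
            self.results.append("")
        elif self.depth and tag == self.tag:
            self.depth += 1  # nested element with the same tag name

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Hypothetical sample markup (an assumption, not the real page source).
sample = (
    '<div><span class="p13n-sc-price">$10.00</span>'
    '<span class="other">x</span>'
    '<span class="p13n-sc-price">$25.50</span></div>'
)

parser = PriceParser("p13n-sc-price")
parser.feed(sample)
prices = parser.results
print(prices)  # ['$10.00', '$25.50']
```

Because the parser tracks tags and attributes itself, it does not depend on attribute order, quoting style, or greedy versus non-greedy matching the way the regex does.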

https://blog.codinghorror.com/parsing-ht...hulhu-way/
https://stackoverflow.com/questions/6751...tion-in-la
#6
Apologies, and thank you, pyzyx3qwerty.