Oct-19-2020, 07:40 PM
(This post was last modified: Oct-19-2020, 07:40 PM by DistraughtMuffin.)
Alright, so, first time ever taking a programming course and I am just struggling with wrapping my head around Python. The following is Python 2.7.
So, for example, when the program takes the address and puts them into an Excel sheet two of the addresses are "Main-1303 Yonge St". I'm trying to figure out how to get the code to remove a duplication like this.
I also do not know how to replace accented characters with their closest equivalent, so that French does not appear weird in the list. Bit of a frustrated student here that isn't understanding how to even grasp the basics.
import urllib import win32com.client def parse(searchstring,leader, trailer): # function to parse information between a leading string and a trailing string try: end_of_leader = searchstring.index(leader) + len(leader) start_of_trailer = searchstring.index(trailer, end_of_leader) return searchstring[end_of_leader:start_of_trailer] except ValueError: return "" def returnpages(s): # function to return the number of pages try: returnpages = parse(s,'yp_searchTotalPages" : "','",') return int(returnpages) except: return 0 # Download the business entries page = 1 fileobj = urllib.urlopen("http://canada411.yellowpages.ca/search/si/1/Cineplex/Canada") s = fileobj.read() # Read the first page - storing the page's contents in a string called 's'. resultlist = list() # List that will be used to store our results pages = returnpages(s) for page in xrange(1,6): fileobj = urllib.urlopen("https://www.yellowpages.ca/search/si/" + str(page) + "/Cineplex/Canada") s = fileobj.read() addresssplit = s.split('<div class="listing__address address mainLocal noNum">') # split the page up into a list based on this delimiter for item in addresssplit: address = parse(item,'<span class="jsMapBubbleAddress" itemprop="streetAddress" >','</span>') # extract address city = parse(item,'<span class="jsMapBubbleAddress" itemprop="addressLocality" >','</span>') province = parse(item,'<span class="jsMapBubbleAddress" itemprop="addressRegion" >','</span>') postal = parse(item,'<span class="jsMapBubbleAddress" itemprop="postalCode" >','</span>') phone = parse(item,'data-phone="','">') if len(address) > 0: resultlist.append({'address':address,'city':city, 'province':province,'postal':postal}) print address,city,province,postal,phone # take this out when you're finished your script - this is just for testing # Switch to pages 2,3,4,5 and so on based on variable page fileobj = urllib.urlopen("https://www.yellowpages.ca/search/si/" + str(page) + "/Cineplex/Canada") s = fileobj.read() # Read the next page - storing the page's contents in a string called 's'. row = 1 excelobj = win32com.client.Dispatch("Excel.Application") excelobj.Visible = 1 # make excel visible excelobj.Workbooks.Add() # add a workbook sh = excelobj.ActiveSheet # sh variable holds the active sheet sh.Cells(1,1).Value = 'Address' # excel headers sh.Cells(1,2).Value = 'City' # excel headers sh.Cells(1,3).Value = 'Province' # excel headers for business in resultlist: # go through each dictionary item in the list resultlist sh.Cells(row,1).Value = business["address"] # put address from the dictionary in the 1st column of excel sh.Cells(row,2).Value = business["city"] # put city from the dictionary in the 2nd column of excel sh.Cells(row,3).Value = business["province"] # put phone number from the dictionary in the 3rd column of excel row +=1 # after all the information is input, switch to the next row print str(len(resultlist)) + ' rows written to excel.'When I run the program, it looks up all the addresses and puts them where they need to be in the Excel file, but I don't know how to deal with duplicates. I tried googling around and using different codes people suggested to others, but nothing works, so clearly I just do not know where to even begin.
So, for example, when the program takes the address and puts them into an Excel sheet two of the addresses are "Main-1303 Yonge St". I'm trying to figure out how to get the code to remove a duplication like this.
I also do not know how to replace accented characters with their closest equivalent, so that French does not appear weird in the list. Bit of a frustrated student here that isn't understanding how to even grasp the basics.