Dealing with duplicates to an Excel sheet - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: Homework (https://python-forum.io/forum-9.html) +--- Thread: Dealing with duplicates to an Excel sheet (/thread-30398.html) |
Dealing with duplicates to an Excel sheet - DistraughtMuffin - Oct-19-2020 Alright, so, first time ever taking a programming course and I am just struggling with wrapping my head around Python. The following is Python 2.7. import urllib import win32com.client def parse(searchstring,leader, trailer): # function to parse information between a leading string and a trailing string try: end_of_leader = searchstring.index(leader) + len(leader) start_of_trailer = searchstring.index(trailer, end_of_leader) return searchstring[end_of_leader:start_of_trailer] except ValueError: return "" def returnpages(s): # function to return the number of pages try: returnpages = parse(s,'yp_searchTotalPages" : "','",') return int(returnpages) except: return 0 # Download the business entries page = 1 fileobj = urllib.urlopen("http://canada411.yellowpages.ca/search/si/1/Cineplex/Canada") s = fileobj.read() # Read the first page - storing the page's contents in a string called 's'. resultlist = list() # List that will be used to store our results pages = returnpages(s) for page in xrange(1,6): fileobj = urllib.urlopen("https://www.yellowpages.ca/search/si/" + str(page) + "/Cineplex/Canada") s = fileobj.read() addresssplit = s.split('<div class="listing__address address mainLocal noNum">') # split the page up into a list based on this delimiter for item in addresssplit: address = parse(item,'<span class="jsMapBubbleAddress" itemprop="streetAddress" >','</span>') # extract address city = parse(item,'<span class="jsMapBubbleAddress" itemprop="addressLocality" >','</span>') province = parse(item,'<span class="jsMapBubbleAddress" itemprop="addressRegion" >','</span>') postal = parse(item,'<span class="jsMapBubbleAddress" itemprop="postalCode" >','</span>') phone = parse(item,'data-phone="','">') if len(address) > 0: resultlist.append({'address':address,'city':city, 'province':province,'postal':postal}) print address,city,province,postal,phone # take this out when you're finished your script - this is just for testing # Switch to pages 2,3,4,5 and so on based on variable page fileobj = urllib.urlopen("https://www.yellowpages.ca/search/si/" + str(page) + "/Cineplex/Canada") s = fileobj.read() # Read the next page - storing the page's contents in a string called 's'. row = 1 excelobj = win32com.client.Dispatch("Excel.Application") excelobj.Visible = 1 # make excel visible excelobj.Workbooks.Add() # add a workbook sh = excelobj.ActiveSheet # sh variable holds the active sheet sh.Cells(1,1).Value = 'Address' # excel headers sh.Cells(1,2).Value = 'City' # excel headers sh.Cells(1,3).Value = 'Province' # excel headers for business in resultlist: # go through each dictionary item in the list resultlist sh.Cells(row,1).Value = business["address"] # put address from the dictionary in the 1st column of excel sh.Cells(row,2).Value = business["city"] # put city from the dictionary in the 2nd column of excel sh.Cells(row,3).Value = business["province"] # put phone number from the dictionary in the 3rd column of excel row +=1 # after all the information is input, switch to the next row print str(len(resultlist)) + ' rows written to excel.'When I run the program, it looks up all the addresses and puts them where they need to be in the Excel file, but I don't know how to deal with duplicates. I tried googling around and using different codes people suggested to others, but nothing works, so clearly I just do not know where to even begin. So, for example, when the program takes the address and puts them into an Excel sheet two of the addresses are "Main-1303 Yonge St". I'm trying to figure out how to get the code to remove a duplication like this. I also do not know how to replace accented characters with their closest equivalent, so that French does not appear weird in the list. Bit of a frustrated student here that isn't understanding how to even grasp the basics. RE: Dealing with duplicates to an Excel sheet - Larz60+ - Oct-19-2020 FYI: Python 2.7 is obsolete and reached end of life Jan 1, 2020. RE: Dealing with duplicates to an Excel sheet - DistraughtMuffin - Oct-19-2020 (Oct-19-2020, 10:42 PM)Larz60+ Wrote: FYI: Python 2.7 is obsolete and reached end of life Jan 1, 2020. I am aware, thank you, but I am in GIS and according to my professor Python 2.7. is still the most compatible Python version with ArcMap. Which means I have to use 2.7. Still looking for help with the original question. How do I go about preventing duplicate list items? I tried using: resultlist.remove('Duplicate Element') resultlist.pop('Duplicate Element') resultlist.remove(id,'Duplicate Element')But they are for some reason not working at all or removing half of the entire list. RE: Dealing with duplicates to an Excel sheet - Larz60+ - Oct-20-2020 I use openstreetmap with python 3 everyday (but not arcgis). Found this about ArcGIS, you should show this to your prof: https://pro.arcgis.com/en/pro-app/arcpy/get-started/python-migration-for-arcgis-pro.htm RE: Dealing with duplicates to an Excel sheet - DistraughtMuffin - Oct-20-2020 Still really need help with the original question. Anyone? RE: Dealing with duplicates to an Excel sheet - Gribouillis - Oct-21-2020 You can remove duplicates in a list by using the function unique_everseen() that is described in the itertools module's documentation. It is also available in the third party module more_itertools. To replace french accented letters by their nearest equivalent, I would use the unidecode module available on pypi. It is strange however to remove french accents in french words while Excel handles unicode data very well as all modern programs do. RE: Dealing with duplicates to an Excel sheet - Askic - Oct-28-2020 Quote:So, for example, when the program takes the address and puts them into an Excel sheet two of the addresses are "Main-1303 Yonge St". I'm trying to figure out how to get the code to remove a duplication like this. What about using set data type for storing addresses? A set doesn't hold duplicate items. |