Python Forum
Dealing with duplicates to an Excel sheet
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Dealing with duplicates to an Excel sheet
#1
Alright, so, first time ever taking a programming course and I am just struggling with wrapping my head around Python. The following is Python 2.7.

import urllib
import win32com.client

def parse(searchstring,leader, trailer): # function to parse information between a leading string and a trailing string
        try:
            end_of_leader = searchstring.index(leader) + len(leader)
            start_of_trailer = searchstring.index(trailer, end_of_leader)
            return searchstring[end_of_leader:start_of_trailer]
        except ValueError:          
            return ""

def returnpages(s): # function to return the number of pages 
        try:
            returnpages = parse(s,'yp_searchTotalPages" : "','",')
            return int(returnpages)
        except:
            return 0



# Download the business entries 
page = 1 
fileobj = urllib.urlopen("http://canada411.yellowpages.ca/search/si/1/Cineplex/Canada") 

s = fileobj.read()              # Read the first page - storing the page's contents in a string called 's'.
resultlist = list()             # List that will be used to store our results

pages = returnpages(s)

for page in xrange(1,6):

    fileobj = urllib.urlopen("https://www.yellowpages.ca/search/si/"  + str(page) + "/Cineplex/Canada") 
    s = fileobj.read()
    
    addresssplit = s.split('<div class="listing__address address mainLocal noNum">') # split the page up into a list based on this delimiter
    
    for item in addresssplit:
        address = parse(item,'<span class="jsMapBubbleAddress"  itemprop="streetAddress" >','</span>') # extract address
        city = parse(item,'<span class="jsMapBubbleAddress"  itemprop="addressLocality" >','</span>')
        province = parse(item,'<span class="jsMapBubbleAddress"  itemprop="addressRegion" >','</span>')
        postal = parse(item,'<span class="jsMapBubbleAddress"  itemprop="postalCode" >','</span>')
        phone = parse(item,'data-phone="','">')
        
        if len(address) > 0:
            resultlist.append({'address':address,'city':city, 'province':province,'postal':postal})
            print address,city,province,postal,phone # take this out when you're finished your script - this is just for testing

    # Switch to pages 2,3,4,5 and so on based on variable page
    fileobj = urllib.urlopen("https://www.yellowpages.ca/search/si/" + str(page) + "/Cineplex/Canada")
    s = fileobj.read()              # Read the next page - storing the page's contents in a string called 's'.


row = 1



excelobj = win32com.client.Dispatch("Excel.Application")
excelobj.Visible = 1                                                            # make excel visible
excelobj.Workbooks.Add()                                                        # add a workbook
sh = excelobj.ActiveSheet                                                       # sh variable holds the active sheet
sh.Cells(1,1).Value = 'Address'                                           # excel headers
sh.Cells(1,2).Value = 'City'                                                 # excel headers
sh.Cells(1,3).Value = 'Province'                                                   # excel headers

for business in resultlist:                                                     # go through each dictionary item in the list resultlist
    sh.Cells(row,1).Value = business["address"]                            # put address from the dictionary in the 1st column of excel
    sh.Cells(row,2).Value = business["city"]                                 # put city from the dictionary in the 2nd column of excel    
    sh.Cells(row,3).Value = business["province"]                                   # put phone number from the dictionary in the 3rd column of excel   
    row +=1                                                                     # after all the information is input, switch to the next row

print str(len(resultlist)) + ' rows written to excel.'
When I run the program, it looks up all the addresses and puts them where they need to be in the Excel file, but I don't know how to deal with duplicates. I tried googling around and using different codes people suggested to others, but nothing works, so clearly I just do not know where to even begin.

So, for example, when the program takes the address and puts them into an Excel sheet two of the addresses are "Main-1303 Yonge St". I'm trying to figure out how to get the code to remove a duplication like this.

I also do not know how to replace accented characters with their closest equivalent, so that French does not appear weird in the list. Bit of a frustrated student here that isn't understanding how to even grasp the basics.
Reply
#2
FYI: Python 2.7 is obsolete and reached end of life Jan 1, 2020.
Reply
#3
(Oct-19-2020, 10:42 PM)Larz60+ Wrote: FYI: Python 2.7 is obsolete and reached end of life Jan 1, 2020.

I am aware, thank you, but I am in GIS and according to my professor Python 2.7. is still the most compatible Python version with ArcMap. Which means I have to use 2.7.

Still looking for help with the original question. How do I go about preventing duplicate list items? I tried using:

resultlist.remove('Duplicate Element')
resultlist.pop('Duplicate Element')
resultlist.remove(id,'Duplicate Element')
But they are for some reason not working at all or removing half of the entire list.
Reply
#4
I use openstreetmap with python 3 everyday (but not arcgis).

Found this about ArcGIS, you should show this to your prof:
https://pro.arcgis.com/en/pro-app/arcpy/...is-pro.htm
Reply
#5
Still really need help with the original question. Anyone?
Reply
#6
You can remove duplicates in a list by using the function unique_everseen() that is described in the itertools module's documentation. It is also available in the third party module more_itertools.

To replace french accented letters by their nearest equivalent, I would use the unidecode module available on pypi. It is strange however to remove french accents in french words while Excel handles unicode data very well as all modern programs do.
Reply
#7
Quote:So, for example, when the program takes the address and puts them into an Excel sheet two of the addresses are "Main-1303 Yonge St". I'm trying to figure out how to get the code to remove a duplication like this.

What about using set data type for storing addresses? A set doesn't hold duplicate items.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  help with adding duplicates elements together in a list 2ECC3O 5 1,971 Sep-10-2022, 07:11 AM
Last Post: 2ECC3O
  Dealing with list coja56 2 2,758 Sep-18-2020, 09:32 PM
Last Post: perfringo
  App for excel sheet analysis awookado 6 2,737 Jan-27-2020, 09:02 AM
Last Post: awookado
  Help me with yahoo finance excel sheet Learner4life 2 2,243 Nov-27-2019, 12:55 PM
Last Post: Learner4life
  duplicates nsx200 3 2,353 Nov-12-2019, 08:55 AM
Last Post: nsx200
  Exercise list remove duplicates RavCOder 9 5,200 Oct-23-2019, 04:16 PM
Last Post: jefsummers
  Dealing with errors in input souprqtpie 7 3,384 Apr-10-2019, 11:04 AM
Last Post: souprqtpie
  how to take out duplicates from a list? helpme 2 2,465 Mar-08-2019, 02:27 PM
Last Post: ichabod801
  How to keep duplicates and remove all other items in list? student8 1 4,914 Oct-28-2017, 05:52 AM
Last Post: heiner55

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020