Excel Question
#1
Hi all, I have a very specific Excel task I need to complete, and was hoping to get some advice.

I have a single Excel worksheet with roughly 30,000 trivia questions in it. However, it currently contains questions that are either very similar or straight-up duplicates. I need to delete all the duplicates as well as questions that are "too similar" to other questions.

Where it gets rough is with the "too similar" questions. For instance, we have a lot of questions of the form "If the model is a [car-model], what is the make?". These are totally different questions, but they share 90%+ of their wording. A simple percentage-based comparison algorithm would flag them as duplicates and delete usable questions.

---

So here is the solution I'm currently toying with:

First, run a pass to delete 100% matches: load the first question, compare it with every subsequent question, and delete the exact matches; then load the second question, compare it with every subsequent question, and so on until all 100% matches have been dealt with.

Then I would make a simple GUI application where I could select a similarity percentage, get a list of all questions with matches above that threshold, and decide what is and isn't a duplicate on a case-by-case basis.

---

So that's where I am. I'm a total Python noob, though, and I'm not sure whether there are any packages or anything that could help with this kind of task. Any advice is appreciated. Thanks!
#2
Quote:First, run a pass to delete 100% matches: load the first question, compare it with every subsequent question, and delete the exact matches; then load the second question, compare it with every subsequent question, and so on until all 100% matches have been dealt with.

An easy way of removing duplicates is to convert them into a set. A set, as in mathematics, cannot contain duplicate elements, so the conversion drops all exact duplicates. For instance:

>>> set(["hello", "world", "hello", "world"])
{'hello', 'world'}
Makes sense?
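
For the "too similar" questions, the standard library's difflib module can compute a similarity ratio between two strings, which maps directly onto the percentage threshold you described. A rough sketch (the sample questions are made up, and note that comparing every pair of 30,000 questions this way will take a while):
from difflib import SequenceMatcher

def similarity(a, b):
    # ratio() returns a float between 0.0 (no overlap) and 1.0 (identical)
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

q1 = "If the model is a Mustang, what is the make?"
q2 = "If the model is a Corvette, what is the make?"

# prints a high ratio even though these are two distinct questions,
# which is exactly the false-positive case you described
print(similarity(q1, q2))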
#3
(Jan-05-2018, 06:04 PM)hshivaraj Wrote:
Quote:First, run a pass to delete 100% matches: load the first question, compare it with every subsequent question, and delete the exact matches; then load the second question, compare it with every subsequent question, and so on until all 100% matches have been dealt with.

An easy way of removing duplicates is to convert them into a set. A set, as in mathematics, cannot contain duplicate elements, so the conversion drops all exact duplicates. For instance:

>>> set(["hello", "world", "hello", "world"])
{'hello', 'world'}
Makes sense?

You are a genius! Yes, that makes total sense. Thanks for the tip!
#4
I had a question: are you using Python to build this Excel file, or are you working in Excel itself? Because if you convert it to a Google spreadsheet, you can script it to take out the duplicates and similar questions (I think, though I'm not quite sure how).
#5
(Jan-05-2018, 06:24 PM)Povellesto Wrote: I had a question: are you using Python to build this Excel file, or are you working in Excel itself? Because if you convert it to a Google spreadsheet, you can script it to take out the duplicates and similar questions (I think, though I'm not quite sure how).

It's a little complicated. I had a developer who created a CMS for my pub trivia business; it's built on Electron and hosted on Azure. The developer dropped out a few months ago, and now the programs he created won't work anymore.

I was able to dig into the Azure data and find all of the SQL info (questions, timestamps, etc.). I copied that info and pasted it into an Excel file. Excel is what we used in the past, before the custom CMS. I suppose there's no reason we can't go with another spreadsheet program, but my colleagues in TN aren't at all tech-savvy, so I'd prefer to stick with formats and programs with which they are already familiar.
#6
If you are using Python, the simplest storage until you have found the duplicates is a pickle file containing a list of tuples (or namedtuples). Once the duplicates are found, you can write a new Excel file.

The reason for this is that reading Excel files from Python is slow, while loading such a pickle file with 30,000 records takes a fraction of a second.

That said, without more information about the contents of the data, it is difficult to work out a good strategy.
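
For the Excel side of things, the openpyxl package can read the worksheet into such a list of tuples and later write the cleaned rows back out. A sketch, assuming the file is named questions.xlsx (substitute your real file name and sheet):
from openpyxl import Workbook, load_workbook

# read_only mode is much faster for big sheets
wb = load_workbook('questions.xlsx', read_only=True)
ws = wb.active
rows = list(ws.iter_rows(values_only=True))  # a list of plain tuples
wb.close()

# ... drop the duplicate rows here ...

# write the surviving rows to a new workbook
out = Workbook()
for row in rows:
    out.active.append(row)
out.save('questions_deduped.xlsx')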
#7
(Jan-05-2018, 07:41 PM)Gribouillis Wrote: If you are using Python, the simplest storage until you have found the duplicates is a pickle file containing a list of tuples (or namedtuples). Once the duplicates are found, you can write a new Excel file.

The reason for this is that reading Excel files from Python is slow, while loading such a pickle file with 30,000 records takes a fraction of a second.

That said, without more information about the contents of the data, it is difficult to work out a good strategy.

Oh wow, ok, I'll look into that. I'm not familiar with pickle files, but I'm sure a quick Google will tell me everything I need to know. Thanks!
#8
Here is a basic example of writing and reading such a pickle file. Use it once with WRITE = True to create the file foo.pkl, then run it as many times as you need with WRITE = False to use the data.
from collections import namedtuple
import pickle

# one record per question; replace the fields with your real column names
Record = namedtuple('Record', "id foo bar baz")

WRITE = False  # set to True once to create foo.pkl


if WRITE:
    # build 30000 dummy records and dump them to disk
    L = [Record(i, "a" * 100, "b" * 50, "c" * 50) for i in range(30000)]
    print('Writing')
    with open('foo.pkl', 'wb') as ofh:
        pickle.dump(L, ofh)
else:
    # load the whole list back; this takes a fraction of a second
    print('Reading')
    with open('foo.pkl', 'rb') as ifh:
        L = pickle.load(ifh)
print(len(L))
print(L[17865])  # any record is immediately accessible
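
Once the list is loaded, your 100% pass is only a few lines. A sketch using this example's field names (your real records would use the actual question column); it relies on dicts preserving insertion order, which holds from Python 3.7 on:
# keep the first occurrence of each question, in the original order
seen = {}
for rec in L:
    key = rec.foo.strip().lower()  # normalize so trivial variations match
    if key not in seen:
        seen[key] = rec
unique = list(seen.values())
print(len(unique))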
#9
(Jan-05-2018, 09:27 PM)Gribouillis Wrote: Here is a basic example of writing and reading such a pickle file. Use it once with WRITE = True to create the file foo.pkl, then run it as many times as you need with WRITE = False to use the data.
from collections import namedtuple
import pickle

# one record per question; replace the fields with your real column names
Record = namedtuple('Record', "id foo bar baz")

WRITE = False  # set to True once to create foo.pkl


if WRITE:
    # build 30000 dummy records and dump them to disk
    L = [Record(i, "a" * 100, "b" * 50, "c" * 50) for i in range(30000)]
    print('Writing')
    with open('foo.pkl', 'wb') as ofh:
        pickle.dump(L, ofh)
else:
    # load the whole list back; this takes a fraction of a second
    print('Reading')
    with open('foo.pkl', 'rb') as ifh:
        L = pickle.load(ifh)
print(len(L))
print(L[17865])  # any record is immediately accessible

Wow, thanks for this! I'm about to start working on it and this will be a huge help. This will be my first functional Python program, so I'm sure I'll be posting a billion more questions over the next few hours...

P.S. I'm totally new on these forums. Is there a good way to reply or tag someone without quoting their entire message?
#10
I don't know, I'm new too! It seems that you don't need to quote the message.

