Posts: 16
Threads: 4
Joined: Jan 2018
Hi all, I have a very specific Excel task I need to complete, and was hoping to get some advice.
I have a single Excel worksheet with roughly 30,000 trivia questions in it. However, it currently contains questions that are either very similar or straight-up duplicates. I need to delete all exact duplicates as well as questions that are "too similar" to other questions.
Where it gets rough is with the "too-similar" questions. For instance, we have a lot of "If the model is a [car-model], what is the make?". Basically, questions that are totally different, but contain 90%+ similar wording. A simple percent-based comparison algorithm would flag these as duplicates and delete usable questions.
---
So here is the solution I'm currently toying with:
First, run a pass to delete 100% matches: load the first question, compare it with every subsequent question, and delete 100% matches; then load the 2nd question, compare it with every subsequent question, and so on until all 100% matches have been dealt with.
Then I would make a simple GUI application where I could select a percentage and get a list of all questions that had matches within that percentage, and decide what is and isn't a duplicate on a case-by-case basis.
---
So that's where I am, but I'm a total Python noob and wasn't sure if there were any packages or anything that could help with this kind of task. Any advice is appreciated. Thanks!
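For reference, the two passes described above can be sketched with the standard library's difflib. Everything here is a placeholder: the sample questions, the 0.85 threshold, and the variable names are just illustrations of the idea, not a finished tool.

```python
from difflib import SequenceMatcher

questions = [
    "If the model is a Mustang, what is the make?",
    "If the model is a Mustang, what is the make?",  # exact duplicate
    "If the model is a Civic, what is the make?",    # similar wording, distinct question
    "Which planet is closest to the sun?",
]

# Pass 1: drop exact duplicates while keeping the original order.
seen = set()
unique = []
for q in questions:
    if q not in seen:
        seen.add(q)
        unique.append(q)

# Pass 2: flag near-duplicates above a chosen ratio for manual review
# instead of deleting them automatically.
THRESHOLD = 0.85
flagged = []
for i, a in enumerate(unique):
    for b in unique[i + 1:]:
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= THRESHOLD:
            flagged.append((a, b, round(ratio, 2)))

print(unique)   # 3 questions remain after pass 1
print(flagged)  # the two "car model" questions are flagged, not deleted
```

Note that the two "car model" questions score highly but are only flagged for review, which matches the case-by-case idea; a naive "delete everything above X%" pass would have thrown one of them away.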
Posts: 103
Threads: 15
Joined: Nov 2017
Quote:1st, run a pass to delete 100% matches. Load first question, compare it with every subsequent question, delete 100% matches; load 2nd question, compare it with every subsequent question, and so on until all 100% matches have been dealt with.
An easy way of removing duplicates is to convert the list into a set. Doing so deletes all duplicates, in the mathematical sense of a set. For instance:
>>> set(["hello", "world", "hello", "world"])
{'world', 'hello'}
Makes sense?
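One caveat worth noting: a set does not preserve the original order of the questions. If order matters, a small order-preserving alternative is dict.fromkeys, since dicts keep insertion order in Python 3.7+:

```python
questions = ["hello", "world", "hello", "world"]

# set() removes duplicates but may reorder the items
deduped_set = set(questions)

# dict keys are insertion-ordered (Python 3.7+), so this keeps the
# first occurrence of each question in its original position
deduped_ordered = list(dict.fromkeys(questions))

print(deduped_ordered)  # ['hello', 'world']
```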
Posts: 16
Threads: 4
Joined: Jan 2018
(Jan-05-2018, 06:04 PM)hshivaraj Wrote: Quote:1st, run a pass to delete 100% matches. Load first question, compare it with every subsequent question, delete 100% matches; load 2nd question, compare it with every subsequent question, and so on until all 100% matches have been dealt with.
An easy way of removing duplicates is to convert the list into a set. Doing so deletes all duplicates, in the mathematical sense of a set. For instance:
>>> set(["hello", "world", "hello", "world"])
{'world', 'hello'}
Makes sense?
You are a genius! Yes, that makes total sense. Thanks for the tip!
Posts: 1
Threads: 0
Joined: Jan 2018
I had a question: are you using Python to make this Excel file, or using Excel itself? Because if you convert it to a Google spreadsheet, you can script it to take out the duplicates and similar answers (I think; not quite sure how, though).
Posts: 16
Threads: 4
Joined: Jan 2018
(Jan-05-2018, 06:24 PM)Povellesto Wrote: I had a question: are you using Python to make this Excel file, or using Excel itself? Because if you convert it to a Google spreadsheet, you can script it to take out the duplicates and similar answers (I think; not quite sure how, though).
It's a little overly complicated. I had a developer who created a CMS for my pub trivia business. It's built on Electron and hosted on Azure. The developer dropped out a few months ago, and now the programs he created won't work anymore.
I was able to dig into the Azure data and find all of the SQL info (questions, timestamps, etc). I copied that info and pasted it into an Excel file. Excel is what we used in the past, before the custom CMS. I suppose there's no reason we can't go with another spreadsheet program, but my colleagues in TN aren't tech-savvy whatsoever, so I'd prefer to stick with formats and programs with which they are already familiar.
Posts: 4,804
Threads: 77
Joined: Jan 2018
Jan-05-2018, 07:41 PM
(This post was last modified: Jan-05-2018, 07:42 PM by Gribouillis.)
If you are using Python, the simplest storage until you have found the duplicates is a pickle file containing a list of tuples (or namedtuples). Once the duplicates are found, you can write a new Excel file.
The reason for this is that Python is slow when it reads Excel files, while loading such a pickle file with 30,000 records takes a fraction of a second.
That said, without more information about the contents of the data, it is difficult to elaborate a good strategy.
Posts: 16
Threads: 4
Joined: Jan 2018
(Jan-05-2018, 07:41 PM)Gribouillis Wrote: If you are using Python, the simplest storage until you have found the duplicates is a pickle file containing a list of tuples (or namedtuples). Once the duplicates are found, you can write a new Excel file.
The reason for this is that Python is slow when it reads Excel files, while loading such a pickle file with 30000 records takes a fraction of a second.
That said, without more information about the contents of the data, it is difficult to elaborate a good strategy.
Oh wow, ok, I'll look into that. I'm not familiar with pickle files, but I'm sure a quick Google will tell me everything I need to know. Thanks!
Posts: 4,804
Threads: 77
Joined: Jan 2018
Here is a basic example of writing such a pickle file. Use it once with WRITE = True to create the file foo.pkl, then use it as many times as you need with WRITE = False to use the data.
from collections import namedtuple
import pickle

Record = namedtuple('Record', "id foo bar baz")

WRITE = False

if WRITE:
    L = [Record(i, "a" * 100, "b" * 50, "c" * 50) for i in range(30000)]
    print('Writing')
    with open('foo.pkl', 'wb') as ofh:
        pickle.dump(L, ofh)
else:
    print('Reading')
    with open('foo.pkl', 'rb') as ifh:
        L = pickle.load(ifh)
    print(len(L))
    print(L[17865])
Posts: 16
Threads: 4
Joined: Jan 2018
Jan-05-2018, 09:35 PM
(This post was last modified: Jan-05-2018, 09:35 PM by karaokelove.)
(Jan-05-2018, 09:27 PM)Gribouillis Wrote: Here is a basic example of writing such a pickle file. Use it once with WRITE = True to create the file foo.pkl, then use it as many times as you need with WRITE = False to use the data.
from collections import namedtuple
import pickle

Record = namedtuple('Record', "id foo bar baz")

WRITE = False

if WRITE:
    L = [Record(i, "a" * 100, "b" * 50, "c" * 50) for i in range(30000)]
    print('Writing')
    with open('foo.pkl', 'wb') as ofh:
        pickle.dump(L, ofh)
else:
    print('Reading')
    with open('foo.pkl', 'rb') as ifh:
        L = pickle.load(ifh)
    print(len(L))
    print(L[17865])
Wow, thanks for this! I'm about to start working on it and this will be a huge help. This will be my first functional Python program, so I'm sure I'll be posting a billion more questions over the next few hours...
P.S. I'm totally new on these forums. Is there a good way to reply or tag someone without quoting their entire message?
Posts: 4,804
Threads: 77
Joined: Jan 2018
I don't know, I'm new too! It seems that you don't need to quote the message.