Python Forum

Full Version: Excel Question
Pages: 1 2
Hi all, I have a very specific Excel task I need to complete, and was hoping to get some advice.

I have a single Excel worksheet with roughly 30,000 trivia questions in it. However, it currently contains questions that are either very similar or straight-up duplicates. I need to delete all the duplicates, as well as questions that are "too similar" to other questions.

Where it gets rough is with the "too similar" questions. For instance, we have a lot of "If the model is a [car-model], what is the make?" questions. Basically, questions that are totally different but contain 90%+ similar wording. A simple percentage-based comparison algorithm would flag these as duplicates and delete usable questions.
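For what it's worth, the standard library's difflib can put a number on that similarity. The two car questions below are made-up examples, but they show how two perfectly distinct questions can still score far above a typical duplicate threshold:

```python
from difflib import SequenceMatcher

# Two hypothetical questions that differ only in the car model.
q1 = "If the model is a Mustang, what is the make?"
q2 = "If the model is a Corvette, what is the make?"

# ratio() returns a similarity score between 0.0 and 1.0.
ratio = SequenceMatcher(None, q1, q2).ratio()
print(f"{ratio:.0%}")
```

Both questions are keepers, yet they score in the mid-80s percent similar, so any blind "delete above N%" rule will have false positives.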

---

So here is the solution I'm currently toying with:

1st, run a pass to delete 100% matches. Load first question, compare it with every subsequent question, delete 100% matches; load 2nd question, compare it with every subsequent question, and so on until all 100% matches have been dealt with.

Then I would make a simple gui application where I could select a percentage and get a list of all questions that had matches within that percentage, and could decide what is and isn't a duplicate on a case-by-case basis.
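As a rough sketch of that two-pass plan (the sample questions are invented, and the 90% threshold is just a placeholder for whatever the GUI slider would pick), pass 1 drops exact matches while pass 2 only *reports* the near-matches for case-by-case review:

```python
from difflib import SequenceMatcher

questions = [
    "What is the capital of France?",
    "What is the capital of France?",          # exact duplicate
    "What's the capital of France?",           # near-duplicate
    "If the model is a Mustang, what is the make?",
]

# Pass 1: delete 100% matches, keeping the first occurrence and the order.
unique = list(dict.fromkeys(questions))

# Pass 2: list every remaining pair above a chosen similarity threshold,
# so a human can decide what is and isn't a duplicate.
THRESHOLD = 0.90
for i, a in enumerate(unique):
    for b in unique[i + 1:]:
        score = SequenceMatcher(None, a, b).ratio()
        if score >= THRESHOLD:
            print(f"{score:.0%}  {a!r}  vs  {b!r}")
```

With 30,000 questions the all-pairs loop in pass 2 is O(n²), so it may take a while; running pass 1 first at least shrinks the input.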

---

So that's where I am, but I'm a total Python noob and wasn't sure if there were any packages or anything that could help with this kind of task. Any advice is appreciated. Thanks!
Quote:1st, run a pass to delete 100% matches. Load first question, compare it with every subsequent question, delete 100% matches; load 2nd question, compare it with every subsequent question, and so on until all 100% matches have been dealt with.

An easy way of removing duplicates is to convert them into a set; a set, in the mathematical sense, cannot contain duplicates. For instance:

>>> set(["hello", "world", "hello", "world"])
{'world', 'hello'}
Makes sense?
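One caveat worth knowing: a set throws away the original order of the questions. If the worksheet order matters, dict.fromkeys does the same deduplication while keeping each question's first occurrence in place:

```python
questions = ["hello", "world", "hello", "world"]

# Dict keys are unique, and (since Python 3.7) preserve insertion order,
# so this removes duplicates without scrambling the list.
unique = list(dict.fromkeys(questions))
print(unique)  # → ['hello', 'world']
```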
(Jan-05-2018, 06:04 PM)hshivaraj Wrote: [ -> ]
Quote:1st, run a pass to delete 100% matches. Load first question, compare it with every subsequent question, delete 100% matches; load 2nd question, compare it with every subsequent question, and so on until all 100% matches have been dealt with.

An easy way of removing duplicates is to convert them into a set; a set, in the mathematical sense, cannot contain duplicates. For instance:

>>> set(["hello", "world", "hello", "world"])
{'world', 'hello'}
Makes sense?

You are a genius! Yes, that makes total sense. Thanks for the tip!
I had a question: are you using Python to make this Excel file, or using Excel itself? Because if you convert it to a Google spreadsheet, you can code it to take out the duplicates and similar answers (I think; I'm not quite sure how, though).
(Jan-05-2018, 06:24 PM)Povellesto Wrote: [ -> ]I had a question: are you using Python to make this Excel file, or using Excel itself? Because if you convert it to a Google spreadsheet, you can code it to take out the duplicates and similar answers (I think; I'm not quite sure how, though).

It's a little overly-complicated. I had a developer who created a CMS for my pub trivia business. It's built on Electron and hosted on Azure. The developer dropped out a few months ago and now the programs he created won't work anymore.

I was able to dig into the Azure data and find all of the SQL info (questions, timestamps, etc). I copied that info and pasted it into an Excel file. Excel is what we used in the past, before the custom CMS. I suppose there's no reason we can't go with another spreadsheet program, but my colleagues in TN aren't tech-savvy whatsoever, so I'd prefer to stick with formats and programs with which they are already familiar.
If you are using Python, the simplest storage until you have found the duplicates is a pickle file containing a list of tuples (or namedtuples). Once the duplicates are found, you can write a new Excel file.

The reason for this is that Python is slow when it reads Excel files, while loading such a pickle file with 30000 records takes a fraction of a second.

That said, without more information about the contents of the data, it is difficult to elaborate a good strategy.
(Jan-05-2018, 07:41 PM)Gribouillis Wrote: [ -> ]If you are using Python, the simplest storage until you have found the duplicates is a pickle file containing a list of tuples (or namedtuples). Once the duplicates are found, you can write a new Excel file.

The reason for this is that Python is slow when it reads Excel files, while loading such a pickle file with 30000 records takes a fraction of a second.

That said, without more information about the contents of the data, it is difficult to elaborate a good strategy.

Oh wow, ok, I'll look into that. I'm not familiar with pickle files, but I'm sure a quick Google will tell me everything I need to know. Thanks!
Here is a basic example of writing such a pickle file. Use it once with WRITE = True to create the file foo.pkl, then use it as many times as you need with WRITE = False to use the data.
from collections import namedtuple
import pickle

# One record per row; adjust the field names to match the real columns.
Record = namedtuple('Record', "id foo bar baz")

WRITE = False  # set to True once to create foo.pkl


if WRITE:
    # Dummy data standing in for the 30,000 questions.
    L = [Record(i, "a" * 100, "b" * 50, "c" * 50) for i in range(30000)]
    print('Writing')
    with open('foo.pkl', 'wb') as ofh:
        pickle.dump(L, ofh)
else:
    print('Reading')
    with open('foo.pkl', 'rb') as ifh:
        L = pickle.load(ifh)
print(len(L))
print(L[17865])
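Once the records are loaded from the pickle, the 100%-match pass can work directly on whichever field holds the question text. A minimal sketch, using the placeholder foo field from the example above to stand in for the real question column:

```python
from collections import namedtuple

# Same shape as the pickle example; 'foo' stands in for the question text.
Record = namedtuple('Record', "id foo bar baz")

records = [
    Record(1, "What is the capital of France?", "", ""),
    Record(2, "What is the capital of Spain?", "", ""),
    Record(3, "What is the capital of France?", "", ""),  # duplicate text
]

# Keep only the first record for each distinct question text.
seen = set()
deduped = []
for rec in records:
    if rec.foo not in seen:
        seen.add(rec.foo)
        deduped.append(rec)

print(len(deduped))  # → 2
```

Deduplicating on one field (rather than the whole tuple) matters here, because two rows with identical question text can still differ in id or timestamp.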
(Jan-05-2018, 09:27 PM)Gribouillis Wrote: [ -> ]Here is a basic example of writing such a pickle file. Use it once with WRITE = True to create the file foo.pkl, then use it as many times as you need with WRITE = False to use the data. [...]

Wow, thanks for this! I'm about to start working on it and this will be a huge help. This will be my first functional Python program, so I'm sure I'll be posting a billion more questions over the next few hours...

P.S. I'm totally new on these forums. Is there a good way to reply or tag someone without quoting their entire message?
I don't know, I'm new too! It seems that you don't need to quote the message.