Grouping Candidates with same name

coolperson · Jul-11-2019, 05:11 PM

I am looking for a python program solution to this challenge:

Given a list of candidate demographics for numerous candidates, we want to be able to group the data by candidates with the same name. The demographics provided are (in order): Candidate ID, Candidate Name, Candidate Sex, Candidate Date Of Birth. For example, here's a sample input:

ID1,BROWN^JAMES,F,19890224
ID2,WILLIAMS^RORY,M,19881102
ID3,BROWN^JAMES,F,19890224
ID4,BROWN^JAMES,F,20010911

The expected output is:

0:
ID1,BROWN^JAMES,F,19890224
ID3,BROWN^JAMES,F,19890224
ID4,BROWN^JAMES,F,20010911
1:
ID2,WILLIAMS^RORY,M,19881102

Input

The program should accept a file as a parameter. The Candidate demographics fields are comma delimited, with newlines being used to designate new Candidates.

CANDIDATE ID, CANDIDATE NAME, CANDIDATE SEX, CANDIDATE DATE OF BIRTH

The format of the Candidate name is as follows:

LAST NAME^FIRST NAME^MIDDLE NAME

The middle name component is optional and may be omitted, but last and first name will always be present. Candidates with the same first and last name should be grouped together, even if the middle names don't match. Matches should also be case insensitive. So for the following input:

ID1,CLARA^OSWALD,F,19890224
ID2,CLARA^oswald^COLEMAN,F,19890224

the expected output would group these two together:

0:
ID1,CLARA^OSWALD,F,19890224
ID2,CLARA^oswald^COLEMAN,F,19890224

Output

A grouping of all the Candidates based on the first and last name of the Candidate. For each group, the output should look as follows:

N:
CANDIDATE ID, CANDIDATE NAME, CANDIDATE SEX, CANDIDATE DATE OF BIRTH (of match #1)
CANDIDATE ID, CANDIDATE NAME, CANDIDATE SEX, CANDIDATE DATE OF BIRTH (of match #2)
...

Where N is just incremented for each group. The output should be printed to standard out. The groups can be outputted in any order.

Complete Example

Input:

ID1,BROWN^JAMES,F,19890224
ID2,WILLIAMS^RORY,M,19881102
ID3,BROWN^JAMES,F,19890224
ID4,CLARA^OSWALD,F,19890224
ID5,BROWN^JAMES,F,20010911
ID6,CLAR^OSWALD,F,19890224
ID7,BROWN^AMELIA,F,20010911
ID8,CLARA^oswald,F,19890224
ID9,TYLER^ROSE,F,20000101
ID10,NOBLE^DONNA,F,19780405
ID11,TYLER^ROSE,F,20000101
ID12,NOBLE^DONN,F,19780405
ID13,TYLER^ROSE,F,20000102
ID14,CLARA^OSWALD^COLEMAN,F,19890224

Output

0:
ID1,BROWN^JAMES,F,19890224
ID3,BROWN^JAMES,F,19890224
ID5,BROWN^JAMES,F,20010911
1:
ID2,WILLIAMS^RORY,M,19881102
2:
ID4,CLARA^OSWALD,F,19890224
ID8,CLARA^oswald,F,19890224
ID14,CLARA^OSWALD^COLEMAN,F,19890224
3:
ID6,CLAR^OSWALD,F,19890224
4:
ID7,BROWN^AMELIA,F,20010911
5:
ID9,TYLER^ROSE,F,20000101
ID11,TYLER^ROSE,F,20000101
ID13,TYLER^ROSE,F,20000102
6:
ID10,NOBLE^DONNA,F,19780405
7:
ID12,NOBLE^DONN,F,19780405

**scidam** · Jul-11-2019, 11:42 PM

Are you allowed to use Pandas package? If so, look at Pandas docs, especially about grouping.
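
As a rough illustration of the pandas route, the sketch below loads the rows into a DataFrame, derives a grouping key from the upper-cased last and first name, and iterates over `groupby`. The column names, the `group_with_pandas` helper, and the in-memory sample are assumptions made for this example; they are not part of the challenge.

```python
import io

import pandas as pd

# Column names are made up for this sketch; the input file has no header row.
COLUMNS = ["id", "name", "sex", "dob"]


def group_with_pandas(text):
    """Return the output lines for comma-delimited candidate rows in `text`."""
    # dtype=str keeps the date of birth exactly as written in the input.
    df = pd.read_csv(io.StringIO(text), header=None, names=COLUMNS, dtype=str)

    # Grouping key: upper-cased "LAST^FIRST"; any middle name is dropped.
    df["key"] = df["name"].map(lambda n: "^".join(n.upper().split("^")[:2]))

    lines = []
    # sort=False keeps the groups in order of first appearance.
    for number, (_key, group) in enumerate(df.groupby("key", sort=False)):
        lines.append(f"{number}:")
        for row in group[COLUMNS].itertuples(index=False):
            lines.append(",".join(row))
    return lines


sample = """ID1,CLARA^OSWALD,F,19890224
ID2,WILLIAMS^RORY,M,19881102
ID3,CLARA^oswald^COLEMAN,F,19890224
"""
print("\n".join(group_with_pandas(sample)))
```

For a real run, the `io.StringIO(text)` argument could be swapped for the path of the input file.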

millpond · Jul-12-2019, 05:39 AM

If you're a beginner here, like myself, you might want to take a simpler approach using split (and only splitting once on the comma)

In a loop:
candidate_id, name_data = myString.split(',', 1)
name_data = name_data.upper()

Append each (name_data, candidate_id) pair to a list, sort the list, and write it out to another list or file in (candidate_id, name_data) order. Title-casing the output is optional, for neatness.

I've done something similar in Perl. I'm still learning Python, but it has a similar split.
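
One possible way to flesh out this split-and-sort idea is sketched below. It sorts on a (last, first) key rather than on the whole remainder of the line, because a middle name, sex, or date of birth in the sort key could separate rows that belong together. The helper names `name_key` and `group_by_sorting` are hypothetical, and the groups come out in alphabetical order, which the challenge allows.

```python
from itertools import groupby


def name_key(line):
    """Return (LAST, FIRST) in upper case for one input line."""
    # Split only once on the comma: the ID comes first, the rest follows.
    _candidate_id, rest = line.split(",", 1)
    name_field = rest.split(",", 1)[0]
    last, first = name_field.upper().split("^")[:2]
    return last, first


def group_by_sorting(lines):
    """Sort the lines by name key, then collect runs of equal keys."""
    ordered = sorted(lines, key=name_key)  # stable: ties keep input order
    return [list(run) for _key, run in groupby(ordered, key=name_key)]


rows = [
    "ID4,CLARA^OSWALD,F,19890224",
    "ID7,BROWN^AMELIA,F,20010911",
    "ID8,CLARA^oswald,F,19890224",
    "ID14,CLARA^OSWALD^COLEMAN,F,19890224",
]
for number, group in enumerate(group_by_sorting(rows)):
    print(f"{number}:")
    print("\n".join(group))
```

`itertools.groupby` only merges adjacent items, which is why the sort has to come first.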

**perfringo** · Jul-12-2019, 07:07 AM

Grouping requires the first and last name, so the task boils down to how to extract the names from a row.

Rows differ in structure or letter case:

ID4,CLARA^OSWALD,F,19890224
ID8,CLARA^oswald,F,19890224
ID14,CLARA^OSWALD^COLEMAN,F,19890224

What do these rows have in common? The last name comes before the first ^ and the first name right after it. Note that one example contains two ^. What happens if we split on this symbol:

>>> lst = ['ID4,CLARA^OSWALD,F,19890224', 'ID8,CLARA^oswald,F,19890224', 'ID14,CLARA^OSWALD^COLEMAN,F,19890224']
>>> for row in lst:
...     print(row.split('^'))
...
['ID4,CLARA', 'OSWALD,F,19890224']
['ID8,CLARA', 'oswald,F,19890224']
['ID14,CLARA', 'OSWALD', 'COLEMAN,F,19890224']

We can observe that the last name is in the first element, after the comma; the first name is in the second element, either before its first comma or as the whole element. Let's modify the code so that we get the first and last name:

>>> for row in lst:
...     splitted = row.split('^')
...     last_name, first_name = splitted[0].split(',')[1], splitted[1].split(',')[0]
...     print(last_name, first_name)
...
CLARA OSWALD
CLARA oswald
CLARA OSWALD

We can observe that the names use different letter cases; we should unify them using either the str.upper() or the str.lower() method.
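
A small sketch of that normalization step follows. It takes a slightly different route from the session above: it splits on the comma first and on ^ second. The helper name `normalized_name` is hypothetical.

```python
def normalized_name(row):
    """Return the (last, first) name of a row, upper-cased for comparison."""
    name_field = row.split(",")[1]           # e.g. "CLARA^oswald^COLEMAN"
    last, first = name_field.split("^")[:2]  # ignore the optional middle name
    return last.upper(), first.upper()


lst = [
    "ID4,CLARA^OSWALD,F,19890224",
    "ID8,CLARA^oswald,F,19890224",
    "ID14,CLARA^OSWALD^COLEMAN,F,19890224",
]
keys = {normalized_name(row) for row in lst}
print(keys)  # {('CLARA', 'OSWALD')}
```

All three rows collapse to a single key, so they would land in the same group.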

As this is homework, you should figure out for yourself what to do once you are able to get the names out of the rows.
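
For readers who want to compare notes afterwards, here is one possible sketch of the remaining steps: collect rows in a dictionary keyed by the normalized (last, first) name, then print each group under an incrementing number. The function names are hypothetical, the file path is assumed to arrive as the first command-line argument, and every row is assumed to be well formed.

```python
import sys
from collections import defaultdict


def group_candidates(lines):
    """Map each (LAST, FIRST) key to the rows that share it."""
    groups = defaultdict(list)  # dicts preserve insertion order (Python 3.7+)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        last, first = line.split(",")[1].upper().split("^")[:2]
        groups[(last, first)].append(line)
    return groups


def format_groups(groups):
    """Return the output lines: an 'N:' header followed by the group's rows."""
    out = []
    for number, rows in enumerate(groups.values()):
        out.append(f"{number}:")
        out.extend(rows)
    return out


def main(path):
    """Read candidate rows from `path` and print the groups to standard out."""
    with open(path) as handle:
        print("\n".join(format_groups(group_candidates(handle))))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Because the dictionary keeps insertion order, the groups are numbered in the order in which each name first appears in the file.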

coolperson · Jul-12-2019, 07:38 PM

Thank you all.
