Python Forum
Grouping Candidates with same name
#1
I am looking for a Python program that solves this challenge:

Given a list of candidate demographics for numerous candidates, we want to be able to group the data by candidates with the same name. The demographics provided are (in order): Candidate ID, Candidate Name, Candidate Sex, Candidate Date Of Birth. For example, here's a sample input:

ID1,BROWN^JAMES,F,19890224
ID2,WILLIAMS^RORY,M,19881102
ID3,BROWN^JAMES,F,19890224
ID4,BROWN^JAMES,F,20010911


The expected output is:

0:
ID1,BROWN^JAMES,F,19890224
ID3,BROWN^JAMES,F,19890224
ID4,BROWN^JAMES,F,20010911
1:
ID2,WILLIAMS^RORY,M,19881102



Input

The program should accept a file as a parameter. The Candidate demographics fields are comma delimited, with newlines being used to designate new Candidates.

CANDIDATE ID, CANDIDATE NAME, CANDIDATE SEX, CANDIDATE DATE OF BIRTH

The format of the Candidate name is as follows:

LAST NAME^FIRST NAME^MIDDLE NAME

The middle name component is optional and may be omitted, but last and first name will always be present. We should consider Candidates with the same first and last name to be grouped together, even if the middle names don't match. Matches should also be case insensitive. So for the following input:

ID1,CLARA^OSWALD,F,19890224
ID2,CLARA^oswald^COLEMAN,F,19890224


the expected output would group these two together:

0:
ID1,CLARA^OSWALD,F,19890224
ID2,CLARA^oswald^COLEMAN,F,19890224


Output

A grouping of all the Candidates based on the first and last name of the Candidate. For each group, the output should look as follows:

N:
CANDIDATE ID, CANDIDATE NAME, CANDIDATE SEX, CANDIDATE DATE OF BIRTH (of match #1)
CANDIDATE ID, CANDIDATE NAME, CANDIDATE SEX, CANDIDATE DATE OF BIRTH (of match #2)
...

Where N is incremented for each group, starting at 0. The output should be printed to standard out. The groups can be output in any order.

Complete Example

Input:


ID1,BROWN^JAMES,F,19890224
ID2,WILLIAMS^RORY,M,19881102
ID3,BROWN^JAMES,F,19890224
ID4,CLARA^OSWALD,F,19890224
ID5,BROWN^JAMES,F,20010911
ID6,CLAR^OSWALD,F,19890224
ID7,BROWN^AMELIA,F,20010911
ID8,CLARA^oswald,F,19890224
ID9,TYLER^ROSE,F,20000101
ID10,NOBLE^DONNA,F,19780405
ID11,TYLER^ROSE,F,20000101
ID12,NOBLE^DONN,F,19780405
ID13,TYLER^ROSE,F,20000102
ID14,CLARA^OSWALD^COLEMAN,F,19890224


Output

0:
ID1,BROWN^JAMES,F,19890224
ID3,BROWN^JAMES,F,19890224
ID5,BROWN^JAMES,F,20010911
1:
ID2,WILLIAMS^RORY,M,19881102
2:
ID4,CLARA^OSWALD,F,19890224
ID8,CLARA^oswald,F,19890224
ID14,CLARA^OSWALD^COLEMAN,F,19890224
3:
ID6,CLAR^OSWALD,F,19890224
4:
ID7,BROWN^AMELIA,F,20010911
5:
ID9,TYLER^ROSE,F,20000101
ID11,TYLER^ROSE,F,20000101
ID13,TYLER^ROSE,F,20000102
6:
ID10,NOBLE^DONNA,F,19780405
7:
ID12,NOBLE^DONN,F,19780405
#2
Are you allowed to use the pandas package? If so, look at the pandas docs, especially the section on grouping.
#3
If you're a beginner here, like myself, you might want to take a simpler approach using split (splitting only once, on the first comma).

In a loop:
id, name_data = myString.split(',', 1)
name_data = name_data.upper()

Append (name_data, id) to a list, sort the list, and write it out to another list or file in (id, name_data) order, title-casing the output for neatness.

I've done something similar in Perl (I'm still learning Python), and Python has a similar 'split'.
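
Here is a rough sketch of the sort-then-group idea, not a definitive solution. It assumes every row has the four comma-separated fields and that the file path arrives as the first command-line argument. The names name_key, group_rows and main are made up for illustration. It prints the rows unchanged, since the expected output keeps the original letter case.

```python
import sys
from itertools import groupby


def name_key(row):
    # Hypothetical helper: (LAST, FIRST) in upper case, middle name ignored.
    name = row.split(',', 2)[1]
    last_name, first_name = name.split('^')[:2]
    return last_name.upper(), first_name.upper()


def group_rows(rows):
    # sorted() is stable, so rows keep their input order within a group.
    ordered = sorted(rows, key=name_key)
    # groupby() only merges adjacent rows, which is why we sort first.
    return [list(group) for _, group in groupby(ordered, key=name_key)]


def main(path):
    with open(path) as handle:
        rows = [line.strip() for line in handle if line.strip()]
    for number, group in enumerate(group_rows(rows)):
        print(f'{number}:')
        for row in group:
            print(row)


if __name__ == '__main__':
    if len(sys.argv) > 1:
        main(sys.argv[1])
```

Groups come out in sorted name order rather than in order of first appearance, which the task allows.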
#4
To group the rows we need the first and last name, so this task boils down to how to extract the names from a row.

The rows differ in structure and letter case:

ID4,CLARA^OSWALD,F,19890224
ID8,CLARA^oswald,F,19890224
ID14,CLARA^OSWALD^COLEMAN,F,19890224

What do these rows have in common? The last name comes before the first ^ and the first name comes after it. Note that one row contains two ^. Let's see what happens if we split on this symbol:

>>> lst = ['ID4,CLARA^OSWALD,F,19890224', 'ID8,CLARA^oswald,F,19890224', 'ID14,CLARA^OSWALD^COLEMAN,F,19890224']
>>> for row in lst:
...     print(row.split('^'))
...
['ID4,CLARA', 'OSWALD,F,19890224']
['ID8,CLARA', 'oswald,F,19890224']
['ID14,CLARA', 'OSWALD', 'COLEMAN,F,19890224']
We can observe that the last name is in the first element, after the comma; the first name is in the second element, either before the first comma or as the whole element. Let's modify the code so that we get the first and last name:

>>> for row in lst: 
...     splitted = row.split('^') 
...     last_name, first_name = splitted[0].split(',')[1], splitted[1].split(',')[0] 
...     print(last_name, first_name)
...
CLARA OSWALD
CLARA oswald
CLARA OSWALD
We can observe that the names differ in letter case, so we should unify them using either the str.upper() or the str.lower() method.
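
A minimal sketch of that normalization step, assuming each row has the four comma-separated fields. The function name name_key is just an illustrative choice. It splits on the comma first, so the name field is isolated before splitting on ^:

```python
def name_key(row):
    """Return (LAST, FIRST) in upper case for one input row."""
    # Field 1 (0-based) is the name: LAST^FIRST or LAST^FIRST^MIDDLE.
    name = row.split(',')[1]
    # Keep only the first two components; a middle name is ignored.
    last_name, first_name = name.split('^')[:2]
    return last_name.upper(), first_name.upper()


rows = ['ID4,CLARA^OSWALD,F,19890224',
        'ID8,CLARA^oswald,F,19890224',
        'ID14,CLARA^OSWALD^COLEMAN,F,19890224']
for row in rows:
    print(name_key(row))
```

All three rows give the same tuple, ('CLARA', 'OSWALD'), so it can serve as the key to group on.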

As this is homework, you should figure out for yourself what to do once you are able to get the names out of the rows.
#5
Thank you all.