Python Forum - Normalising a csv file with Python

Hi,

I have the below source data

Student Age Subject
Adam 13 Maths,Biology
Alex 14 English
Shawn 15 Geography
Nathan 16 Maths,Biology,English

I want to convert the data into a table format like below in python.

Student Age Subject
Adam 13 Maths
Adam 13 Biology
Alex 14 English
Shawn 15 Geography
Nathan 16 Maths
Nathan 16 Biology
Nathan 16 English

Thanks in advance.

Hello, how do you read your data? Do you have any code that you've worked with so far? If so post it in Python code tags.

(Jan-29-2018, 07:18 AM)nsh_py Wrote: [ -> ]Hi,

I have the below source data, which is a csv file with semi-colon delimeter.

Student;Age;Subject
Adam;13;Maths,Biology
Alex;14;English
Shawn;15;Geography
Nathan;16;Maths,Biology,English

I want to convert the data into a table format like below in python.

Student Age Subject
Adam 13 Maths
Adam 13 Biology
Alex 14 English
Shawn 15 Geography
Nathan 16 Maths
Nathan 16 Biology
Nathan 16 English

Thanks in advance.

(Jan-29-2018, 07:22 AM)j.crater Wrote: [ -> ]Hello, how do you read your data? Do you have any code that you've worked with so far? If so post it in Python code tags.

Hi j.crater, the input file is a csv file with semi-colon delimeter. I tried few things with pandas but it is not giving me the output i needed.

thanks

As you have probably figured, you will need to read the csv file to extract the information. Here is the section on Python's csv module with a few examples from the official docs. If you require more examples, there are countless sources on the Internet. If you will have trouble with the code or any sort of questions post it here and we will look into it.

Edit:
Post your code in Python code tags. That way we may help you find what needs to be corrected for the code to work as you want.

(Jan-29-2018, 07:43 AM)nsh_py Wrote: [ -> ]I tried few things with pandas but it is not giving me the output i needed.

You should show what you tried.
That's not the most easy structured data to work with 2 delimiter ;,.
Here a run,i had to take it in step probably some method that could do it sorter.
The problem is that Pandas has 100's of methods,so if not using it often it easy to forget.

>>> import pandas as pd
>>> from pandas import DataFrame

>>> df = pd.read_csv('student.csv', delimiter=';')
>>> df
  Student  Age                Subject
0    Adam   13          Maths,Biology
1    Alex   14                English
2   Shawn   15              Geography
3  Nathan   16  Maths,Biology,English

>>> b = DataFrame(df.Subject.str.split(',').tolist(), index=[df.Student, df.Age]).stack()
>>> b
Student  Age
Adam     13   0        Maths
              1      Biology
Alex     14   0      English
Shawn    15   0    Geography
Nathan   16   0        Maths
              1      Biology
              2      English
dtype: object

>>> # Clean up
>>> c = b.to_frame()
>>> c = c.reset_index()
>>> c = c.drop('level_2', 1)
>>> c = c.rename(columns = {0:'Subject'})
>>> c
  Student  Age    Subject
0    Adam   13      Maths
1    Adam   13    Biology
2    Alex   14    English
3   Shawn   15  Geography
4  Nathan   16      Maths
5  Nathan   16    Biology
6  Nathan   16    English

>>> # It's now a working dataframe,eg sorting
>>> c.sort_values('Subject')
  Student  Age    Subject
1    Adam   13    Biology
5  Nathan   16    Biology
2    Alex   14    English
6  Nathan   16    English
3   Shawn   15  Geography
0    Adam   13      Maths
4  Nathan   16      Maths

(Jan-29-2018, 06:16 PM)snippsat Wrote: [ -> ]

(Jan-29-2018, 07:43 AM)nsh_py Wrote: [ -> ]I tried few things with pandas but it is not giving me the output i needed.

>>> import pandas as pd
>>> from pandas import DataFrame

>>> df = pd.read_csv('student.csv', delimiter=';')
>>> df
  Student  Age                Subject
0    Adam   13          Maths,Biology
1    Alex   14                English
2   Shawn   15              Geography
3  Nathan   16  Maths,Biology,English

>>> b = DataFrame(df.Subject.str.split(',').tolist(), index=[df.Student, df.Age]).stack()
>>> b
Student  Age
Adam     13   0        Maths
              1      Biology
Alex     14   0      English
Shawn    15   0    Geography
Nathan   16   0        Maths
              1      Biology
              2      English
dtype: object

>>> # Clean up
>>> c = b.to_frame()
>>> c = c.reset_index()
>>> c = c.drop('level_2', 1)
>>> c = c.rename(columns = {0:'Subject'})
>>> c
  Student  Age    Subject
0    Adam   13      Maths
1    Adam   13    Biology
2    Alex   14    English
3   Shawn   15  Geography
4  Nathan   16      Maths
5  Nathan   16    Biology
6  Nathan   16    English

>>> # It's now a working dataframe,eg sorting
>>> c.sort_values('Subject')
  Student  Age    Subject
1    Adam   13    Biology
5  Nathan   16    Biology
2    Alex   14    English
6  Nathan   16    English
3   Shawn   15  Geography
0    Adam   13      Maths
4  Nathan   16      Maths

Thanks snippsat. I am also trying with scikit too, will post the code soon.

Depending on this input:

Output:Student;Age;Subject
Adam;13;Maths,Biology
Alex;14;English
Shawn;15;Geography
Nathan;16;Maths,Biology,English

You should use the cvs module, sorted and itertools.groupby and operator.itemgetter. Sometimes it's good to understand the use of generator expressions and list comprehensions. The module itertools should not be underestimated. I use operator.itemgetter to prevent the use of lambda expressions. I use star unpacking inside the tupel. I think this is possible since Python 3.5.

import itertools
import csv
import operator


def sort_by_subject(data):
    reader = csv.reader(data.splitlines(), delimiter=';')
    # reading the header
    header = next(reader)
    # generator expression to yield for every Subject a new dataset
    # generators are lazy evaluated
    data_each_subject = ((*row[0:2], subject) for row in reader for subject in row[2].split(','))
    # sort by subject, which is the last field (-1)
    # the sorted function iterates over the generator
    sorted_data = sorted(data_each_subject, key=operator.itemgetter(-1))
    # you could work with the sorted data or use groupby to group results
    for key, group in itertools.groupby(sorted_data, operator.itemgetter(-1)):
        # key is the object which is used to group
        # group contains the group
        for name, age, subject in group:
            # iterating over the group,
            yield name, age, subject


data = """Student;Age;Subject
Adam;13;Maths,Biology
Alex;14;English
Shawn;15;Geography
Nathan;16;Maths,Biology,English"""


import pprint
pprint.pprint(list(sort_by_subject(data)))