Hi,
I have the below source data
Student Age Subject
Adam 13 Maths,Biology
Alex 14 English
Shawn 15 Geography
Nathan 16 Maths,Biology,English
I want to convert the data into a table format like below in python.
Student Age Subject
Adam 13 Maths
Adam 13 Biology
Alex 14 English
Shawn 15 Geography
Nathan 16 Maths
Nathan 16 Biology
Nathan 16 English
Thanks in advance.
Hello, how do you read your data? Do you have any code that you've worked with so far? If so post it in Python code tags.
(Jan-29-2018, 07:18 AM)nsh_py Wrote: [ -> ]Hi,
I have the below source data, which is a csv file with semi-colon delimeter.
Student;Age;Subject
Adam;13;Maths,Biology
Alex;14;English
Shawn;15;Geography
Nathan;16;Maths,Biology,English
I want to convert the data into a table format like below in python.
Student Age Subject
Adam 13 Maths
Adam 13 Biology
Alex 14 English
Shawn 15 Geography
Nathan 16 Maths
Nathan 16 Biology
Nathan 16 English
Thanks in advance.
(Jan-29-2018, 07:22 AM)j.crater Wrote: [ -> ]Hello, how do you read your data? Do you have any code that you've worked with so far? If so post it in Python code tags.
Hi j.crater, the input file is a csv file with semi-colon delimeter. I tried few things with pandas but it is not giving me the output i needed.
thanks
As you have probably figured, you will need to read the csv file to extract the information.
Here is the section on Python's csv module with a few examples from the official docs. If you require more examples, there are countless sources on the Internet. If you will have trouble with the code or any sort of questions post it here and we will look into it.
Edit:
Post your code in Python code tags. That way we may help you find what needs to be corrected for the code to work as you want.
(Jan-29-2018, 07:43 AM)nsh_py Wrote: [ -> ]I tried few things with pandas but it is not giving me the output i needed.
You should show what you tried.
That's not the most easy structured data to work with 2 delimiter
;,
.
Here a run,i had to take it in step probably some method that could do it sorter.
The problem is that Pandas has 100's of methods,so if not using it often it easy to forget.
>>> import pandas as pd
>>> from pandas import DataFrame
>>> df = pd.read_csv('student.csv', delimiter=';')
>>> df
Student Age Subject
0 Adam 13 Maths,Biology
1 Alex 14 English
2 Shawn 15 Geography
3 Nathan 16 Maths,Biology,English
>>> b = DataFrame(df.Subject.str.split(',').tolist(), index=[df.Student, df.Age]).stack()
>>> b
Student Age
Adam 13 0 Maths
1 Biology
Alex 14 0 English
Shawn 15 0 Geography
Nathan 16 0 Maths
1 Biology
2 English
dtype: object
>>> # Clean up
>>> c = b.to_frame()
>>> c = c.reset_index()
>>> c = c.drop('level_2', 1)
>>> c = c.rename(columns = {0:'Subject'})
>>> c
Student Age Subject
0 Adam 13 Maths
1 Adam 13 Biology
2 Alex 14 English
3 Shawn 15 Geography
4 Nathan 16 Maths
5 Nathan 16 Biology
6 Nathan 16 English
>>> # It's now a working dataframe,eg sorting
>>> c.sort_values('Subject')
Student Age Subject
1 Adam 13 Biology
5 Nathan 16 Biology
2 Alex 14 English
6 Nathan 16 English
3 Shawn 15 Geography
0 Adam 13 Maths
4 Nathan 16 Maths
(Jan-29-2018, 06:16 PM)snippsat Wrote: [ -> ] (Jan-29-2018, 07:43 AM)nsh_py Wrote: [ -> ]I tried few things with pandas but it is not giving me the output i needed.
You should show what you tried.
That's not the most easy structured data to work with 2 delimiter ;,
.
Here a run,i had to take it in step probably some method that could do it sorter.
The problem is that Pandas has 100's of methods,so if not using it often it easy to forget.
>>> import pandas as pd
>>> from pandas import DataFrame
>>> df = pd.read_csv('student.csv', delimiter=';')
>>> df
Student Age Subject
0 Adam 13 Maths,Biology
1 Alex 14 English
2 Shawn 15 Geography
3 Nathan 16 Maths,Biology,English
>>> b = DataFrame(df.Subject.str.split(',').tolist(), index=[df.Student, df.Age]).stack()
>>> b
Student Age
Adam 13 0 Maths
1 Biology
Alex 14 0 English
Shawn 15 0 Geography
Nathan 16 0 Maths
1 Biology
2 English
dtype: object
>>> # Clean up
>>> c = b.to_frame()
>>> c = c.reset_index()
>>> c = c.drop('level_2', 1)
>>> c = c.rename(columns = {0:'Subject'})
>>> c
Student Age Subject
0 Adam 13 Maths
1 Adam 13 Biology
2 Alex 14 English
3 Shawn 15 Geography
4 Nathan 16 Maths
5 Nathan 16 Biology
6 Nathan 16 English
>>> # It's now a working dataframe,eg sorting
>>> c.sort_values('Subject')
Student Age Subject
1 Adam 13 Biology
5 Nathan 16 Biology
2 Alex 14 English
6 Nathan 16 English
3 Shawn 15 Geography
0 Adam 13 Maths
4 Nathan 16 Maths
Thanks snippsat. I am also trying with scikit too, will post the code soon.
Depending on this input:
Output:
Student;Age;Subject
Adam;13;Maths,Biology
Alex;14;English
Shawn;15;Geography
Nathan;16;Maths,Biology,English
You should use the
cvs
module,
sorted
and
itertools.groupby
and
operator.itemgetter
. Sometimes it's good to understand the use of generator expressions and list comprehensions. The module itertools should not be underestimated. I use
operator.itemgetter
to prevent the use of lambda expressions. I use star unpacking inside the tupel. I think this is possible since Python 3.5.
import itertools
import csv
import operator
def sort_by_subject(data):
reader = csv.reader(data.splitlines(), delimiter=';')
# reading the header
header = next(reader)
# generator expression to yield for every Subject a new dataset
# generators are lazy evaluated
data_each_subject = ((*row[0:2], subject) for row in reader for subject in row[2].split(','))
# sort by subject, which is the last field (-1)
# the sorted function iterates over the generator
sorted_data = sorted(data_each_subject, key=operator.itemgetter(-1))
# you could work with the sorted data or use groupby to group results
for key, group in itertools.groupby(sorted_data, operator.itemgetter(-1)):
# key is the object which is used to group
# group contains the group
for name, age, subject in group:
# iterating over the group,
yield name, age, subject
data = """Student;Age;Subject
Adam;13;Maths,Biology
Alex;14;English
Shawn;15;Geography
Nathan;16;Maths,Biology,English"""
import pprint
pprint.pprint(list(sort_by_subject(data)))