Python Forum

Full Version: How to make my python 3.6 unnesting function correct?
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Hi All,

I have a csv file containing product details. From which I am extracting required informations.
There is one column in the csv-Options and it contains values like- [], [['Newborn (2.5-6kg)', 'Big Baby (6-9kg)']].

I have written an unnesting function and applying on Options so that values looks like= Newborn (2.5-6kg),Big Baby (6-9kg).

But th eproblem with this function is that it is removing the [] from the Options and so all rows containing Options as [] are removed and so I am loosing the products.

I need to accept empty Options values also so that there is no loss of product while unnesting.

PFB my unnesting function-

def unnest(df, col, col2,reset_index=False):
   col_flat = pd.DataFrame([[i, x]
                      for i, y in df[col].apply(list).iteritems()
                          for x in y ], columns=['I', col]
                           )
   col_flat = col_flat.set_index('I')
   df = df.drop(col, 1)
   df = df.merge(col_flat, left_index=True, right_index=True)

   if reset_index:
       df = df.reset_index(drop=True)
   merchant_product_code = (df['Product code'] + '-' + df[col])
   df['item_group_id'] = df['Product code']
   df['Product code'] = merchant_product_code
   return df
I am applying above function as below-
full_df = unnest(full_df,'Options','product code')
I have attached the full_df csv, please have a look and correct me what I am missing here.
I have no idea what would you like to do with data, but in with csv-s its often easier to use screwdriver (csv module) instead of hammer (pandas).

I would import csv file into ordered dict and would do any manipulation there:

import csv

with open('full.csv', 'r', encoding='UTF-8') as f:
    data = f.read().splitlines()
    products = list(csv.DictReader(data, delimiter=','))
Now all rows are in variable 'products' which is OrderedDict in form of:

Output:
OrderedDict([('Product code', 'PHBC'), ('Category', 'All/For Baby/Feeding /Pacifiers'), ('List price', '6.5'), ('Price', '6.5'), ('Weight', '0'), ('Quantity', '93'), ('Min quantity', '0'), ('Shipping freight', '0'), ('Date added', '07-01-2014 00:00'), .....
One can access needed column by:

products[0]['Options']
which will give:
Output:
"['Newborn (2.5-6kg)', 'Big Baby (6-9kg)']"
As empty lists are False, I would make needed conversions something like that (it will ignore []):

for row in products:
    if row['Options']:             
         # do your magic here
 
As said, I have no idea what kind of task you would like to accomplish, so it may help you or be totally irrelevant.
(Jul-20-2018, 08:16 AM)perfringo Wrote: [ -> ]As empty lists are False, I would make needed conversions something like that (it will ignore []):

Actually, because this comes from csv, this would be not list/empty list, but simple str, so "[]" would not be a empty list, thus your last snippet will not work as expected.

I guess, when information was scraped from the site and the csv was created the list of options was written as str.
Can you influence the format of the file?
(Jul-20-2018, 08:28 AM)buran Wrote: [ -> ]
(Jul-20-2018, 08:16 AM)perfringo Wrote: [ -> ]As empty lists are False, I would make needed conversions something like that (it will ignore []):

Actually, because this comes from csv, this would be not list/empty list, but simple str, so "[]" would not be a empty list, thus your last snippet will not work as expected.

You are right! I stand corrected. One should always inspect data types with precision.
in this case it's relatively safe to use eval

>>> product_options = eval("[['Newborn (2.5-6kg)', 'Big Baby (6-9kg)']]")
>>> product_options
[['Newborn (2.5-6kg)', 'Big Baby (6-9kg)']]
>>> product_options[0][0]
'Newborn (2.5-6kg)'
>>>
(Jul-20-2018, 08:40 AM)perfringo Wrote: [ -> ]
(Jul-20-2018, 08:28 AM)buran Wrote: [ -> ]Actually, because this comes from csv, this would be not list/empty list, but simple str, so "[]" would not be a empty list, thus your last snippet will not work as expected.

You are right! I stand corrected. One should always inspect data types with precision.

One way is to use re to parse list out of string.

import re

def string_to_list(s):
    '''Parse string and return list inside string'''
    refine = re.sub("[\[\],\s]","",s)
    splitted = re.split("[\'\"]", refine)
    result = [s for s in splitted if s != '']
    return result
Then:

for row in products:
    if string_to_list(row['Options']):
        # do your magic here
Output on file provided is:

Output:
['Newborn(2.5-6kg)', 'BigBaby(6-9kg)'] ['Newborn(2.5-6kg)', 'BigBaby(6-9kg)', 'Preemie(1.5-2.5kg)'] ['Newborn(2.5-6kg)', 'BigBaby(6-9kg)'] ['Newborn(2.5-6kg)', 'BigBaby(6-9kg)'] ['Newborn(2.5-6kg)', 'BigBaby(6-9kg)'] ['Newborn(2.5-6kg)', 'BigBaby(6-9kg)']
Note, that result of function is list; after doing magic it would be probably good idea to convert result back to string.
Hi All,

I am working on a ETL process using Python3.6.
For now I am facing one issue while unnesting 'Options' column in my dataframe.

Currently I am downloading two csv files from FTP location and then performing cleaning/transformation on them.
For one csv file my unnesting is working as expected by transforming the product code and options values as below-
[Image: view?usp=sharing]

But for the second csv file, options column is not unnested properly like below-
[Image: view?usp=sharing]

My code is in below link-
my python code

I have attached the dataframes generated from csvs with minimal columns and full dataframe also so that anyone can test.

Please see once and let me know if need anything from my side.