Python Forum
"ValueError: Index contains duplicate entries, cannot reshape" error when I try to use
#1
I have data that includes an ID, gender, collection time, test name, test value and unit of measurement.

The Test column lists every test a patient has taken, and the value column (Test_Result in the code below) holds the corresponding result.

I want to analyse only certain tests and retrieve their values from the value column. Since the analysis is on those tests and their values, I thought it would be a good idea to pivot on the test names and test values. However, when I include the TSH test I get an error; adding any other test name to the MultiIndex code does not raise an error.

Steps:

df_s.head(30).dropna()
In the data, multiple tests are taken for each requisition ID.
1# Keeping only the tests I want to analyse:
df_s2 = df_s[df_s['Test'].isin(['TOTAL TRIIODOTHYRONINE (T3)','TOTAL THYROXINE (T4)','FREE THYROID 3','FREE THYROID 4','Human Chorionic Gonadotropin (hCG)','BILRUBIN'])]
2# Resetting the index:
df_s3=df_s2.set_index(['ID', 'Name', 'Age', 'Sex', 'CT', 'RT', 'Test', 'Test_Result', 'Units']).reset_index()
3# Applying the MultiIndex:
idx = pd.MultiIndex.from_arrays([df_s3['ID'], df_s3['Name'], df_s3['Age'], df_s3['Sex'], df_s3['CT'],df_s3['RT'], df_s3['Units'], df_s3['Test'],  ])
#, df_s3['Unit of Measure']
df_s5 = df_s3.set_index(idx).Test_Result.unstack(fill_value='')
df_s5.columns.name = None
df_s6= df_s5.reset_index()
df_s6.head(100)
I get this result if I do not include TSH (from the Test column):
Output:
ID Name Age Sex CT RT Units BILRUBIN FREE THYROID 3 FREE THYROID 4 Human Chorionic Gonadotropin (hCG) TOTAL THYROXINE (T4) TOTAL TRIIODOTHYRONINE (T3)
0 RQ0556048140 Madhuri Dev 28 Years Female 8/01/2019 10:30 8/01/2019 14:39 ng/dl 1.93
1 RQ0556048140 Madhuri Dev 28 Years Female 8/01/2019 10:30 8/01/2019 14:39 μg/dl 18.70
2 RQ06916497688 B/O BARSARANI BISWAL 10 Month(s) 15 Day(s) Female 17/06/2019 12:00 17/06/2019 14:41 ng/dl 179.82
3 RQ06916497688 B/O BARSARANI BISWAL 10 Month(s) 15 Day(s) Female 17/06/2019 12:00 17/06/2019 14:41 μg/dl 18.30
4 RQ09026492462 Sri Hemanta Sarkar 46 Years Male 30/04/2018 08:35 30/04/2018 14:30 ng/dl 1.15
5 RQ09026492462 Sri Hemanta Sarkar 46 Years Male 30/04/2018 08:35 30/04/2018 14:30 μg/dl 9.20
6 RQ1001489038840 RENUKA MAHAPATRA 65 Years Female 28/07/2019 08:20 28/07/2019 13:16 ng/dl 90
7 RQ1001489038840 RENUKA MAHAPATRA 65 Years Female 28/07/2019 08:20 28/07/2019 13:16 μg/dl 7.40
8 RQ1004195473943 Mrs Mamata Samantray 45 Years Female 23/09/2017 11:40 23/09/2017 13:13 ng/dl 1.58
9 RQ1004195473943 Mrs Mamata Samantray 45 Years Female 23/09/2017 11:40 23/09/2017 13:13 μg/dl 15.60
10 RQ1009478939089 Sabita Lenka 30 Years Female 11/06/2017 13:00 12/06/2017 10:10 ng/dl 1.78
11 RQ1009478939089 Sabita Lenka 30 Years Female 11/06/2017 13:00 12/06/2017 10:10 μg/dl 12.50
12 RQ1012532242276 Sanjukta Mishra 47 Years Female 19/03/2018 11:30 19/03/2018 16:35 ng/dl 0.66
13 RQ1012532242276 Sanjukta Mishra 47 Years Female 19/03/2018 11:30 19/03/2018 16:35 μg/dl 6.40
14 RQ1013250484240 Mrs Abha Kansari 45 Years Female 27/07/2017 11:20 27/07/2017 12:42 ng/dl NaN
15 RQ1013250484240 Mrs Abha Kansari 45 Years Female 27/07/2017 11:20 27/07/2017 12:42 μg/dl NaN
16 RQ1013716969697 Madhusmita Sahu 17 Years Female 31/07/2017 11:40 31/07/2017 13:38 ng/dl 0.29
17 RQ1013716969697 Madhusmita Sahu 17 Years Female 31/07/2017 11:40 31/07/2017 13:38 μg/dl 0.70
18 RQ10189073348 Sumati Mahapatra 55 Years Female 11/02/2017 09:30 11/02/2017 14:14 ng/dl 0.90
19 RQ10189073348 Sumati Mahapatra 55 Years Female 11/02/2017 09:30 11/02/2017 14:14 μg/dl 9.10
20 RQ101981055296 NARMADA GUPTA 50 Years Female 23/08/2019 09:45 23/08/2019 17:25 ng/dl 105
21 RQ101981055296 NARMADA GUPTA 50 Years Female 23/08/2019 09:45 23/08/2019 17:25 μg/dl 7.10
22 RQ102281766132 Pyari Xalxo 39 Years Female 4/03/2017 10:10 4/03/2017 13:10 mIU/ml 28640
23 RQ102281766132 Pyari Xalxo 39 Years Female 4/03/2017 10:10 4/03/2017 13:10 ng/dl 1.67
24 RQ102281766132 Pyari Xalxo 39 Years Female 4/03/2017 10:10 4/03/2017 13:10 μg/dl 13.10
25 RQ1023270930913 VICTORIA KISPOTTA 42 Years Female 9/09/2019 11:50 9/09/2019 12:22 ng/dl 82
26 RQ1023270930913 VICTORIA KISPOTTA 42 Years Female 9/09/2019 11:50 9/09/2019 12:22 μg/dl 5.78
27 RQ1026366989473 PRATIMA PATNAIK 38 Years Female 8/07/2019 01:15 8/07/2019 16:37 ng/dl 88.55
28 RQ1026366989473 PRATIMA PATNAIK 38 Years Female 8/07/2019 01:15 8/07/2019 16:37 μg/dl 10.70
29 RQ1028984992315 Agastin Horo 40 Years Female 8/02/2017 10:20 8/02/2017 16:44 ng/dl 0.81
... ... ... ... ... ... ... ... ... ... ... ... ... ...
70 RQ1076842665319 Puspalata Behera 23 Years Female 28/02/2017 11:30 28/02/2017 13:29 μg/dl 7.90
71 RQ1078176595194 Pramadini Chhatoi 37 Years Female 31/03/2018 10:00 31/03/2018 13:54 ng/dl 0.96
72 RQ1078176595194 Pramadini Chhatoi 37 Years Female 31/03/2018 10:00 31/03/2018 13:54 μg/dl 9.80
73 RQ1082829630987 Reena Acharya 52 Years Female 13/11/2017 09:10 13/11/2017 11:49 ng/dl 1.07
74 RQ1082829630987 Reena Acharya 52 Years Female 13/11/2017 09:10 13/11/2017 11:49 μg/dl 11.30
75 RQ1084664624181 NIBEDITA KAR 36 Years Female 29/05/2019 10:14 29/05/2019 10:14 ng/dl 97.25
76 RQ1084664624181 NIBEDITA KAR 36 Years Female 29/05/2019 10:14 29/05/2019 10:14 μg/dl 8.10
77 RQ108506693161 Vinod Hemrom 40 Years Male 24/10/2018 12:00 24/10/2018 16:24 ng/dl 1.26
78 RQ108506693161 Vinod Hemrom 40 Years Male 24/10/2018 12:00 24/10/2018 16:24 μg/dl 11.80
79 RQ109122773470 Tara Bhadur 23 Years Female 23/06/2018 11:30 23/06/2018 15:11 ng/dl 1.35
80 RQ109122773470 Tara Bhadur 23 Years Female 23/06/2018 11:30 23/06/2018 15:11 μg/dl 7.70
81 RQ109263648697 Jyoti Thakur 35 Years Female 15/09/2018 11:30 15/09/2018 16:22 ng/dl 1.01
82 RQ109263648697 Jyoti Thakur 35 Years Female 15/09/2018 11:30 15/09/2018 16:22 μg/dl 9.50
83 RQ1093448652128 PUSPITA MISHRA 23 Years Female 30/04/2019 09:45 30/04/2019 19:37 ng/dl 83
84 RQ1093448652128 PUSPITA MISHRA 23 Years Female 30/04/2019 09:45 30/04/2019 19:37 μg/dl 6.10
85 RQ109359752914 HIRAMANI KACHAP 30 Years Female 14/08/2019 03:00 14/08/2019 18:26 ng/dl 88
86 RQ109359752914 HIRAMANI KACHAP 30 Years Female 14/08/2019 03:00 14/08/2019 18:26 μg/dl 6.50
87 RQ1097475978863 CHULESWARI PATRA 18 Years Female 1/09/2019 10:30 1/09/2019 11:08 ng/dl 88
88 RQ1097475978863 CHULESWARI PATRA 18 Years Female 1/09/2019 10:30 1/09/2019 11:08 μg/dl 6.90
89 RQ1098576134741 S PATEL 29 Years Female 30/07/2017 10:30 31/07/2017 15:17 ng/dl 1.14
90 RQ1098576134741 S PATEL 29 Years Female 30/07/2017 10:30 31/07/2017 15:17 μg/dl 12.70
91 RQ1098887741955 L MUKHI 32 Years Female 5/08/2019 05:00 5/08/2019 18:02 ng/dl 118
92 RQ1098887741955 L MUKHI 32 Years Female 5/08/2019 05:00 5/08/2019 18:02 μg/dl 8.50
93 RQ1099369598030 Amit Sahoo 38 Years Male 28/03/2019 09:14 28/03/2019 09:51 ng/dl 1.88
94 RQ1099369598030 Amit Sahoo 38 Years Male 28/03/2019 09:14 28/03/2019 09:51 μg/dl 10.70
95 RQ1101382949711 MEHROOM NISHA 50 Years Female 29/04/2019 08:55 29/04/2019 10:43 ng/dl 172.77
96 RQ1101382949711 MEHROOM NISHA 50 Years Female 29/04/2019 08:55 29/04/2019 10:43 μg/dl 10.10
97 RQ1103159227767 Mamata Das Mahapatra 30 Years Female 5/01/2019 11:20 5/01/2019 11:20 ng/dl 1.21
98 RQ1103159227767 Mamata Das Mahapatra 30 Years Female 5/01/2019 11:20 5/01/2019 11:20 μg/dl 9.10
99 RQ1114005283147 Manji Kaur 47 Years Female 19/06/2017 09:00 19/06/2017 10:50 ng/dl
Retry 1#: the same code with the TSH test included:

df_s2 = df_s[df_s['Test'].isin(['TOTAL TRIIODOTHYRONINE (T3)','TOTAL THYROXINE (T4)','THYROID STIMULATING HORMONE (TSH)','FREE THYROID 3','FREE THYROID 4','Human Chorionic Gonadotropin (hCG)','BILRUBIN'])]

df_s3=df_s2.set_index(['ID', 'Name', 'Age', 'Sex', 'CT', 'RT', 'Test', 'Test_Result', 'Units']).reset_index()

idx = pd.MultiIndex.from_arrays([df_s3['ID'], df_s3['Name'], df_s3['Age'], df_s3['Sex'], df_s3['CT'],df_s3['RT'], df_s3['Units'], df_s3['Test'],  ])
#, df_s3['Unit of Measure']
df_s5 = df_s3.set_index(idx).Test_Result.unstack(fill_value='')
df_s5.columns.name = None
df_s6= df_s5.reset_index()
df_s6.head(100)
Output:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-63-82c092f6cd99> in <module>
      1 idx = pd.MultiIndex.from_arrays([df_s3['ID'], df_s3['Name'], df_s3['Age'], df_s3['Sex'], df_s3['CT'],df_s3['RT'], df_s3['Units'], df_s3['Test'], ])
----> 2 df_s5 = df_s3.set_index(idx).Test_Result.unstack(fill_value='')
      3 df_s5.columns.name = None
      4 df_s6= df_s5.reset_index()
      5 df_s6.head(100)

C:\BhargaviM\MyAnaconda\lib\site-packages\pandas\core\series.py in unstack(self, level, fill_value)
   2897         """
   2898         from pandas.core.reshape.reshape import unstack
-> 2899         return unstack(self, level, fill_value)
   2900
   2901     # ----------------------------------------------------------------------

C:\BhargaviM\MyAnaconda\lib\site-packages\pandas\core\reshape\reshape.py in unstack(obj, level, fill_value)
    499             unstacker = _Unstacker(obj.values, obj.index, level=level,
    500                                    fill_value=fill_value,
--> 501                                    constructor=obj._constructor_expanddim)
    502             return unstacker.get_result()
    503

C:\BhargaviM\MyAnaconda\lib\site-packages\pandas\core\reshape\reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
    135
    136         self._make_sorted_values_labels()
--> 137         self._make_selectors()
    138
    139     def _make_sorted_values_labels(self):

C:\BhargaviM\MyAnaconda\lib\site-packages\pandas\core\reshape\reshape.py in _make_selectors(self)
    173
    174         if mask.sum() < len(self.index):
--> 175             raise ValueError('Index contains duplicate entries, '
    176                              'cannot reshape')
    177

ValueError: Index contains duplicate entries, cannot reshape
Question 1 (Retry 1# with TSH): Please help me with the correct approach. As I understand it, the error occurs because the index is no longer unique once the data is converted, but I am not sure how to resolve it.
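One way to see which rows trigger this error is to list the key combinations that occur more than once. The sketch below uses made-up toy data (the IDs, units and values are assumptions, not taken from the dataset above) and `DataFrame.duplicated` with `keep=False`, which flags every member of a repeated group:

```python
import pandas as pd

# Toy data, invented for illustration: TSH appears twice for Re002
# with identical key columns.
df = pd.DataFrame({
    "ID": ["Re001", "Re001", "Re002", "Re002"],
    "Units": ["ng/dl", "μg/dl", "uIU/ml", "uIU/ml"],
    "Test": ["T3", "T4", "TSH", "TSH"],
    "Test_Result": [0.3, 0.4, 4.0, 4.2],
})

# The columns that would form the MultiIndex
keys = ["ID", "Units", "Test"]

# keep=False marks all rows of a duplicated group, not only the repeats
dupes = df[df.duplicated(subset=keys, keep=False)]
print(dupes)

# unstack() needs each key combination to occur exactly once
try:
    df.set_index(keys)["Test_Result"].unstack()
    unstack_failed = False
except ValueError as exc:
    unstack_failed = True
    print(exc)
```

If `dupes` is non-empty for the real data, those rows are the ones that need to be dropped, aggregated, or told apart by an extra key before reshaping.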

Question 2: When I went ahead without TSH, after converting the Test column's rows to columns I get blank values in some test columns (for example, the T4 column). This happens for two reasons: 1) the person took the test but the dataset has no value for it; Python treats this as a null value, which can be imputed or rejected, so that is not an issue; 2) the patient did not take this test but took at least one other test (T3, hCG, etc.); this case shows up as the empty string ''. I want to exclude these rows from my analysis. Is there an approach, while transforming the data, so that the result contains only T4 and its value (numeric or null), and never the case where the person has not taken the test at all? Or is there a way to impute these values so that I know the person has taken T4 and T3 but not hCG, bilirubin, etc.?
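A possible way to separate the two cases is to build a second wide table that records only whether a row for the test exists. This is a sketch on invented toy data (the IDs and values are assumptions): `pd.crosstab` counts rows per ID and test regardless of whether the result is null, so it can act as a "test taken" flag next to the pivoted values.

```python
import numpy as np
import pandas as pd

# Toy long-format data, invented for illustration:
# Re002 took T4 but the result is missing; Re003 never took T4.
df = pd.DataFrame({
    "ID": ["Re001", "Re001", "Re002", "Re002", "Re003"],
    "Test": ["T3", "T4", "T3", "T4", "T3"],
    "Test_Result": [0.3, 0.4, 0.5, np.nan, 0.6],
})

# Wide table of values: NaN both for "no row" and for "row without a value"
values = df.set_index(["ID", "Test"])["Test_Result"].unstack()

# Wide table of flags: True only where a row for that test exists
taken = pd.crosstab(df["ID"], df["Test"]).gt(0)

# Keep only patients who actually took T4 (the value may still be NaN)
t4 = values.loc[taken["T4"], ["T4"]]
print(t4)
```

The `taken` table also covers the second half of the question: each row shows which tests a patient has and has not taken.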

Please advise. These are long questions, but I hope they are self-explanatory.
#2
On lines 6 and 7, the only column that is different is THYROXINE and (T4).
Thus a duplicate index.
#3
(Oct-13-2019, 06:21 PM)Larz60+ Wrote: On lines 6 and 7, the only column that is different is THYROXINE and (T4). Thus a duplicate index.
Thank you Larz60+! I think ID is not unique enough to rely on, so I added a new ID1 column with unique values by adding this code
to the initial dataframe (the first one after importing the data); the rest of the code is the same:

df['ID1'] = range(1, len(df.index)+1)
Output:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-504-78aab1b20f9e> in <module>
----> 1 idx = pd.MultiIndex.from_arrays(df_s3['ID1'] ,[df_s3['ID'], df_s3['Age'], df_s3['Sex'], df_s3['Collected Time'],df_s3['Received Time'], df_s3['Units'], df_s3['Test'], ])
      2 # #, df_s3['Units']
      3 df_s5 = df_s3.set_index(idx).Test_Result.unstack(fill_value='')
      4 df_s5.columns.name = None
      5 df_s6= df_s5.reset_index()

C:\BhargaviM\MyAnaconda\lib\site-packages\pandas\core\indexes\multi.py in from_arrays(cls, arrays, sortorder, names)
   1267         # raise ValueError, if not
   1268         for i in range(1, len(arrays)):
-> 1269             if len(arrays[i]) != len(arrays[i - 1]):
   1270                 raise ValueError('all arrays must be same length')
   1271

TypeError: object of type 'numpy.int64' has no len()
Can you please help?
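The traceback shows df_s3['ID1'] passed as a separate first argument, so `from_arrays` takes that Series as its `arrays` parameter and calls `len()` on its individual integers. A minimal sketch of the call with every level inside one list, on invented toy data (column values are assumptions):

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({
    "ID1": [1, 2, 3],
    "ID": ["Re001", "Re001", "Re002"],
    "Test": ["T3", "T4", "TSH"],
})

# All levels go inside ONE list; each element is a full column
idx = pd.MultiIndex.from_arrays([df["ID1"], df["ID"], df["Test"]])

# Level names are picked up from the Series names
print(idx.names)
```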
#4
Need more information if you want detailed help.
  • Source of original data, or pre-processed data (including 'ID1')
  • Enough code to be able to run from origin to error point
#5
Here are a few records of the data:
Data-Click here

The data is loaded into the 'df_s' dataframe. After that I create a new column:
df_s['ID1'] = range(1, len(df_s.index)+1)
df_s2 = df_s[df_s['Test'].isin(['TOTAL TRIIODOTHYRONINE (T3)','TOTAL THYROXINE (T4)','THYROID STIMULATING HORMONE (TSH)','FREE THYROID 3','FREE THYROID 4','Human Chorionic Gonadotropin (hCG)','BILRUBIN'])]
 
df_s3=df_s2.set_index(['ID', 'Name', 'Age', 'Sex', 'CT', 'RT', 'Test', 'Test_Result', 'Units']).reset_index()
 
idx = pd.MultiIndex.from_arrays([df_s3['ID'], df_s3['Name'], df_s3['Age'], df_s3['Sex'], df_s3['CT'],df_s3['RT'], df_s3['Units'], df_s3['Test'],  ])
#, df_s3['Unit of Measure']
df_s5 = df_s3.set_index(idx).Test_Result.unstack(fill_value='')
df_s5.columns.name = None
df_s6= df_s5.reset_index()
df_s6.head(100)
Thank you for your response. I have added the full code and the data used. I hope this helps. Please let me know if you need more details from me.
#6
Got the data.

The code is not runnable as presented; I don't have time to figure it out unless it is.
#7
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
from IPython.display import display
import re
import datetime
from _datetime import *
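# Placeholder path from the original post; point this at the real workbook
df = pd.read_excel(r'<file path and filename.xlsx')
# 'ID1' is created on the next step, so it cannot be selected here;
# 'Name' and 'Test_Result' are selected because the later steps use them
df_s = df[['ID', 'Name', 'Age', 'Sex', 'CT', 'RT', 'Test', 'Test_Result', 'Units']].copy()
# Unique running number for every row
df_s['ID1'] = range(1, len(df_s.index)+1)
df_s2 = df_s[df_s['Test'].isin(['TOTAL TRIIODOTHYRONINE (T3)','TOTAL THYROXINE (T4)','THYROID STIMULATING HORMONE (TSH)','FREE THYROID 3','FREE THYROID 4','Human Chorionic Gonadotropin (hCG)','BILRUBIN'])]

df_s3=df_s2.set_index(['ID1','ID', 'Name', 'Age', 'Sex', 'CT', 'RT', 'Test', 'Test_Result', 'Units']).reset_index()

# All levels go in one flat list (the stray '[' after df_s3['ID1'] is removed)
idx = pd.MultiIndex.from_arrays([df_s3['ID1'], df_s3['ID'], df_s3['Name'], df_s3['Age'], df_s3['Sex'], df_s3['CT'],df_s3['RT'], df_s3['Units'], df_s3['Test'],  ])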
#, df_s3['Unit of Measure']
df_s5 = df_s3.set_index(idx).Test_Result.unstack(fill_value='')
df_s5.columns.name = None
df_s6= df_s5.reset_index()
df_s6.head(100) 
Sorry, I just realised the new ID1 column was not added in the code. Please try now if possible.

I think I understand the error.

The original error says there are duplicates because the index values are not unique, so I created a unique column, ID1.

Say the data is like this:
ID1 ID Test Test_Result
1 Re001 T3 0.3
2 Re001 T4 0.4
3 Re002 TSH 4

Now, after transforming, maybe it cannot determine which value of ID1 to pick: for Re001, should it be 1 or 2? I am not sure whether this is the cause of the error, but it appears to be. I am also not sure how to solve the original error. Is there any other technique that we can apply?

ID1 ID T3 T4 TSH
? Re001 0.3 0.4
2 Re002 4.0
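A small experiment on the three-row example above suggests what happens when a unique ID1 is part of the index. This is a sketch, with ID1 set to 1, 2, 3 as in the first table; the reduced index (only ID1, ID and Test) is a simplification of the full code:

```python
import pandas as pd

df = pd.DataFrame({
    "ID1": [1, 2, 3],
    "ID": ["Re001", "Re001", "Re002"],
    "Test": ["T3", "T4", "TSH"],
    "Test_Result": [0.3, 0.4, 4.0],
})

# With the unique ID1 in the index, every original row stays its own row
with_id1 = df.set_index(["ID1", "ID", "Test"])["Test_Result"].unstack()

# Without ID1, rows that share an ID collapse into a single row
without_id1 = df.set_index(["ID", "Test"])["Test_Result"].unstack()

print(with_id1)
print(without_id1)
```

So unstack does not have to choose between ID1 values 1 and 2: both are kept as separate rows, each with a single filled test column. That avoids the duplicate error, but T3 and T4 for Re001 no longer share a row.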

@Larz60+ Thank you very much for your help so far. I appreciate the time you have spent looking into this.
#8
I have to go out to run an errand, will try when I get back.
#9
Thank you very much for taking the time to look into this. I have resolved the issue; it had been bothering me for the last 2 weeks.
Here is what I changed: instead of adding the unique ID1 column at the beginning, I added it after selecting the required rows from the Test column (df_s2). This resolved the issue.


# Placeholder path from the original post; point this at the real workbook
df = pd.read_excel(r'<file path and filename.xlsx')
# 'Name' and 'Test_Result' are selected because the later steps use them
df_s = df[['ID', 'Name', 'Age', 'Sex', 'CT', 'RT', 'Test', 'Test_Result', 'Units']].copy()
# .copy() so that adding a column does not write to a slice of df_s
df_s2 = df_s[df_s['Test'].isin(['TOTAL TRIIODOTHYRONINE (T3)','TOTAL THYROXINE (T4)','THYROID STIMULATING HORMONE (TSH)','FREE THYROID 3','FREE THYROID 4','Human Chorionic Gonadotropin (hCG)','BILRUBIN'])].copy()
# Unique running number, assigned after the rows are filtered
df_s2['ID1'] = range(1, len(df_s2.index)+1)

# All levels go in one flat list (the stray '[' after df_s2['ID1'] is removed)
idx = pd.MultiIndex.from_arrays([df_s2['ID1'], df_s2['ID'], df_s2['Name'], df_s2['Age'], df_s2['Sex'], df_s2['CT'],df_s2['RT'], df_s2['Units'], df_s2['Test'],  ])

df_s5 = df_s2.set_index(idx).Test_Result.unstack(fill_value='')
df_s5.columns.name = None
df_s6= df_s5.reset_index()
df_s6.head(100)
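As an alternative to a surrogate ID1 column, `pivot_table` can resolve repeated keys with an aggregation function, which keeps all tests of one requisition on a single row. The sketch below is not the approach used in this thread; the toy data is invented, the index is reduced to ID only, and choosing `first` as the aggregation is an assumption (`last` or a mean may suit better depending on what a repeated test means):

```python
import pandas as pd

# Toy data, invented for illustration: TSH is repeated for Re002
df = pd.DataFrame({
    "ID": ["Re001", "Re001", "Re002", "Re002"],
    "Test": ["T3", "T4", "TSH", "TSH"],
    "Test_Result": [0.3, 0.4, 4.0, 4.2],
})

# aggfunc="first" keeps the first result when a key combination repeats
wide = df.pivot_table(index="ID", columns="Test",
                      values="Test_Result", aggfunc="first")
wide.columns.name = None
wide = wide.reset_index()
print(wide)
```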
#10
Glad to hear all is well