Posts: 6
Threads: 3
Joined: Mar 2021
Hi there,
This is a dataframe with glacial lakes and their volume:
data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
dataframe = pd.DataFrame(data, columns=['Volume', 'GlacierID'])
I would like to create a third column 'LakeID' that assigns every lake a new ID composed of its 'GlacierID' plus a number based on how many lakes that glacier has. Ideally the largest volume gets number 1.
The output should look like this: (My real dataframe is much larger, so I can't do it manually)
Volume GlacierID LakeID
0 50 a a2
1 100 a a1
2 100 b b1
3 90 b b2
4 60 c c1
5 200 d d1
6 150 d d2
7 80 d d3
8 70 e e1
I really don't have any idea how to do this, so I cannot include any further code.
Thank you!
Posts: 218
Threads: 27
Joined: May 2018
Mar-27-2021, 09:05 PM
(This post was last modified: Mar-27-2021, 09:05 PM by eddywinch82.)
Hi timste,
Try these lines of code for the dataset you have provided:
import pandas as pd
data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
dataframe = pd.DataFrame(data, columns=['Volume', 'GlacierID', 'LakeID'])
# Fill 'LakeID' row by row by matching on the Volume and GlacierID values.
# Note: str.contains() matches substrings (e.g. '150' also contains '50'),
# so the GlacierID condition is what keeps those rows apart.
dataframe.loc[dataframe['Volume'].str.contains('50') & dataframe['GlacierID'].str.contains('a'), 'LakeID'] = 'a2'
dataframe.loc[dataframe['Volume'].str.contains('100') & dataframe['GlacierID'].str.contains('a|b'), 'LakeID'] = ['a1', 'b1']
dataframe.loc[dataframe['Volume'].str.contains('90') & dataframe['GlacierID'].str.contains('b'), 'LakeID'] = 'b2'
dataframe.loc[dataframe['Volume'].str.contains('60') & dataframe['GlacierID'].str.contains('c'), 'LakeID'] = 'c1'
dataframe.loc[dataframe['Volume'].str.contains('200') & dataframe['GlacierID'].str.contains('d'), 'LakeID'] = 'd1'
dataframe.loc[dataframe['Volume'].str.contains('150') & dataframe['GlacierID'].str.contains('d'), 'LakeID'] = 'd2'
dataframe.loc[dataframe['Volume'].str.contains('80') & dataframe['GlacierID'].str.contains('d'), 'LakeID'] = 'd3'
dataframe.loc[dataframe['Volume'].str.contains('70') & dataframe['GlacierID'].str.contains('e'), 'LakeID'] = 'e1'
dataframe
Those lines of code give the DataFrame output you want.
I will try to come up with a better way using np.where, since that would take fewer lines of code than typing out many lines like I have here, especially as your dataset is actually much larger.
Best Regards
Eddie Winch
Posts: 6
Threads: 3
Joined: Mar 2021
Hi Eddie
Thank you for your reply. Unfortunately, my real dataframe contains 3600 rows. Here is a brief look:
ObjectID Vol_Min Vol_Max Vol_Range Vol_Mean Vol_Std \
0 1 68 68 0 68.237305 0.000000
1 2 21 65 44 43.237305 22.143555
2 3 16 105 89 62.713623 33.093563
3 4 0 350 350 147.524007 110.684999
4 5 20 20 0 20.117188 0.000000
... ... ... ... ... ... ...
3595 3596 22 332 311 152.536350 72.093350
3596 3597 76 76 0 76.196289 0.000000
3597 3598 3 3 0 2.514648 0.000000
3598 3599 12 183 171 87.805176 58.601078
3599 3600 2 2 0 2.441406 0.000000
Volume Tiefe_Min Tiefe_Max Tiefe_Range Tiefe_Mean Tiefe_Std \
0 68.237305 1 1 0 0.682373 0.000000
1 86.474609 0 1 0 0.432373 0.221436
2 250.854492 0 1 1 0.627136 0.330936
3 3540.576172 0 4 3 1.475240 1.106850
4 20.117188 0 0 0 0.201172 0.000000
... ... ... ... ... ... ...
3595 2745.654297 0 3 3 1.525363 0.720934
3596 76.196289 1 1 0 0.761963 0.000000
3597 2.514648 0 0 0 0.025146 0.000000
3598 1229.272461 0 2 2 0.878052 0.586011
3599 2.441406 0 0 0 0.024414 0.000000
Elevation Glacier_ID A_sgi2016 h_mean h_max V_year \
0 2335.817871 A50j/13 0.1625 23.590000 46.389999 0.003862
1 2346.285645 A50j/13 0.1625 23.590000 46.389999 0.003862
2 2494.893555 A50j/13 0.1625 23.590000 46.389999 0.003862
3 2734.219727 A50k/04 1.4310 46.259998 132.179993 0.066240
4 2725.788086 A50k/04 1.4310 46.259998 132.179993 0.066240
... ... ... ... ... ... ...
3595 2526.086426 B85/01 0.4153 23.110001 37.660000 0.009572
3596 2529.571289 B85/01 0.4153 23.110001 37.660000 0.009572
3597 2627.306152 B85/01 0.4153 23.110001 37.660000 0.009572
3598 2529.571289 B85/01 0.4153 23.110001 37.660000 0.009572
3599 2597.728027 B85/01 0.4153 23.110001 37.660000 0.009572
V_2016 Shape_Length Shape_Area xy
0 0.003141 40.0 100.0 A50j/13
1 0.003141 60.0 200.0 A50j/13
2 0.003141 100.0 400.0 A50j/13
3 0.056820 260.0 2400.0 A50k/04
4 0.056820 40.0 100.0 A50k/04
... ... ... ... ...
3595 0.009572 280.0 1800.0 B85/01
3596 0.009572 40.0 100.0 B85/01
3597 0.009572 40.0 100.0 B85/01
3598 0.009572 180.0 1400.0 B85/01
3599 0.009572 40.0 100.0 B85/01
[3600 rows x 22 columns]
The two columns relevant for the LakeID are column [7], 'Volume', and column [14], 'Glacier_ID'.
Thank you and have a nice weekend!
Tim
Posts: 1,358
Threads: 2
Joined: May 2019
This gets you tags for lake IDs by combining the glacier and the volume. Another idea is to sort, reindex, and then combine the glacier with the within-glacier index into an ID; a sketch of that follows the code below.
import pandas as pd
data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
df = pd.DataFrame(data, columns=['Volume', 'GlacierID'])
df['LakeID'] = df['GlacierID'] + df['Volume']
df = df.sort_values(['GlacierID', 'Volume'])
df.head(10)
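A minimal sketch of that second idea, using a per-glacier rank rather than an explicit sort and reindex (assuming, as in the example output, that the largest volume in each glacier should get number 1):
import pandas as pd
data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
        'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
df = pd.DataFrame(data, columns=['Volume', 'GlacierID'])
# Rank the lakes within each glacier by volume, largest first (rank 1),
# then build the LakeID as the glacier ID plus that rank.
rank = (df['Volume'].astype(int)
        .groupby(df['GlacierID'])
        .rank(method='first', ascending=False)
        .astype(int))
df['LakeID'] = df['GlacierID'] + rank.astype(str)
print(df)
This keeps the original row order and reproduces the LakeID column from the first post.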
Posts: 4
Threads: 0
Joined: Apr 2021
(Mar-27-2021, 10:52 AM)timste Wrote: [original question quoted in full]
Posts: 4
Threads: 0
Joined: Apr 2021
import pandas as pd
import numpy
data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
        'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
df = pd.DataFrame(data, columns=['Volume', 'GlacierID'])
# Sort by glacier and by numeric volume, largest first, so that the
# biggest lake in each glacier gets sequence number 1.
df['Volume'] = df['Volume'].astype(int)
df.sort_values(by=['GlacierID', 'Volume'], ascending=[True, False], inplace=True)
# Number the lakes 1..n within each glacier and append that to the glacier ID.
seq = []
for _, group in df.groupby('GlacierID'):
    seq.extend(numpy.arange(1, len(group) + 1))
df['seq'] = seq
df['LakeID'] = df['GlacierID'] + df['seq'].astype(str)
df.drop(columns=['seq'], inplace=True)
print(df)
I hope this solution will work. Let me know if you require any assistance.
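For the real dataframe, the same pattern should carry over to the actual column names. A sketch, assuming the full frame is already loaded as df, that its 'Volume' column is numeric there, and that the glacier column is called 'Glacier_ID' as in the sample output:
# Sketch for the 3600-row dataframe (assumed to be loaded as df,
# with a numeric 'Volume' column and a 'Glacier_ID' column).
df = df.sort_values(by=['Glacier_ID', 'Volume'], ascending=[True, False])
# cumcount() numbers the rows 0..n-1 within each glacier; +1 gives 1..n.
df['LakeID'] = df['Glacier_ID'] + (df.groupby('Glacier_ID').cumcount() + 1).astype(str)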
Posts: 6
Threads: 3
Joined: Mar 2021
Hi there
Thank you for your idea!
I recently sorted the problem out using a for-loop with an empty ID column: if current_id == old_id then id = id + 1 (otherwise the counter restarts at 1),
and I used iloc for the assignment; roughly like the sketch below.
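The loop looked something like this (a rough sketch on the toy data from my first post, not my exact code):
import pandas as pd
data = {'Volume': [50, 100, 100, 90, 60, 200, 150, 80, 70],
        'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
df = pd.DataFrame(data).sort_values(['GlacierID', 'Volume'],
                                    ascending=[True, False])
df['LakeID'] = ''   # empty ID column, filled row by row
old_id, counter = None, 0
for i in range(len(df)):
    current_id = df.iloc[i]['GlacierID']
    # same glacier as the previous row: keep counting, otherwise restart at 1
    counter = counter + 1 if current_id == old_id else 1
    df.iloc[i, df.columns.get_loc('LakeID')] = current_id + str(counter)
    old_id = current_id
print(df.sort_index())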
Your idea seems more elegant though!
Have a nice Easter weekend!
Tim
Posts: 4
Threads: 0
Joined: Apr 2021
(Apr-01-2021, 09:49 AM)timste Wrote: [reply quoted in full]
Posts: 4
Threads: 0
Joined: Apr 2021
appreciated