Posts: 6
Threads: 3
Joined: Mar 2021
Hi there,
This is a dataframe with glacial lakes and their volume:
import pandas as pd

data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
        'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
dataframe = pd.DataFrame(data, columns=['Volume', 'GlacierID'])
I would like to create a third column 'LakeID' assigning every value a new ID, which is composed of 'GlacierID' and an extra number according to the number of lakes. Ideally the largest volume gets number 1.
The output should look like this: (My real dataframe is much larger, so I can't do it manually)
Volume GlacierID LakeID
0 50 a a2
1 100 a a1
2 100 b b1
3 90 b b2
4 60 c c1
5 200 d d1
6 150 d d2
7 80 d d3
8 70 e e1
I really don't have any idea how to do this, so I can't include any further code of my own.
Thank you!
Posts: 218
Threads: 27
Joined: May 2018
Mar-27-2021, 09:05 PM
(This post was last modified: Mar-27-2021, 09:05 PM by eddywinch82.)
Hi timste,
Try these lines of code for the dataset you have provided:
import pandas as pd

data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
        'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
dataframe = pd.DataFrame(data, columns=['Volume', 'GlacierID', 'LakeID'])

# Label each row by matching on Volume and GlacierID.
# Note: str.contains() matches substrings ('50' also matches '150'), so the
# GlacierID filter is what keeps each assignment on the intended rows here.
dataframe.loc[dataframe['Volume'].str.contains('50') & dataframe['GlacierID'].str.contains('a'), 'LakeID'] = 'a2'
# Two rows match this mask (glaciers a and b), so two labels are assigned at once.
dataframe.loc[dataframe['Volume'].str.contains('100') & dataframe['GlacierID'].str.contains('a|b'), 'LakeID'] = 'a1', 'b1'
dataframe.loc[dataframe['Volume'].str.contains('90') & dataframe['GlacierID'].str.contains('b'), 'LakeID'] = 'b2'
dataframe.loc[dataframe['Volume'].str.contains('60') & dataframe['GlacierID'].str.contains('c'), 'LakeID'] = 'c1'
dataframe.loc[dataframe['Volume'].str.contains('200') & dataframe['GlacierID'].str.contains('d'), 'LakeID'] = 'd1'
dataframe.loc[dataframe['Volume'].str.contains('150') & dataframe['GlacierID'].str.contains('d'), 'LakeID'] = 'd2'
dataframe.loc[dataframe['Volume'].str.contains('80') & dataframe['GlacierID'].str.contains('d'), 'LakeID'] = 'd3'
dataframe.loc[dataframe['Volume'].str.contains('70') & dataframe['GlacierID'].str.contains('e'), 'LakeID'] = 'e1'
dataframe
Those lines of code give the DataFrame output you want.
I will try to come up with a better way using np.where, because that would take fewer lines of code than typing out many assignments like I have, especially if your dataset really is much larger.
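In the meantime, here is a rough sketch of a shorter, groupby-based alternative (it ends up not using np.where, and I have only checked it against the sample data you posted): rank each glacier's lakes by volume, largest first, and append the rank to the GlacierID.
import pandas as pd

data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
        'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
dataframe = pd.DataFrame(data, columns=['Volume', 'GlacierID'])

# Rank the lakes within each glacier by volume, largest first (rank 1).
rank = (dataframe['Volume'].astype(int)
        .groupby(dataframe['GlacierID'])
        .rank(method='first', ascending=False)
        .astype(int))

# The LakeID is just the GlacierID with that rank appended.
dataframe['LakeID'] = dataframe['GlacierID'] + rank.astype(str)
dataframe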
Best Regards
Eddie Winch
Posts: 6
Threads: 3
Joined: Mar 2021
Hi Eddie
Thank you for your reply. Unfortunately, my real dataframe contains 3600 rows. Here is a brief look:
ObjectID Vol_Min Vol_Max Vol_Range Vol_Mean Vol_Std \
0 1 68 68 0 68.237305 0.000000
1 2 21 65 44 43.237305 22.143555
2 3 16 105 89 62.713623 33.093563
3 4 0 350 350 147.524007 110.684999
4 5 20 20 0 20.117188 0.000000
... ... ... ... ... ... ...
3595 3596 22 332 311 152.536350 72.093350
3596 3597 76 76 0 76.196289 0.000000
3597 3598 3 3 0 2.514648 0.000000
3598 3599 12 183 171 87.805176 58.601078
3599 3600 2 2 0 2.441406 0.000000
Volume Tiefe_Min Tiefe_Max Tiefe_Range Tiefe_Mean Tiefe_Std \
0 68.237305 1 1 0 0.682373 0.000000
1 86.474609 0 1 0 0.432373 0.221436
2 250.854492 0 1 1 0.627136 0.330936
3 3540.576172 0 4 3 1.475240 1.106850
4 20.117188 0 0 0 0.201172 0.000000
... ... ... ... ... ... ...
3595 2745.654297 0 3 3 1.525363 0.720934
3596 76.196289 1 1 0 0.761963 0.000000
3597 2.514648 0 0 0 0.025146 0.000000
3598 1229.272461 0 2 2 0.878052 0.586011
3599 2.441406 0 0 0 0.024414 0.000000
Elevation Glacier_ID A_sgi2016 h_mean h_max V_year \
0 2335.817871 A50j / 13 0.1625 23.590000 46.389999 0.003862
1 2346.285645 A50j / 13 0.1625 23.590000 46.389999 0.003862
2 2494.893555 A50j / 13 0.1625 23.590000 46.389999 0.003862
3 2734.219727 A50k / 04 1.4310 46.259998 132.179993 0.066240
4 2725.788086 A50k / 04 1.4310 46.259998 132.179993 0.066240
... ... ... ... ... ... ...
3595 2526.086426 B85 / 01 0.4153 23.110001 37.660000 0.009572
3596 2529.571289 B85 / 01 0.4153 23.110001 37.660000 0.009572
3597 2627.306152 B85 / 01 0.4153 23.110001 37.660000 0.009572
3598 2529.571289 B85 / 01 0.4153 23.110001 37.660000 0.009572
3599 2597.728027 B85 / 01 0.4153 23.110001 37.660000 0.009572
V_2016 Shape_Length Shape_Area xy
0 0.003141 40.0 100.0 A50j / 13
1 0.003141 60.0 200.0 A50j / 13
2 0.003141 100.0 400.0 A50j / 13
3 0.056820 260.0 2400.0 A50k / 04
4 0.056820 40.0 100.0 A50k / 04
... ... ... ... ...
3595 0.009572 280.0 1800.0 B85 / 01
3596 0.009572 40.0 100.0 B85 / 01
3597 0.009572 40.0 100.0 B85 / 01
3598 0.009572 180.0 1400.0 B85 / 01
3599 0.009572 40.0 100.0 B85 / 01
[ 3600 rows x 22 columns]
The two columns regarding the LakeID are column [7] 'Volume' and column [14] 'Glacier_ID'.
Thank you and have a nice weekend!
Tim
Posts: 1,358
Threads: 2
Joined: May 2019
This gets you tags for lake IDs by combining the glacier and the volume. Sorting, re-indexing, and then combining the glacier with the new index into an ID is another idea; a rough sketch of that direction follows the code below.
import pandas as pd

data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
        'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
df = pd.DataFrame(data, columns=['Volume', 'GlacierID'])

# Tag each lake with glacier + volume, then sort by glacier and volume.
# Note: Volume holds strings here, so this sort is lexicographic, not numeric.
df['LakeID'] = df['GlacierID'] + df['Volume']
df = df.sort_values(['GlacierID', 'Volume'])
df.head(10)
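A rough sketch of that second idea (numbering the lakes within each glacier via cumcount rather than a full re-index; it assumes Volume can be cast to a number and has only been checked against the small sample data, not the real 3600-row frame):
import pandas as pd

data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
        'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
df = pd.DataFrame(data, columns=['Volume', 'GlacierID'])

# Sort numerically so the largest lake of each glacier comes first ...
df['Volume'] = df['Volume'].astype(int)
df = df.sort_values(['GlacierID', 'Volume'], ascending=[True, False])

# ... then number the lakes within each glacier (cumcount starts at 0)
# and glue that number onto the GlacierID.
df['LakeID'] = df['GlacierID'] + (df.groupby('GlacierID').cumcount() + 1).astype(str)

df = df.sort_index()  # restore the original row order
print(df)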
Posts: 4
Threads: 0
Joined: Apr 2021
import pandas as pd
import numpy

data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
        'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
df = pd.DataFrame(data, columns=['Volume', 'GlacierID'])

# Sort numerically within each glacier, largest volume first, so the largest
# lake ends up with number 1.
df['Volume'] = df['Volume'].astype(int)
df.sort_values(by=['GlacierID', 'Volume'], ascending=[True, False], inplace=True)

# Number the lakes 1..n inside each glacier and append that to the GlacierID.
seq = []
for seq_no in df.groupby('GlacierID').apply(lambda g: numpy.arange(1, len(g) + 1)):
    seq.extend(seq_no)
df['seq'] = seq
df['LakeID'] = df['GlacierID'] + df['seq'].astype(str)
df.drop(columns=['seq'], inplace=True)
print(df)
I hope this solution works for you. Let me know if you need any further assistance.
Posts: 6
Threads: 3
Joined: Mar 2021
Hi there
Thank you for your idea!
I recently solved the problem using a for loop with an empty ID column and "if current_id == old_id then id = id + 1", writing the values back with iloc.
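Roughly, it looked something like this (reconstructed from memory rather than my exact code, and only checked against the small sample from my first post):
import pandas as pd

data = {'Volume': ['50', '100', '100', '90', '60', '200', '150', '80', '70'],
        'GlacierID': ['a', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'e']}
df = pd.DataFrame(data, columns=['Volume', 'GlacierID'])

# Make Volume numeric and sort so the largest lake of each glacier comes first.
df['Volume'] = df['Volume'].astype(int)
df = df.sort_values(['GlacierID', 'Volume'], ascending=[True, False])

# Walk the rows and restart the counter whenever the GlacierID changes.
df['LakeID'] = ''
old_id, counter = None, 0
for i in range(len(df)):
    current_id = df.iloc[i]['GlacierID']
    counter = counter + 1 if current_id == old_id else 1
    df.iloc[i, df.columns.get_loc('LakeID')] = current_id + str(counter)
    old_id = current_id
print(df)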
Your idea seems more elegant though!
Have a nice Easter weekend!
Tim
Posts: 4
Threads: 0
Joined: Apr 2021
appreciated