Python Forum

Full Version: Inconsistent sorting with the .sort_values() function
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Hi,

I am trying to sort rows of data in a pandas Dataframe using the .sort_values() function. The goal is to change the order so that the numbers are displayed from lowest to highest. When I select the column that I want (Inflation Yearly Basis) it reorders it in a randomized way. However, when I select another column from the same Dataframe (Inflation Monthly Basis) that has the same type of data, it displays properly from lowest to highest.

Appreciate any help –– thanks!

Here's the code:

import pandas as pd
import requests
import json
import pandas_datareader.data as webdata
import matplotlib as mpl
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import datetime
from datetime import date
import numpy as np
from bs4 import BeautifulSoup


url = 'https://www.global-rates.com/en/economic-indicators/inflation/consumer-prices/cpi/cpi.aspx'
r  = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
data = []
table = soup.find('table', cellpadding="2")
table_body = table 

rows = table_body.find_all('tr', {'class': lambda x : x.startswith('tabledata')})
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values

global_cpi_rates = pd.DataFrame (data, columns = ['Name','Country/Region','Period','Inflation Monthly Basis','Inflation Yearly Basis'])    
global_cpi_rates = global_cpi_rates.sort_values('Inflation Yearly Basis')
global_cpi_rates
An example would be nice. My guess without any data is that your inflation numbers are strings, not floats. Sorting numerical strings gives funny results like 1000 is less than 2 because in the ascii table "1" is lower than "2".
Finally found the attachment function. Please see attachments for data examples.
Thanks again. I looked into what you said and you are correct. I changed the column to a numeric dtype using apply(pd.to_numeric) and was able to sort it correctly.

The only issue now is that the column name has been removed (assuming because it is no longer a string?) so I can't figure out how to combine the two columns (Country/Region, Inflation Yearly Basis) back into a dataframe so I can plot it. Any ideas?

I've attached the code.
I would do the conversion when reading the csv file.
import pandas as pd
import io

data = io.StringIO(
    """
Name, Percent
A ,5%
B, 100%
C, 20%
D ,3%
"""
)

def percent_to_float(valuestr):
    """Convert percent to floats.  Strip trailing %, convert to float and divide by 100"""
    return float(valuestr.replace("%", "")) / 100.0

df = pd.read_csv(data, converters={"Name": str.strip, "Percent": percent_to_float})
print(df.sort_values(by=["Percent"]))
Output:
Name Percent 3 D 0.03 0 A 0.05 2 C 0.20 1 B 1.00