Python Forum
Feature Selection with different units and strings
#1
Hi, I am new to machine learning and I am working on predicting patient outcomes from their chart data. I am having a hard time with feature selection because the columns have different units, and some contain strings. Here is a small sample of the data.

Age 85,89,79
Cough yes,no,yes
WBC 24,29,89
O2Sat 57%,90%,85%

All 4 of these columns are different. Most of my data is yes/no, so at first I tried binary coding and changed all the yes/no values to 0s and 1s. This worked, but feature selection would also choose features like Age, because its average was always high since it was not binary. I was using scikit-learn, but it only works with numeric values. Is there another program I could use to help with my feature selection?
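For reference, the binary coding I tried looks roughly like this (a toy DataFrame built from the sample above; I also strip the % sign so O2Sat becomes numeric):

```python
import pandas as pd

# Toy frame matching the sample above
df = pd.DataFrame({
    "Age": [85, 89, 79],
    "Cough": ["yes", "no", "yes"],
    "WBC": [24, 29, 89],
    "O2Sat": ["57%", "90%", "85%"],
})

# Map yes/no strings to 1/0
df["Cough"] = df["Cough"].map({"yes": 1, "no": 0})

# Strip the percent sign so O2Sat becomes a float column
df["O2Sat"] = df["O2Sat"].str.rstrip("%").astype(float)

print(df.dtypes)
```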
#2
How is the data stored? For example, in a text file with newline separation, or possibly a spreadsheet, etc.

Knowing this helps determine the best way to represent the data in Python. I expect that a dictionary, saved as a JSON file, would be a good choice. Here's an example with your data.

Steps:
  • create PatientDataPath.py
  • create CreatePatientDataJson.py
  • Run CreatePatientDataJson to create json file:
    python CreatePatientDataJson.py
  • create and run TestPatientDataJsonFile.py
First a program to set up paths:
Name this: PatientDataPath.py:
from pathlib import Path
import os


class PatientDataPath:
    def __init__(self):
        # Change the working directory to this script's directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.homepath = Path('.')
        # then create a patient data path:
        self.datapath = self.homepath / 'data'
        self.datapath.mkdir(exist_ok=True)

        self.patientdatafile = self.datapath / 'PatientData.json'

if __name__ == '__main__':
    PatientDataPath()
Next, a program to create the JSON data file.
Name this: CreatePatientDataJson.py:
import json
from PatientDataPath import PatientDataPath


class CreatePatientDataJson:
    def __init__(self):
        ppath = PatientDataPath()
        patient_data = {
            'patient1': {
                'Age': 85,
                'Cough': 'Yes',
                'WBC': 24,
                'O2Sat_percentage': 57
            },
            'patient2': {
                'Age': 89,
                'Cough': 'No',
                'WBC': 29,
                'O2Sat_percentage': 90
            },
            'patient3': {
                'Age': 79,
                'Cough': 'Yes',
                'WBC': 89,
                'O2Sat_percentage': 85
            }
        }

        with ppath.patientdatafile.open('w') as fp:
            json.dump(patient_data, fp)

if __name__ == '__main__':
    CreatePatientDataJson()
Then a program to test the JSON file (which also shows how to read the data).
Name this: TestPatientDataJsonFile.py:
import json
from PatientDataPath import PatientDataPath


class TestPatientDataJsonFile:
    def __init__(self):
        self.ppath = PatientDataPath()
        self.PatientData = self.get_patient_data()

    def get_patient_data(self):
        with self.ppath.patientdatafile.open() as fp:
            return json.load(fp)

    def show_data(self):
        for patient, pdata in self.PatientData.items():
            print(f"\nPatient: {patient}")
            print(f"    Age: {pdata['Age']}")
            print(f"    Cough: {pdata['Cough']}")
            print(f"    WBC: {pdata['WBC']}")
            print(f"    O2Sat: {pdata['O2Sat_percentage']}")

def main():
    tpjf = TestPatientDataJsonFile()
    tpjf.show_data()

if __name__ == '__main__':
    main()
Results:

Patient: patient1
    Age: 85
    Cough: Yes
    WBC: 24
    O2Sat: 57

Patient: patient2
    Age: 89
    Cough: No
    WBC: 29
    O2Sat: 90

Patient: patient3
    Age: 79
    Cough: Yes
    WBC: 89
    O2Sat: 85
#3
Me, I would use Pandas, TensorFlow, and Keras. The example code below is from a project I did.
Say you store your data in an Excel-style spreadsheet. Pandas can read that and create a DataFrame from it. Think of a DataFrame as basically a super spreadsheet.
import pandas as pd

column_names = ['Rank','Interview_Score','Scale','Name','ID','Degree','MedSchool','Zone','Grad','Step1','Step2','Sex','Couples','Couple_Prog','NA','Comment']
df = pd.read_excel(r'/content/drive/My Drive/ia1.xlsx', header=None)
df.columns = column_names
Use map on your DataFrame to encode categorical strings as numbers (this is a simple label encoding; true one-hot encoding would give each category its own column):
df['Sex'] = df['Sex'].map(lambda x: {'M': 10, 'F': 11, 'm': 10, 'f':11}.get(x) )
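If you want genuine one-hot columns instead of numeric codes, pandas' get_dummies is one option. A sketch with made-up data, not from the original project:

```python
import pandas as pd

# Toy column with mixed-case category labels
df = pd.DataFrame({'Sex': ['M', 'F', 'f', 'm']})

# Normalize case, then expand into one indicator column per category
one_hot = pd.get_dummies(df['Sex'].str.upper(), prefix='Sex')
print(one_hot)
```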
Scale the larger numeric columns into a common range:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler() 
column_names_to_normalize = ['Interview_Score','Scale','Step1','Step2', 'Grad']
x = df[column_names_to_normalize].values
x_scaled = scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = df.index)
df[column_names_to_normalize] = df_temp
df = df.astype('float')
Then split the data into training and validation sets and do your analysis.
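That split can be sketched with scikit-learn's train_test_split; the feature matrix and labels below are random stand-ins for the normalized DataFrame columns, not the original project's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and binary labels standing in for the real columns
rng = np.random.default_rng(0)
X = rng.random((100, 5))       # e.g. the normalized score columns
y = rng.integers(0, 2, 100)    # e.g. a binary outcome per row

# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape)  # (80, 5) (20, 5)
```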