Python Forum
Feature Selection with different units and strings
#1
Hi, I am new to machine learning and I am working on predicting patient outcomes from their chart data. I am having a hard time with feature selection because the columns have different units, and some contain strings. Here is a small sample of the data.

Age 85,89,79
Cough yes,no,yes
WBC 24,29,89
O2Sat 57%,90%,85%

All 4 of these columns are different. Most of my data is yes/no, so at first I tried binary coding and changed all the yes/no values to 0s and 1s. This worked, but feature selection would also choose features like Age, because its average was always high since it was not binary. I was using scikit-learn, but it only works with numeric values. Is there another program I could use to help with my feature selection?
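For reference, the binary coding I tried looks roughly like this (a toy DataFrame built from the sample above; I also strip the % sign so O2Sat becomes numeric):

```python
import pandas as pd

# Toy frame matching the sample above
df = pd.DataFrame({
    "Age": [85, 89, 79],
    "Cough": ["yes", "no", "yes"],
    "WBC": [24, 29, 89],
    "O2Sat": ["57%", "90%", "85%"],
})

# Map yes/no strings to 1/0
df["Cough"] = df["Cough"].map({"yes": 1, "no": 0})

# Strip the percent sign so O2Sat becomes a float column
df["O2Sat"] = df["O2Sat"].str.rstrip("%").astype(float)

print(df.dtypes)
```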
#2
How is the data stored? For example, in a text file with newline separation, or possibly a spreadsheet, etc.

Knowing this helps determine the best way to represent the data in Python. I expect that a dictionary, saved as a JSON file, would be a good choice. Here's an example with your data.

Steps:
  • create PatientDataPath.py
  • create CreatePatientDataJson.py
  • Run CreatePatientDataJson to create json file:
    python CreatePatientDataJson.py
  • create and run TestPatientDataJsonFile.py
First a program to set up paths:
Name this: PatientDataPath.py:
from pathlib import Path
import os


class PatientDataPath:
    def __init__(self):
        # Change the working directory to this script's directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.homepath = Path('.')
        # then create a patient data path:
        self.datapath = self.homepath / 'data'
        self.datapath.mkdir(exist_ok=True)

        self.patientdatafile = self.datapath / 'PatientData.json'

if __name__ == '__main__':
    PatientDataPath()
Next, a program to create the JSON data file.
Name this: CreatePatientDataJson.py:
import json
from PatientDataPath import PatientDataPath


class CreatePatientDataJson:
    def __init__(self):
        ppath = PatientDataPath()
        patient_data = {
            'patient1': {
                'Age': 85,
                'Cough': 'Yes',
                'WBC': 24,
                'O2Sat_percentage': 57
            },
            'patient2': {
                'Age': 89,
                'Cough': 'No',
                'WBC': 29,
                'O2Sat_percentage': 90
            },
            'patient3': {
                'Age': 79,
                'Cough': 'Yes',
                'WBC': 89,
                'O2Sat_percentage': 85
            }
        }

        with ppath.patientdatafile.open('w') as fp:
            json.dump(patient_data, fp)

if __name__ == '__main__':
    CreatePatientDataJson()
Then a program to test the JSON file (which also shows how to read the data).
Name this: TestPatientDataJsonFile.py:
import json
from PatientDataPath import PatientDataPath


class TestPatientDataJsonFile:
    def __init__(self):
        self.ppath = PatientDataPath()
        self.PatientData = self.get_patient_data()

    def get_patient_data(self):
        with self.ppath.patientdatafile.open() as fp:
            return json.load(fp)

    def show_data(self):
        for patient, pdata in self.PatientData.items():
            print(f"\nPatient: {patient}")
            print(f"    Age: {pdata['Age']}")
            print(f"    Cough: {pdata['Cough']}")
            print(f"    WBC: {pdata['WBC']}")
            print(f"    O2Sat: {pdata['O2Sat_percentage']}")

def main():
    tpjf = TestPatientDataJsonFile()
    tpjf.show_data()

if __name__ == '__main__':
    main()
Results:

Patient: patient1
    Age: 85
    Cough: Yes
    WBC: 24
    O2Sat: 57

Patient: patient2
    Age: 89
    Cough: No
    WBC: 29
    O2Sat: 90

Patient: patient3
    Age: 79
    Cough: Yes
    WBC: 89
    O2Sat: 85
#3
Me, I would use Pandas, TensorFlow, and Keras. The example code below is from a project I did.
Say you store your data in an Excel-style spreadsheet. Pandas can read that and create a DataFrame from it. Think of a DataFrame as basically a super spreadsheet.
import pandas as pd

column_names = ['Rank','Interview_Score','Scale','Name','ID','Degree','MedSchool','Zone','Grad','Step1','Step2','Sex','Couples','Couple_Prog','NA','Comment']
df = pd.read_excel(r'/content/drive/My Drive/ia1.xlsx', header=None)
df.columns = column_names
Use map on your DataFrame to encode categorical strings as numbers (this is a simple label encoding; true one-hot encoding would give each category its own column):
df['Sex'] = df['Sex'].map(lambda x: {'M': 10, 'F': 11, 'm': 10, 'f':11}.get(x) )
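If you want genuine one-hot columns instead of numeric codes, pandas' get_dummies is one option. A sketch with made-up data, not from the original project:

```python
import pandas as pd

# Toy column with mixed-case category labels
df = pd.DataFrame({'Sex': ['M', 'F', 'f', 'm']})

# Normalize case, then expand into one indicator column per category
one_hot = pd.get_dummies(df['Sex'].str.upper(), prefix='Sex')
print(one_hot)
```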
Scale the larger numeric columns into a common range:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler() 
column_names_to_normalize = ['Interview_Score','Scale','Step1','Step2', 'Grad']
x = df[column_names_to_normalize].values
x_scaled = scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = df.index)
df[column_names_to_normalize] = df_temp
df = df.astype('float')
Then split the data into training and validation sets and do your analysis.
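That split can be sketched with scikit-learn's train_test_split; the feature matrix and labels below are random stand-ins for the normalized DataFrame columns, not the original project's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and binary labels standing in for the real columns
rng = np.random.default_rng(0)
X = rng.random((100, 5))       # e.g. the normalized score columns
y = rng.integers(0, 2, 100)    # e.g. a binary outcome per row

# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape)  # (80, 5) (20, 5)
```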