Python Forum

Pipelines for different processing steps
Hi all,
I have a dataset on which I want to try:

-different filtering techniques
-different transformations
-different machine learning techniques.

Is there a way in Python to set up all the different variations I want to try, and then have Python run all the possible combinations?

I would like to thank you in advance for your reply.
Regards
Alex
If you are planning to use scikit-learn, you can define your own preprocessing classes, e.g.

from sklearn.base import BaseEstimator, TransformerMixin


class Prep1(BaseEstimator, TransformerMixin):
    def __init__(self, par1=None):
        self.par1 = par1  # store the parameter as-is; GridSearchCV relies on this

    def fit(self, X, y=None):
        # learn whatever this step needs from the data
        ...
        return self

    def transform(self, X):
        # apply the filtering/transformation and return the result
        ...


class Prep2(BaseEstimator, TransformerMixin):
    def __init__(self, par1=None):
        self.par1 = par1  # or some value

    def fit(self, X, y=None):
        ...
        return self

    def transform(self, X):
        ...
Make a pipeline,

from sklearn.pipeline import Pipeline

my_pipe = Pipeline(steps=[('prep1', Prep1()),
                          ('prep2', Prep2()),
                          ...  # other steps go here; the last one is typically an estimator
                          ])
Finally, you can use GridSearchCV to try all possible parameter combinations, e.g.

from sklearn.model_selection import GridSearchCV

pgrid = {
    'prep1__par1': [1, 2, 3],
    'prep2__par1': [True, False],
    # maybe other pars for stages in the pipeline
}

search = GridSearchCV(my_pipe, pgrid, n_jobs=-1)
search.fit(X, y)
This is just pseudocode. As a starting point, you can look at the example in the official docs.
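For completeness, a short sketch of inspecting the results afterwards (this assumes the pipeline's final step is an actual estimator, e.g. a classifier, so that GridSearchCV has something to score):

# inspect the outcome of the fitted search
print(search.best_params_)  # best parameter combination found over pgrid
print(search.best_score_)   # mean cross-validated score of that combination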
Thanks, I have seen pipelines before, and I think they are mostly for chaining estimators. Can we have a pre-step on the filtering and data scaling?
Like trying this or that filtering and this or that data scaling.
(Jun-05-2020, 06:08 AM)dervast Wrote: Can we have a pre-step on the filtering and data scaling?
Yes, we can! If you look at the example, it includes StandardScaler as a step in the pipeline. StandardScaler has its own set of kwargs, e.g. with_mean, with_std.
So, you can organize pgrid like this:

pgrid = {
    'scaler__with_mean': [True, False],
    'svc__C': [1, 10],
}
and use all of this in GridSearchCV. Thus, the data scaling step is incorporated into one model.
Finally, GridSearchCV allows you to find the best combination of parameters that influence not only the
classification step, but the preprocessing (scaling) too. Nothing prevents you from doing the same thing for data filtering: define a FilterData class (you can use the source code of StandardScaler as an example) and incorporate it into the pipeline.
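A minimal sketch of what such a FilterData class could look like (the name FilterData comes from the suggestion above, but the clip_value parameter and the clipping logic are purely illustrative assumptions; substitute whatever filtering your data actually needs):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class FilterData(BaseEstimator, TransformerMixin):
    """Illustrative filter: clips values to the range [-clip_value, clip_value]."""
    def __init__(self, clip_value=3.0):
        self.clip_value = clip_value

    def fit(self, X, y=None):
        # nothing to learn for this simple filter
        return self

    def transform(self, X):
        return np.clip(X, -self.clip_value, self.clip_value)


pipe = Pipeline(steps=[('filter', FilterData()),
                       ('scaler', StandardScaler()),
                       ('svc', SVC())])

# the filter's parameter can now be searched like any other pipeline parameter
pgrid = {
    'filter__clip_value': [2.0, 3.0],
    'scaler__with_mean': [True, False],
    'svc__C': [1, 10],
}

Passing this pipe and pgrid to GridSearchCV then searches the filtering, scaling, and classification parameters all in one run.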