Python Forum
Pipelines for different processing steps
#1
Hi all,
I have a dataset on which I want to try

-different filtering techniques
-different transformations
-different machine learning techniques.

Is there a way in Python to set up all the different variations I want to try, and then have Python run all the possible combinations?

Thank you in advance for your reply.
Regards
Alex
#2
If you are planning to use scikit-learn, you can define your own preprocessing classes, e.g.

from sklearn.base import BaseEstimator, TransformerMixin


class Prep1(BaseEstimator, TransformerMixin):
    def __init__(self, par1=None):
        self.par1 = par1  # or some value

    def fit(self, X, y=None):
        ...
        return self  # fit must return self

    def transform(self, X):
        ...


class Prep2(BaseEstimator, TransformerMixin):
    def __init__(self, par1=None):
        self.par1 = par1  # or some value

    def fit(self, X, y=None):
        ...
        return self

    def transform(self, X):
        ...
Make a pipeline,

from sklearn.pipeline import Pipeline

my_pipe = Pipeline(steps=[('prep1', Prep1()),
                          ('prep2', Prep2()),
                          ...  # other steps go here
                          ])
Finally, you can use GridSearchCV to try all parameter combinations, e.g.

from sklearn.model_selection import GridSearchCV

pgrid = {
    'prep1__par1': [1, 2, 3],
    'prep2__par1': [True, False],
    # maybe other pars for stages in the pipeline
}

search = GridSearchCV(my_pipe, pgrid, n_jobs=-1)
search.fit(X, y)
This is just pseudocode. As a starting point, you can look at the example in the official docs.
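To make the sketch above concrete, here is a small runnable version of the same pattern. The transformer name (ClipOutliers), its parameter (limit), and the dataset are illustrative choices, not anything from the thread:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ClipOutliers(BaseEstimator, TransformerMixin):
    """Clip features to [-limit, limit]; `limit` is tunable in the grid."""
    def __init__(self, limit=3.0):
        self.limit = limit

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return np.clip(X, -self.limit, self.limit)


X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline(steps=[('clip', ClipOutliers()),
                       ('clf', LogisticRegression(max_iter=1000))])

# Grid keys follow the '<step_name>__<param_name>' convention
pgrid = {
    'clip__limit': [1.0, 3.0],
    'clf__C': [0.1, 1.0],
}

search = GridSearchCV(pipe, pgrid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because ClipOutliers inherits from BaseEstimator, GridSearchCV can clone it and set `limit` automatically, so the preprocessing parameter is searched in the same pass as the classifier's.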
#3
Thanks, I have seen pipelines before, but I think they are mostly for chaining estimators. Can we have a pre-step for the filtering and data scaling?
Like trying this or that filtering and this or that data scaling?
#4
(Jun-05-2020, 06:08 AM)dervast Wrote: Can we have a pre-step on the filtering and data scaling?
Yes, we can! If you look at the example, it includes StandardScaler as a step in the pipeline. StandardScaler has its own set of kwargs, e.g. with_mean, with_std.
So, you can organize pgrid as

pgrid = {
    'scaler__with_mean': [True, False],
    'svc__C': [1, 10],
}
and use all of this in GridSearchCV. Thus, the data scaling step is incorporated into one model.
Finally, GridSearchCV lets you find the best combination of parameters, influencing not only the
classification step but the preprocessing (scaling) too. Nothing prevents you from doing the same for data filtering: define a FilterData class (you can use the source code of StandardScaler as an example) and incorporate it into the pipeline.
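A runnable sketch of this answer, with one addition: besides tuning a step's kwargs, the grid can also swap out the whole step object, which directly addresses "try this or that scaling" (the iris dataset and the MinMaxScaler alternative are illustrative choices; 'passthrough' is scikit-learn's built-in way to skip a step):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('svc', SVC())])

# The key 'scaler' (no '__') replaces the entire step; 'passthrough'
# tries the pipeline with no scaling at all.
pgrid = {
    'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough'],
    'svc__C': [1, 10],
}

search = GridSearchCV(pipe, pgrid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

The same trick works for a filtering step: list the alternative filter transformers under that step's name, and GridSearchCV evaluates every combination.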