Python Forum
Customizing an sklearn submodule with cython


Customizing an sklearn submodule with cython - JHogg11 - May-27-2020

I'd like to create a custom DecisionTreeRegressor to use with sklearn's RandomForestRegressor. To get the desired effect, however, I also need to create a custom Splitter, which determines how the training data is divided into leaf nodes and which sklearn implements in Cython. Here are the relevant sklearn files:

RandomForestRegressor (python): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_forest.py
DecisionTreeRegressor (python): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py
Splitter (cython): https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_splitter.pyx and https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_splitter.pxd

The approach that intuitively makes sense to me is to copy the _forest.py file and the entire tree submodule, edit the copies to customize the relevant classes, and then recompile. However, I want to make sure that I'm compiling in a manner that is consistent with the rest of sklearn. The problem is that I'm not sure exactly how sklearn compiles its Cython files, and I can't replicate the compilation using the standard methods (https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html) without getting errors. Inspecting the local sklearn package folder, I see a number of .so files that are not present in the GitHub repo. These appear to be generated by the setup.py file within the tree submodule (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/setup.py).
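To make the copy-and-recompile idea concrete, here is a minimal sketch of a build helper driven from Python rather than from the command line. It assumes the copied submodule lives in a hypothetical folder named my_tree, that paths are given relative to the project root, and that Cython, NumPy, setuptools, and a C compiler are installed. The names extension_spec and build_inplace are my own, and this is not sklearn's actual build configuration. The NumPy include directory is passed on the assumption that the copied files cimport NumPy.

```python
from pathlib import Path


def extension_spec(pyx_path, include_dirs=()):
    """Describe one Cython extension as keyword arguments for setuptools.Extension.

    pyx_path is assumed to be relative to the project root, for example
    "my_tree/_splitter.pyx" (a hypothetical location for the copied file).
    """
    path = Path(pyx_path)
    # "my_tree/_splitter.pyx" becomes the dotted module name "my_tree._splitter"
    module_name = ".".join(path.with_suffix("").parts)
    return {
        "name": module_name,
        "sources": [path.as_posix()],
        "include_dirs": list(include_dirs),
    }


def build_inplace(pyx_paths):
    """Compile the given .pyx files next to their sources (needs a C compiler)."""
    # Imported lazily so extension_spec stays usable without the build tools.
    import numpy
    from Cython.Build import cythonize
    from setuptools import Extension, setup

    extensions = [
        Extension(**extension_spec(p, include_dirs=[numpy.get_include()]))
        for p in pyx_paths
    ]
    # script_args stands in for running "python setup.py build_ext --inplace".
    setup(
        ext_modules=cythonize(
            extensions, compiler_directives={"language_level": "3"}
        ),
        script_args=["build_ext", "--inplace"],
    )
```

Calling build_inplace(["my_tree/_splitter.pyx"]) from the project root would then be the rough equivalent of the manual command-line step. Whether the result matches sklearn's own compiler flags is something I have not verified.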

It's worth mentioning that someone using sklearn doesn't have to go through a manual compilation step. With that in mind, is anyone aware of a way to compile customized Cython code from within a Python file (i.e., without additional command-line operations, similar to how sklearn apparently does it) that also allows for easy recompilation whenever a Cython file is edited?
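To illustrate the kind of behavior I'm after, here is a rough sketch with two pieces. The first is a plain timestamp check that triggers a rebuild only when the .pyx source is newer than the compiled file; the names is_stale and ensure_built are hypothetical, and the build step is passed in as a callback. The second uses Cython's pyximport module, which compiles .pyx files at import time; I'm assuming NumPy headers are needed, and I have not confirmed that pyximport copes with the relative cimports in the copied tree files.

```python
import os


def is_stale(source_path, built_path):
    """Return True if the compiled file is missing or older than its source."""
    if not os.path.exists(built_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(built_path)


def ensure_built(source_path, built_path, build):
    """Call build(source_path) only when the compiled file is out of date.

    built_path is the compiled extension's file name, which is
    platform-specific (a .so file on Linux). Returns True if a rebuild
    was triggered and False otherwise.
    """
    if is_stale(source_path, built_path):
        build(source_path)
        return True
    return False


def enable_pyx_import():
    """Let ordinary import statements compile .pyx files on demand."""
    # Imported lazily: both packages are only needed if this path is used.
    import numpy
    import pyximport

    pyximport.install(
        setup_args={"include_dirs": numpy.get_include()},
        language_level=3,
    )
```

With ensure_built, the top of a script could check each customized .pyx file before importing the compiled module, so editing a Cython file would lead to a rebuild on the next run without a separate command.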