Python Forum
Column Transformer with Mixed Types - sklearn
#1
Dear community,

I'm seeking your help with my data science thesis project. I'm looking for scikit-learn experts who can help me answer the following question. Your help is greatly appreciated!

I want to create a model that uses both numerical and textual features with scikit-learn:
- I want to scale the numerical features.
- I want to process the textual features with a TF-IDF vectorizer.

In the end, I would like a model whose hyperparameters I can tune (I'm starting out with Lasso but will move to ensemble models later). Ideally I would also tune the arguments of the TfidfVectorizer, such as the n-gram range, but I don't know whether this is possible.
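On the tuning question: scikit-learn addresses nested parameters with double-underscore names of the form step__substep__parameter, so the vectorizer's arguments can usually be searched together with the model's. The following is only a minimal sketch on toy data; the frame `toy`, its column names, and the step names are hypothetical stand-ins, not the thesis data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "count": [1, 4, 2, 8, 5, 7, 3, 6],
    "sentences": [
        "green energy report", "board governance policy",
        "carbon emissions target", "employee safety program",
        "renewable energy investment", "governance and board audit",
        "emissions reduction report", "community safety policy",
    ],
    "target": [0.2, 0.9, 0.4, 1.5, 1.0, 1.3, 0.6, 1.1],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["count"]),       # list -> 2D input
    ("text", TfidfVectorizer(), "sentences"),   # string -> 1D input
])
model = Pipeline([("preprocessor", preprocessor), ("regressor", Lasso())])

# Keys follow <pipeline step>__<transformer name>__<parameter>.
param_distributions = {
    "preprocessor__text__ngram_range": [(1, 1), (1, 2)],
    "regressor__alpha": [0.001, 0.01, 0.1, 1.0],
}
search = RandomizedSearchCV(
    model, param_distributions, n_iter=4, cv=2, random_state=0
)
search.fit(toy[["count", "sentences"]], toy["target"])
print(search.best_params_)
```

Calling model.get_params().keys() lists every name that can be used as a key in the search space, which helps when the nesting gets deeper.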

Below is my current progress. The code raises the following error: ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 143 and the array at index 1 has size 2.


from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso

y = df['esg']


text_features = ['sustainability_sentences', 'ESG_sentences_section_3']

numeric_features = ['section_1_esg_count', 'section_2_esg_count', 'section_3_esg_count',
    'section_4_esg_count', 'section_5_esg_count', 'section_6_esg_count',
    'section_7_esg_count', 'section_8_esg_count', 'total_esg_keyword_count',
    'sections_with_esg_keywords', 'section_1_environmental_count',
    'section_2_environmental_count', 'section_3_environmental_count',
    'section_4_environmental_count', 'section_5_environmental_count',
    'section_6_environmental_count', 'section_7_environmental_count',
    'section_8_environmental_count', 'total_environmental_keyword_count',
    'section_1_social_count', 'section_2_social_count',
    'section_3_social_count', 'section_4_social_count',
    'section_5_social_count', 'section_6_social_count',
    'section_7_social_count', 'section_8_social_count',
    'total_social_keyword_count', 'section_1_governance_count',
    'section_2_governance_count', 'section_3_governance_count',
    'section_4_governance_count', 'section_5_governance_count',
    'section_6_governance_count', 'section_7_governance_count',
    'section_8_governance_count', 'total_governance_keyword_count']



text_transformer = Pipeline(
    steps=[
        ("tfidf_vectorizer", TfidfVectorizer()),
    ]
)

numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("text", text_transformer, text_features)
    ],
    remainder='drop'
)

clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", Lasso())]
)

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
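A likely cause of the ValueError, sketched on toy data: TfidfVectorizer expects a 1D sequence of documents, but a list of column names makes ColumnTransformer hand it a two-column DataFrame. Iterating a DataFrame yields its column names, so the vectorizer sees two "documents", which would explain the size 2 in the message. One possible fix, assuming each text column should get its own vocabulary, is one vectorizer per column with the column given as a plain string. The frame `toy` and its column names below are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with one numeric and two text columns.
toy = pd.DataFrame({
    "n_keywords": [3, 0, 5, 2],
    "text_a": ["green energy", "board policy", "carbon target", "safety plan"],
    "text_b": ["annual report", "audit notes", "emissions data", "staff survey"],
})

# Reproduce the symptom: a 2-column frame is iterated column name by
# column name, so the result has 2 rows instead of len(toy).
two_docs = TfidfVectorizer().fit_transform(toy[["text_a", "text_b"]])
print(two_docs.shape[0])  # 2

# Possible fix: one vectorizer per text column, column passed as a string
# so the vectorizer receives a 1D Series of documents.
fixed = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["n_keywords"]),
        ("text_a", TfidfVectorizer(), "text_a"),
        ("text_b", TfidfVectorizer(), "text_b"),
    ],
    remainder="drop",
)
features = fixed.fit_transform(toy)
print(features.shape[0])  # 4, one row per sample
```

If a single shared vocabulary is preferred instead, another option is to concatenate the text columns into one column before vectorizing; which variant suits the thesis data is a modelling choice.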