Python Forum

Dear community,

I'm seeking your help with my data science thesis project. I'm looking for scikit-learn experts who can help me answer the following question. Your help is greatly appreciated!

I want to create a model that uses both numerical and textual features with scikit-learn:
- I want to scale the numerical features
- I want to process the textual features using a TF-IDF vectorizer

In the end, I would like to have a model whose parameters can be tuned (I'm starting out with Lasso, but will be moving to ensemble models later). Ideally I would also tune the arguments of the TF-IDF vectorizer, such as the n-gram range, but I don't know if this is possible.
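On the tuning question: scikit-learn addresses nested parameters with double underscores, so vectorizer arguments can be searched together with the model's. The following is only a minimal sketch on made-up data. The frame `toy`, its columns `count` and `text`, the step names, and the candidate values are all assumptions for illustration, not taken from the real project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the real DataFrame.
rng = np.random.default_rng(0)
words = ["climate", "emissions", "governance", "board", "diversity", "water"]
toy = pd.DataFrame({
    "count": rng.integers(0, 10, size=40),
    "text": [" ".join(rng.choice(words, size=5)) for _ in range(40)],
})
toy_y = toy["count"] * 0.5 + rng.normal(size=40)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["count"]),  # list selector: 2-D input for the scaler
        ("text", TfidfVectorizer(), "text"),   # string selector: 1-D input of documents
    ]
)
model = Pipeline(steps=[("preprocessor", preprocessor), ("regressor", Lasso())])

# Parameter names follow <pipeline step>__<transformer name>__<argument>.
param_distributions = {
    "preprocessor__text__ngram_range": [(1, 1), (1, 2)],
    "preprocessor__text__min_df": [1, 2],
    "regressor__alpha": [0.001, 0.01, 0.1, 1.0],
}

search = RandomizedSearchCV(
    model, param_distributions, n_iter=5, cv=3, random_state=0
)
search.fit(toy, toy_y)
print(search.best_params_)
```

Calling `model.get_params().keys()` lists every name that can be tuned this way, which helps when the step names differ from the ones assumed here.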

Below you can find my current progress. The code raises the following error: ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 143 and the array at index 1 has size 2.

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso

y = df['esg']

text_features = ['sustainability_sentences', 'ESG_sentences_section_3']

numeric_features = [
    'section_1_esg_count', 'section_2_esg_count', 'section_3_esg_count',
    'section_4_esg_count', 'section_5_esg_count', 'section_6_esg_count',
    'section_7_esg_count', 'section_8_esg_count', 'total_esg_keyword_count',
    'sections_with_esg_keywords', 'section_1_environmental_count',
    'section_2_environmental_count', 'section_3_environmental_count',
    'section_4_environmental_count', 'section_5_environmental_count',
    'section_6_environmental_count', 'section_7_environmental_count',
    'section_8_environmental_count', 'total_environmental_keyword_count',
    'section_1_social_count', 'section_2_social_count',
    'section_3_social_count', 'section_4_social_count',
    'section_5_social_count', 'section_6_social_count',
    'section_7_social_count', 'section_8_social_count',
    'total_social_keyword_count', 'section_1_governance_count',
    'section_2_governance_count', 'section_3_governance_count',
    'section_4_governance_count', 'section_5_governance_count',
    'section_6_governance_count', 'section_7_governance_count',
    'section_8_governance_count', 'total_governance_keyword_count',
]

text_transformer = Pipeline(
    steps=[
        ("tfifd_vectorizer", TfidfVectorizer()),
    ]
)

numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("text", text_transformer, text_features)
    ],
    remainder='drop'
)

clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", Lasso())]
)

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
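A possible reading of the error: TfidfVectorizer expects a 1-D sequence of documents, but a list of column names makes ColumnTransformer hand it a 2-D DataFrame. Iterating over a DataFrame yields its column labels, so the vectorizer would see only two "documents", which matches the size 2 in the message. One common way around this is to use one vectorizer per text column and select each column with a plain string. The sketch below illustrates this on made-up data. The frame `toy`, its column names, and the `alpha` value are assumptions, not the real thesis data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: two text columns and one numeric column.
toy = pd.DataFrame({
    "text_a": ["climate risk report", "board diversity policy",
               "water usage targets", "emissions reduction plan"],
    "text_b": ["governance code", "climate targets",
               "social impact", "board oversight"],
    "count": [3, 1, 4, 2],
})
toy_y = [0.3, 0.1, 0.4, 0.2]

# Reproducing the symptom: the vectorizer iterates over the column labels,
# so it returns 2 rows although the frame has 4.
bad = TfidfVectorizer().fit_transform(toy[["text_a", "text_b"]])
print(bad.shape)

# One vectorizer per text column, each selected with a string (not a list),
# so each vectorizer receives a 1-D column of documents.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["count"]),
        ("text_a", TfidfVectorizer(), "text_a"),
        ("text_b", TfidfVectorizer(), "text_b"),
    ],
    remainder="drop",
)
fixed = Pipeline(
    steps=[("preprocessor", preprocessor), ("regressor", Lasso(alpha=0.01))]
)
fixed.fit(toy, toy_y)

# The combined feature matrix now has one row per sample.
features = fixed.named_steps["preprocessor"].transform(toy)
print(features.shape[0])
```

If the text columns contain missing values, filling them with empty strings first (for example with `fillna("")`) avoids a separate failure inside the vectorizer.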

aaldb