Help! Logistic regression with balanced data.

xstazzie · (This post was last modified: Oct-14-2023, 10:59 AM by Larz60+.)

Hello, I'm doing an AI course where I need to complete an exercise on classification and machine learning, and I'm really struggling with it. I've done the whole exercise, but the only part that's wrong is the code at the bottom of this post, and the only error message I'm getting is: "X is assigned to the wrong data". If anyone could help me out with the code I'd really appreciate it.

I tried to use class_weight = "balanced" and split the data like X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50), but nothing seems to work.

The exercise is the following:
Course: https://ai-for-naturligt-sprak.ida.liu.se/
Chapter 3, question 10 (Logistic regression). In the following code cell, we will use the LogisticRegression model for classification of several classes. Set the model's parameters so that the following are true:

• random_state has a fixed value so that the model's results are reproducible.
• The model should be trained for at most 500 iterations, so that the model has time to converge to a good solution (even if it needs more steps to converge).
• The model must use saga for optimization.
• Optional: set verbose=True to print out what iteration the model is on.

To implement this, you probably need to read about the model's parameters in Scikit-learn's documentation.
The data is vectorized with CountVectorizer so that results can be compared to the Naive Bayes model. However, there are other vectorizers, for example one that transforms to tf-idf, TfidfVectorizer.
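To make the difference between the two vectorizers concrete, here is a minimal sketch on three invented toy documents (not the speeches data). It only shows that both produce a matrix of the same shape over the same vocabulary, with raw counts in one case and L2-normalized tf-idf weights in the other (the scikit-learn default).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, invented for illustration (not from the speeches data).
docs = ["skatt skatt budget", "budget skola", "skola skola skola"]

# Raw term counts per document.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)

# tf-idf weights per document (rows are L2-normalized by default).
tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(docs)

# Columns follow the alphabetically sorted vocabulary.
vocab = sorted(count_vect.vocabulary_)
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()

print(vocab)
print(counts.toarray())
print(tfidf.toarray().round(2))
print(row_norms)
```

Either vectorizer can be used as the first step of the Pipeline; the exercise uses CountVectorizer so that the results stay comparable to the Naive Bayes model.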

Train the model (unbalanced training data). Training on balanced data takes place in a subsequent code cell.

The following files are used in the assignment:

• speeches-201718.json.bz2 - Speeches in the Swedish Parliament, 2017/2018.
• speeches-201819.json.bz2 - Speeches in the Swedish Parliament, 2018/2019.
Our data consists of all speeches in the Swedish Parliament during sessions from the years 2017/2018 and 2018/2019. The raw data is taken from the Riksdag's open data and the speeches are divided into two files:

• speeches-201718.txt with 12,343 speeches
• speeches-201819.txt with 9,288 speeches

Code for unbalanced data:

import pandas as pd

import bz2
with bz2.open("data/ch3/speeches-201718.json.bz2") as source:
    speeches_201718 = pd.read_json(source)

with bz2.open("data/ch3/speeches-201819.json.bz2") as source:
    speeches_201819 = pd.read_json(source)

training_data, test_data = speeches_201718, speeches_201819

print("Antalet tal från 2017/2018:", len(training_data))
print("Antalet tal från 2018/2019:", len(test_data))

print("\nFem första datapunkterna:\n", speeches_201718.head())

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Create a LogisticRegression instance with the specified parameters
clf = LogisticRegression(random_state=42, max_iter=500,
                         solver='saga', verbose=True)

# Build a Pipeline that combines vectorization and the classification model
log_regr = Pipeline([('count_vect', vectorizer), ('log_regr', clf)])

# Training data: X and y, adapt these to your data
X = training_data['words']
y = training_data['party']

# Train the logistic regression model
log_regr.fit(X, y)

# Test data: X_test, adapt this to your test data
X_test = test_data['words']

# Make predictions with the trained model
pred_lr = log_regr.predict(X_test)

print("Urval av prediktioner [0-4]:", pred_lr[:5])


Create a confusion matrix comparing the classification from the logistic regression to the true values from the test data.

from sklearn.metrics import confusion_matrix
import pandas as pd

# Compute the confusion matrix
conf_mat = confusion_matrix(y_true=test_data['party'], y_pred=pred_lr)

# Get the unique labels
labels = sorted(test_data['party'].unique())

# Build a DataFrame to display the matrix more clearly
conf_mat_df = pd.DataFrame(conf_mat, index=labels, columns=labels)

print("Förväxlingsmatris:")
print(conf_mat_df)

As a comparison with the Naive Bayes model, the following code cell shows which parties are most often confused with each other.

count = 0
for index, row in conf_mat_df.iterrows():
    max_value = row.idxmax()
    print(conf_mat_df.columns.values[count],
          'förväxlas mest med', max_value,
          '.')
    count = count+1
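One thing to watch in the loop above: row.idxmax() looks at the whole row, including the diagonal, so whenever a party is mostly classified correctly it will be reported as "confused" with itself. A minimal sketch of one way to look only at the misclassifications, using an invented toy matrix (the labels "A", "B", "C" and the counts are assumptions for illustration, not course data):

```python
import numpy as np
import pandas as pd

# Toy confusion matrix (rows = true label, columns = predicted label).
# Labels and counts are invented for illustration.
labels = ["A", "B", "C"]
toy_conf = pd.DataFrame(
    [[50, 8, 2],
     [5, 40, 15],
     [1, 9, 30]],
    index=labels, columns=labels,
)

# Copy the counts and overwrite the diagonal (correct predictions) with -1,
# so that idxmax only picks among the misclassifications.
off_diag = toy_conf.to_numpy().copy()
np.fill_diagonal(off_diag, -1)
off_diag_df = pd.DataFrame(off_diag, index=labels, columns=labels)

# For each true label, the predicted label it is most often mistaken for.
most_confused = off_diag_df.idxmax(axis=1)

for true_label, predicted_label in most_confused.items():
    print(true_label, "is most often misclassified as", predicted_label)
```

Whether the course expects the diagonal to be excluded is not stated in the exercise text, so treat this as an optional variation.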

Create a classification report.

from sklearn.metrics import classification_report

# y_true holds the true classes and y_pred the predicted classes
y_true = test_data['party']  # the true classes
y_pred = pred_lr  # the predicted classes

# Create a classification report (formatted text)
class_report = classification_report(y_true, y_pred)

# Create the same classification report as a dictionary
class_report_dict = classification_report(y_true, y_pred, output_dict=True)

print("Klassificeringsrapport:")
print(class_report)
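For reference, here is a small sketch of what the dictionary form of the report looks like, since that is the object the exercise checks. The labels below are invented toy data, not the real predictions.

```python
from sklearn.metrics import classification_report

# Toy labels, invented for illustration.
y_true_toy = ["S", "S", "M", "M", "M", "V"]
y_pred_toy = ["S", "M", "M", "M", "S", "V"]

report = classification_report(y_true_toy, y_pred_toy, output_dict=True)

# Per-class entries are keyed by label; each one holds
# 'precision', 'recall', 'f1-score' and 'support'.
print(report["M"]["precision"], report["M"]["recall"])

# Overall accuracy is a plain float under the 'accuracy' key.
print(report["accuracy"])

# The averaged rows are available as 'macro avg' and 'weighted avg'.
print(sorted(report.keys()))
```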

Now do the same training and evaluation, but on BALANCED training data. Parts of the code are filled out below:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer()
clf = LogisticRegression(_____)

log_regr = Pipeline([('count_vect', vectorizer), ('log_regr',clf)])

X = _____
y = _____
log_regr.fit(X, y)

X_test = _____
pred_lr = log_regr.predict(X_test)


class_report = _____
# class_report_dict is used to check that the result is correct
class_report_dict = _____

print("Klassificeringsrapport:")
print(class_report)
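A note on what "balanced training data" could mean here. class_weight="balanced" only reweights the classes inside the model and leaves X itself unchanged, and train_test_split only splits the data without changing the class proportions much. One possible reading of the task (an assumption, since the earlier notebook cells are not shown) is that X and y should come from a version of the training data where every party has the same number of speeches, for example by undersampling each party down to the size of the smallest one. A minimal sketch of that idea on an invented toy frame; the names toy_training, balanced, X_balanced and y_balanced are hypothetical and not from the course:

```python
import pandas as pd

# Toy stand-in for training_data; the rows are invented for illustration.
toy_training = pd.DataFrame({
    "words": [f"tal {i}" for i in range(10)],
    "party": ["S"] * 5 + ["M"] * 3 + ["V"] * 2,
})

# Size of the smallest class.
smallest = toy_training["party"].value_counts().min()

# Draw the same number of rows from every party (undersampling).
# random_state keeps the sample reproducible.
balanced = (
    toy_training.groupby("party")
    .sample(n=smallest, random_state=42)
    .reset_index(drop=True)
)

print(balanced["party"].value_counts())

# These would then play the role of X and y when fitting the pipeline.
X_balanced = balanced["words"]
y_balanced = balanced["party"]
```

If the notebook already builds a balanced DataFrame in an earlier cell, that frame is presumably what the checker expects X and y to be taken from, while X_test would still come from the test data.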

Larz60+ wrote Oct-14-2023, 10:59 AM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
I modified it for you this time. Please use BBCode tags in future posts.
