ValueError: could not convert string to float: '4 AVENUE'

ValueError: could not convert string to float: '4 AVENUE' - Kudzo - Jan-23-2020

I tried to run a regression using:

regr = linear_model.LinearRegression()
regr.fit(X, y)

My data contains one column in DateTime format and another with physical addresses, such as '8300 4 AVENUE 1'. When I ran the code, I received the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-119-8a11d5d4a70e> in <module>
      1 regr = linear_model.LinearRegression()
----> 2 regr.fit(X, y)

~\New\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, sample_weight)
    456         n_jobs_ = self.n_jobs
    457         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 458                          y_numeric=True, multi_output=True)
    459 
    460         if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:

~\New\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    754                     ensure_min_features=ensure_min_features,
    755                     warn_on_dtype=warn_on_dtype,
--> 756                     estimator=estimator)
    757     if multi_output:
    758         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~\New\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    565         # make sure we actually converted to numeric:
    566         if dtype_numeric and array.dtype.kind == "O":
--> 567             array = array.astype(np.float64)
    568         if not allow_nd and array.ndim >= 3:
    569             raise ValueError("Found array with dim %d. %s expected <= 2."

ValueError: could not convert string to float: '4 AVENUE'

I decided to drop the datetime column at this stage but I need the address column for my analysis.
Please, do help me.
Thank you in advance

I also tried converting the address column to float, but that turned the whole column into NaN, which made the whole process useless.
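
For reference, both symptoms described above can be reproduced on a tiny example. This is only a sketch: the `addresses` series is a made-up stand-in for the address column, and it assumes the NaN values came from a coercing conversion such as `pd.to_numeric(..., errors="coerce")`, which the post does not confirm.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the address column described in the post.
addresses = pd.Series(["8300 4 AVENUE", "120 BROADWAY", "45"])

# Casting text to float64 fails as soon as a value is not a plain number.
try:
    addresses.to_numpy().astype(np.float64)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# errors="coerce" does not fail; it replaces every non-numeric value with NaN.
coerced = pd.to_numeric(addresses, errors="coerce")

print(error_message)
print(coerced)
```

Neither route turns free text into a meaningful number, so the column has to be either dropped or encoded in some other way before fitting.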

RE: ValueError: could not convert string to float: '4 AVENUE' - scidam - Jan-24-2020

In general, linear regression expects numbers, so you need to do some feature engineering first. For example, you can convert addresses to coordinates (lat and lon), if that makes sense for the problem you're trying to solve. You can also build a separate regression model for each address you have. You can handle dates as shown here.
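
As a rough sketch of that kind of feature engineering (the rows below are invented for illustration, and the column names are assumed to match the dataset in question), the date can be split into integer parts and the text address left out in favour of coordinates:

```python
import pandas as pd

# Made-up rows; the coordinates are illustrative, not real geocoding results.
df = pd.DataFrame({
    "created_date": ["2020-01-23 08:15:00", "2020-01-24 17:40:00"],
    "incident_address": ["8300 4 AVENUE", "120 BROADWAY"],
    "latitude": [40.62, 40.71],
    "longitude": [-74.03, -74.01],
})

# Parse the timestamp, then pull out integer components.
df["created_date"] = pd.to_datetime(df["created_date"], errors="coerce")
df["hour"] = df["created_date"].dt.hour
df["day"] = df["created_date"].dt.day
df["month"] = df["created_date"].dt.month
df["year"] = df["created_date"].dt.year

# Keep only numeric columns; the raw address string is not passed to the model.
features = df[["hour", "day", "month", "year", "latitude", "longitude"]]
print(features.dtypes)
```

Using the `.dt.hour` style accessors gives integers directly, whereas `strftime` would give strings that need converting again.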

RE: ValueError: could not convert string to float: '4 AVENUE' - Kudzo - Jan-24-2020

Thank you, scidam. I've been able to deal with the timestamp problem. My challenge now is that I'm trying to predict a string variable, and I expected the system to convert it to float.
Actually, I'm using the (in)famous NYC 311 service complaints data and need to predict the number of future complaints (for a particular complaint type I'd identified earlier). Everything goes well, but even when I try to convert the string-formatted complaint type (my dependent variable), it gives me the same kind of error message:

ValueError: could not convert string to float: 'HEATING'

Please, is there any other way to treat this variable?
Below is a summary of my work so far:

# Imports used in this summary
import os
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Fetching the data (use the cached CSV if it already exists)
source = 'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$limit=10000000&Agency=HPD&$select=created_date,unique_key,complaint_type,Descriptor,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status'
if os.path.isfile('./assets/csr/erm2-nwe9.csv'):
    my_data = pd.read_csv('./assets/csr/erm2-nwe9.csv', sep=',', parse_dates=['created_date', 'closed_date'], low_memory=False, index_col = [0])
else:
    my_data = pd.read_csv(source, sep=',', parse_dates=['created_date', 'closed_date'], low_memory=False, index_col = [0])
    # to_csv has no index_col argument; the index is written by default
    my_data.to_csv('./assets/csr/erm2-nwe9.csv')

# Identifying the commonest complaint type:
my_data1 = my_data.loc[my_data['complaint_type']=='HEATING'].dropna()

# Dealing with DateTime:
my_data1['created_date'] = pd.to_datetime(my_data1['created_date'],errors="coerce")
my_data1['Hour'] = my_data1["created_date"].dt.strftime('%H')    
my_data1['Day'] = my_data1["created_date"].dt.strftime('%d')    
my_data1['Month'] = my_data1["created_date"].dt.strftime('%m')    
my_data1['Year'] = my_data1["created_date"].dt.strftime('%Y')

# Dropping unnecessary columns
my_data_1 = my_data1.drop(['unique_key', 'created_date', 'incident_address', 'street_name', 'address_type', 
        'city', 'resolution_description', 'location_type', 'borough', 'closed_date', 'status'], axis = 1)
my_data_1 = my_data_1.dropna()

#Splitting the data
X=my_data_1.loc[:,my_data_1.columns != "complaint_type"]
y=my_data_1["complaint_type"]
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardizing the data
sc = StandardScaler()  
X_trainset = sc.fit_transform(X_trainset)  
X_testset = sc.transform(X_testset)

# Running the regression model
regr = linear_model.LinearRegression()
regr.fit(X, y)

And here's where the problems begin.
Please, do help.
Thank you
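
One possible reading of the goal stated above: if the quantity to predict is the number of complaints, then the target can be a count per time period instead of the complaint_type string, which is always 'HEATING' after filtering. The sketch below is only an illustration under that assumption. The timestamps are invented, and the choice of daily counts and of the two calendar features is hypothetical, not something taken from the original work.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up complaint timestamps, all of one complaint type.
complaints = pd.DataFrame({
    "created_date": pd.to_datetime([
        "2020-01-01 08:00", "2020-01-01 09:30", "2020-01-02 10:00",
        "2020-01-03 07:45", "2020-01-03 12:10", "2020-01-03 18:20",
    ]),
    "complaint_type": ["HEATING"] * 6,
})

# Count complaints per calendar day; this count is the numeric target.
daily = (
    complaints.groupby(complaints["created_date"].dt.floor("D"))
    .size()
    .rename("n_complaints")
    .reset_index()
)

# Simple numeric calendar features derived from the day itself.
daily["day_of_year"] = daily["created_date"].dt.dayofyear
daily["day_of_week"] = daily["created_date"].dt.dayofweek

X = daily[["day_of_year", "day_of_week"]]
y = daily["n_complaints"]

# Both X and y are numeric now, so fit() no longer has strings to convert.
model = LinearRegression().fit(X, y)
print(daily)
```

With only three days of toy data the fitted model means nothing; the point is just that X and y are both numeric.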

RE: ValueError: could not convert string to float: '4 AVENUE' - jefsummers - Jan-26-2020

Consider one-hot encoding. See if this helps:
One hot encoding a feature in a dataframe
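
A minimal sketch of what one-hot encoding looks like with pandas, assuming a small categorical column such as borough (the rows are made up for illustration):

```python
import pandas as pd

# Made-up rows with one categorical column and one numeric column.
df = pd.DataFrame({
    "borough": ["BROOKLYN", "BRONX", "BROOKLYN", "QUEENS"],
    "hour": [8, 17, 9, 22],
})

# Each distinct category becomes its own 0/1 column; numeric columns are kept.
encoded = pd.get_dummies(df, columns=["borough"], dtype=float)
print(encoded)
```

This works best for columns with a handful of distinct values. A street address column with thousands of unique values would produce one column per address, so it may need grouping or a different encoding first.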

RE: ValueError: could not convert string to float: '4 AVENUE' - Kudzo - Jan-26-2020

(Jan-26-2020, 12:34 PM)jefsummers Wrote: Consider one-hot encoding. See if this helps: One hot encoding a feature in a dataframe

Thank you, jefsummers