ValueError: could not convert string to float: '4 AVENUE'
Python Forum (https://python-forum.io), Data Science forum, thread /thread-23942.html
ValueError: could not convert string to float: '4 AVENUE' - Kudzo - Jan-23-2020

I tried to run a regression using

    regr = linear_model.LinearRegression()
    regr.fit(X, y)

My data contains a column in DateTime format and another with physical addresses, such as '8300 4 AVENUE 1'. When I ran the code, I received the following error:

    ValueError                                Traceback (most recent call last)
    <ipython-input-119-8a11d5d4a70e> in <module>
          1 regr = linear_model.LinearRegression()
    ----> 2 regr.fit(X, y)

    ~\New\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, sample_weight)
        456         n_jobs_ = self.n_jobs
        457         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
    --> 458                          y_numeric=True, multi_output=True)
        459
        460         if sample_weight is not None and np.atleast_1d(sample_weight).ndim > 1:

    ~\New\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
        754                     ensure_min_features=ensure_min_features,
        755                     warn_on_dtype=warn_on_dtype,
    --> 756                     estimator=estimator)
        757     if multi_output:
        758         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

    ~\New\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
        565         # make sure we actually converted to numeric:
        566         if dtype_numeric and array.dtype.kind == "O":
    --> 567             array = array.astype(np.float64)
        568         if not allow_nd and array.ndim >= 3:
        569             raise ValueError("Found array with dim %d. %s expected <= 2."

    ValueError: could not convert string to float: '4 AVENUE'

I decided to drop the datetime column at this stage, but I need the address column for my analysis. Please, do help me.
Thank you in advance.

I also tried to convert the address column to float, but that turned the whole column into NaN, rendering the whole process useless.

RE: ValueError: could not convert string to float: '4 AVENUE' - scidam - Jan-24-2020

In general, linear regression expects numbers, so you need to perform some feature engineering first. For example, you can convert addresses to coordinates, lat and lon (if this makes sense for the problem you're trying to solve). You can also build a separate regression model for each address you have. You can handle dates as shown here.

RE: ValueError: could not convert string to float: '4 AVENUE' - Kudzo - Jan-24-2020

Thank you, scidam. I've been able to deal with the timestamp problem. My challenge is that I'm trying to predict a string variable, and so I expect the system to convert it to float. Actually, I'm using the (in)famous NYC 311 service complaints data and need to predict the number of future complaints (for a particular complaint type I'd identified earlier). Everything goes well, but even when I tried to convert the string-formatted complaint type (the dependent variable), it gave me the same kind of error message: ValueError: could not convert string to float: 'HEATING'. Please, is there any other way to treat this variable?
Below is a summary of my work so far:

    import os
    import pandas as pd
    from sklearn import linear_model
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Fetching the data (use the local copy if it has already been downloaded)
    source = 'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$limit=10000000&Agency=HPD&$select=created_date,unique_key,complaint_type,Descriptor,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status'
    if os.path.isfile('./assets/csr/erm2-nwe9.csv'):
        my_data = pd.read_csv('./assets/csr/erm2-nwe9.csv', sep=',', parse_dates=['created_date', 'closed_date'], low_memory=False, index_col=[0])
    else:
        my_data = pd.read_csv(source, sep=',', parse_dates=['created_date', 'closed_date'], low_memory=False, index_col=[0])
        # to_csv has no index_col argument; the index is written by default
        my_data.to_csv('./assets/csr/erm2-nwe9.csv')

    # Identifying the commonest complaint type:
    my_data1 = my_data.loc[my_data['complaint_type'] == 'HEATING'].dropna()

    # Dealing with DateTime:
    my_data1['created_date'] = pd.to_datetime(my_data1['created_date'], errors="coerce")
    my_data1['Hour'] = my_data1["created_date"].dt.strftime('%H')
    my_data1['Day'] = my_data1["created_date"].dt.strftime('%d')
    my_data1['Month'] = my_data1["created_date"].dt.strftime('%m')
    my_data1['Year'] = my_data1["created_date"].dt.strftime('%Y')

    # Dropping unnecessary columns
    my_data_1 = my_data1.drop(['unique_key', 'created_date', 'incident_address', 'street_name', 'address_type', 'city', 'resolution_description', 'location_type', 'borough', 'closed_date', 'status'], axis=1)
    my_data_1 = my_data_1.dropna()

    # Splitting the data
    X = my_data_1.loc[:, my_data_1.columns != "complaint_type"]
    y = my_data_1["complaint_type"]
    X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=0)

    # Standardizing the data
    sc = StandardScaler()
    X_trainset = sc.fit_transform(X_trainset)
    X_testset = sc.transform(X_testset)

    # Running the regression model
    regr = linear_model.LinearRegression()
    regr.fit(X, y)

And here's where the problems begin. Please, do help.
Thank you

RE: ValueError: could not convert string to float: '4 AVENUE' - jefsummers - Jan-26-2020

Consider one-hot encoding. See if this helps: One hot encoding a feature in a dataframe

RE: ValueError: could not convert string to float: '4 AVENUE' - Kudzo - Jan-26-2020

(Jan-26-2020, 12:34 PM)jefsummers Wrote: Consider one-hot encoding. See if this helps: One hot encoding a feature in a dataframe

Thank you, jefsummers
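
For anyone landing on this thread later, here is a minimal sketch of how the two suggestions above might fit together. It is not the code from the linked post, and the rows are toy data invented for illustration, not taken from the NYC 311 dataset. It assumes the goal is to predict the number of complaints, so it first counts complaints per month and borough to get a numeric target (the complaint type itself is a constant string once the data is filtered to 'HEATING'). It then one-hot encodes the string column with pd.get_dummies so that LinearRegression sees only numbers. The column name n_complaints is made up for this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy rows invented for illustration: one row per complaint, as in the raw data.
raw = pd.DataFrame({
    "created_date": pd.to_datetime([
        "2019-01-03", "2019-01-15", "2019-01-20", "2019-02-02",
        "2019-02-11", "2019-02-25", "2019-03-05", "2019-03-09",
    ]),
    "borough": ["BRONX", "BRONX", "BROOKLYN", "BRONX",
                "BROOKLYN", "BROOKLYN", "BRONX", "BROOKLYN"],
})

# Build a numeric target: the number of complaints per (month, borough).
raw["month"] = raw["created_date"].dt.month
counts = (
    raw.groupby(["month", "borough"])
       .size()
       .reset_index(name="n_complaints")  # hypothetical column name
)

# One-hot encode the string feature: each borough becomes its own 0/1 column.
X = pd.get_dummies(counts[["month", "borough"]], columns=["borough"], dtype=float)
y = counts["n_complaints"]

# Every column is numeric now, so fit() no longer has strings to convert.
regr = LinearRegression()
regr.fit(X, y)

print(X.columns.tolist())
print(regr.predict(X))
```

One-hot encoding suits columns with a handful of categories, such as borough. A free-text address column would produce one column per distinct address, so converting addresses to latitude and longitude, as scidam suggested, is probably the better fit there.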