Python Forum
Data Science Project - Printable Version




Data Science Project - DaisyPJ - Jan-17-2020

Hi

I am new to data science and am wondering whether the following would qualify as a project that can be implemented in Python using ML algorithms:

I have a master dataset that I will have to extract from a PDF. It will have two fields, e.g. AreaCode and Area, as below:

AreaCode Area
3100 Gate
3110 Sumps
3230 Fireworks
4222 Air Purifier
4335 Water Filter

I have a second dataset, created by searching a PDF and extracting data, which has one field, ObjectName, e.g.
ObjectName
A1-G-3100012
A1-K-3100010
A1-K-3230010
A1-P-3230015
A1-P-4222015
A1-G-4235016
A1-G-4335012
A1-K-3110010
A1-K-3230010
A1-P-3230025
A1-P-4335075
A1-G-4235086
A1-M-3100012
A1-H-3100010
A1-H-3230010
A1-V-3230015
A1-V-4222015
A1-M-4235016
A1-M-4335012
A1-H-3110010
A1-H-3230010
A1-V-3230025
A1-V-4335075
A1-M-4235086

I want to create a model that will learn the first dataset and populate AreaCode in the second dataset.

Does this make sense as an application of data science?

Sorry for my ignorance; I would appreciate some input.

Regards


RE: Data Science Project - jefsummers - Jan-17-2020

I don't see that as data science - you are just going to create an algorithm that compares a slice of the strings in the second data set with the first set, which is a lookup table.
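As a rough illustration of that lookup, here is a minimal sketch. It assumes the area code is the first four digits of the last dash-separated part of the object name, which is how the sample data appears to be laid out; the names AREAS, area_code and lookup are made up for this example.

```python
# Master table from the first post: AreaCode -> Area.
AREAS = {
    "3100": "Gate",
    "3110": "Sumps",
    "3230": "Fireworks",
    "4222": "Air Purifier",
    "4335": "Water Filter",
}


def area_code(object_name):
    """Return the assumed 4-digit area code embedded in an object name."""
    # Assumption: the last dash-separated part starts with the area code,
    # e.g. "A1-G-3100012" -> "3100012" -> "3100".
    return object_name.split("-")[-1][:4]


def lookup(object_name):
    """Return (AreaCode, Area); Area is None when the code is not in the table."""
    code = area_code(object_name)
    return code, AREAS.get(code)


for name in ["A1-G-3100012", "A1-P-4222015", "A1-G-4235016"]:
    print(name, *lookup(name))
```

Note that some object names in the sample (those with 4235) carry a code that is not in the master table, so the lookup returns None for the area rather than guessing.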

Now if you had a purchase history and wanted to predict the most likely subsequent purchases based on the first, that would be data science (a recommendation system). For example, if someone buys a sump, should you as the vendor send them a spam email advertising a water filter or fireworks?


RE: Data Science Project - DaisyPJ - Jan-19-2020

Thank you for the response.
Wouldn't this at least qualify as a classification problem?

Regards


RE: Data Science Project - jefsummers - Jan-19-2020

Quoting from Data Science for Dummies (please do not take offence, that is not intended) "With classification algorithms, you take an existing dataset and use what you know about it to generate a predictive model for use in classification of future data points. If your goal is to use your dataset and its known subsets to build a model for predicting the categorization of future data points, you’ll want to use classification algorithms."

You don't need a model: you can say with 100% certainty what the class is, because it is included in the object name. So I would not call this classification, but that's just my opinion.
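If you want to confirm that the code really can be read off every object name, a quick check along these lines might help. It is only a sketch: the pattern (one letter and a digit, a dash, one letter, a dash, then a 4-digit area code followed by 3 more digits) is my guess from the sample data, and PATTERN, KNOWN_CODES and check are hypothetical names.

```python
import re

# Assumed layout, inferred from the sample: e.g. "A1-G-3100012",
# where the captured group is the 4-digit area code.
PATTERN = re.compile(r"^[A-Z]\d-[A-Z]-(\d{4})\d{3}$")

# Area codes listed in the master table from the first post.
KNOWN_CODES = {"3100", "3110", "3230", "4222", "4335"}


def check(object_names):
    """Return (malformed names, well-formed names whose code is not in the table)."""
    malformed, unknown = [], []
    for name in object_names:
        match = PATTERN.match(name)
        if match is None:
            malformed.append(name)
        elif match.group(1) not in KNOWN_CODES:
            unknown.append(name)
    return malformed, unknown


# "bad-name" is a made-up entry to show what a malformed name looks like.
sample = ["A1-K-3230010", "A1-G-4235016", "A1-M-4235086", "bad-name"]
malformed, unknown = check(sample)
print("malformed:", malformed)
print("unknown code:", unknown)
```

Anything reported as malformed or unknown is a data-cleaning question (a typo in the PDF, or a code missing from the master table) rather than something a trained model should be asked to predict.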