Python Forum
Using 'Text' features for model
#1
Hi All,

I'm relatively new to Python and to concepts such as machine learning. However, the methods and possibilities Python offers are so nice that they make me curious enough to learn more.

I've done some simple projects where, until today, I could safely use only numbers. Today I tried to start a more 'complicated' project (for myself ;-)) and found out there are a lot of 'text' fields in my dataset. Some of them are easy to fix (stripping spaces etc.), and some are hard for me to fix, like a code that combines text and numbers. For example: HJ-0-A5-CO-384823232-A385983.

I could 'split' all those codes into numbers (binary, or just my own coding). But that would be a waste of time, because I think someone else must have run into this problem before.

To be clear: I'm trying to make a simple model that uses these 'codes' and numbers as features.

Who can help me out?

(Options I have looked into: NLP with scikit-learn and NLTK, but I can't find a solution for codes like these.)

Thanks!

TP
#2
Why do you want to make a model based on those numbers? Do they have any meaning that isn't coded elsewhere in the data? I've used data with similarly complicated codes, parts of which had some meaning, but that meaning was always coded elsewhere in the data in a more appropriate (and often more precise) form.
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
#3
(Jul-09-2018, 12:08 PM)ichabod801 Wrote: Why do you want to make a model based on those numbers? Do they have any meaning that isn't coded elsewhere in the data? I've used data with similarly complicated codes, parts of which had some meaning, but that meaning was always coded elsewhere in the data in a more appropriate (and often more precise) form.

Hi, Thanks for answering.

To answer your question: yes, the code describes a breakdown structure of assets. For example, the first 2 characters describe the country, the next part the site (where it's located), and so on.

In some cases, the locations for some codes are difficult to access. But while typing this and reading your point, I'm starting to think about cutting the code into pieces, so I can compare the steps in more detail.
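A minimal sketch of cutting such a code into pieces, assuming the parts are always separated by hyphens. Only "country" and "site" are named in this thread; the other field names (`part_3` to `part_6`) and the helper `split_asset_code` are hypothetical placeholders.

```python
# Hypothetical field names: only "country" and "site" are described in the thread.
FIELD_NAMES = ["country", "site", "part_3", "part_4", "part_5", "part_6"]


def split_asset_code(code, sep="-"):
    """Split a separator-delimited asset code into named pieces."""
    parts = code.strip().split(sep)
    # zip() stops at the shorter sequence, so a code with fewer or more
    # pieces than FIELD_NAMES does not raise; it is silently truncated.
    return dict(zip(FIELD_NAMES, parts))


record = split_asset_code("HJ-0-A5-CO-384823232-A385983")
print(record["country"], record["site"])
```

Each piece then becomes its own column, which can be compared or encoded separately.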

If anyone, or you, has some thoughts on this topic, feel free to share.

Thanks!

Regards
#4
If you had the data, and you were to do it manually, how would you handle those codes?

There is no way to do it automatically until you know how to do it manually, unless you already have a huge dataset of codes and what those codes translate into, so that you can model that data to build a probability matrix (i.e., if you see "C0", it normally means "43254").
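One possible reading of the "probability matrix" idea, as a rough sketch: count how often each code piece was observed with each meaning, then normalise the counts. The observations below are made up for illustration, and `build_probability_table` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def build_probability_table(pairs):
    """Estimate P(meaning | piece) from observed (piece, meaning) pairs."""
    counts = defaultdict(Counter)
    for piece, meaning in pairs:
        counts[piece][meaning] += 1

    table = {}
    for piece, meaning_counts in counts.items():
        total = sum(meaning_counts.values())
        table[piece] = {m: n / total for m, n in meaning_counts.items()}
    return table


# Made-up observations, not real data from the thread.
observed = [
    ("C0", "43254"),
    ("C0", "43254"),
    ("C0", "43254"),
    ("C0", "99999"),
    ("A5", "10001"),
]

table = build_probability_table(observed)
# Pick the meaning with the highest estimated probability for "C0".
most_likely = max(table["C0"], key=table["C0"].get)
print(most_likely, table["C0"][most_likely])
```

This only works if such a labelled dataset of codes and meanings already exists.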
#5
Well, if the information isn't coded elsewhere in the data, I would just add variables, probably categorical ones, to add that information to the data.
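As a rough illustration of adding categorical variables, the sketch below one-hot encodes code pieces in plain Python. The rows, the value "NL", and the helper `one_hot_encode` are made up for the example; libraries such as pandas (`get_dummies`) or scikit-learn (`OneHotEncoder`) offer the same idea ready-made.

```python
def one_hot_encode(rows, columns):
    """Turn categorical columns into 0/1 indicator features."""
    # Collect the distinct values seen in each column, in sorted order.
    categories = {col: sorted({row[col] for row in rows}) for col in columns}

    encoded = []
    for row in rows:
        features = {}
        for col in columns:
            for value in categories[col]:
                features[f"{col}={value}"] = 1 if row[col] == value else 0
        encoded.append(features)
    return encoded


# Made-up rows; "country" and "site" stand for pieces of the asset code.
rows = [
    {"country": "HJ", "site": "0"},
    {"country": "HJ", "site": "1"},
    {"country": "NL", "site": "0"},
]

encoded = one_hot_encode(rows, ["country", "site"])
print(encoded[0])
```

Indicator columns like these can be fed to a model alongside the existing numeric features.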
#6
(Jul-09-2018, 07:56 PM)nilamo Wrote: If you had the data, and you were to do it manually, how would you handle those codes?

There is no way to do it automatically until you know how to do it manually, unless you already have a huge dataset of codes and what those codes translate into, so that you can model that data to build a probability matrix (i.e., if you see "C0", it normally means "43254").

I made a dataset with these codes, so I might give it a shot. I think I'll make separate variables out of the full code. It will probably give some good insights.

This afternoon I tried to do it with a probability matrix. It saved some time, but never enough (:.

Thanks for thinking with me!

(Jul-09-2018, 07:57 PM)ichabod801 Wrote: Well, if the information isn't coded elsewhere in the data, I would just add variables, probably categorical ones, to add that information to the data.

Thanks for the tip! I'll try this out tomorrow.
#7
Hi Guys,

FYI: Cutting the code into pieces worked fine for me.