Python Forum

Random Forest to Identify Page: Feature Selection
Hi,

I am new to machine learning. I know of a project that used Random Forest to identify the type of each page in financial reports, i.e. whether a page is the Cash Flow Statement or the Income Statement.

The features for the model:
1) Bag of Words (BOW) for all pages in all the financial reports
2) word_check_flow: 1 if page has word "flow"; 0 otherwise
3) word_check_income: 1 if page has {“income” & “expense”} or {“revenue”, “sales”, “loss”}; 0 otherwise
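
To make the three features concrete, here is a rough sketch of how they might be built for each page. This is only my guess at the setup, not the project's actual code: the helper names (tokenize, build_vocabulary, page_features) and the sample pages are made up, words are matched as whole lowercase tokens, and I read rule 3 as "both income and expense, or any one of revenue/sales/loss".

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(pages):
    # One BOW column per distinct word seen across all pages.
    return sorted({tok for page in pages for tok in tokenize(page)})


def page_features(text, vocabulary):
    counts = Counter(tokenize(text))
    bow = [counts[w] for w in vocabulary]  # feature 1: word counts
    words = set(counts)
    word_check_flow = int("flow" in words)  # feature 2
    # Feature 3, under my reading of the rule (an assumption):
    # (income AND expense) OR any of revenue / sales / loss.
    word_check_income = int(
        {"income", "expense"} <= words
        or bool({"revenue", "sales", "loss"} & words)
    )
    return bow + [word_check_flow, word_check_income]


# Hypothetical sample pages, for illustration only.
pages = [
    "Net cash flow from operating activities",
    "Revenue and other income less operating expense",
]
vocabulary = build_vocabulary(pages)
X = [page_features(p, vocabulary) for p in pages]
```

In this sketch word_check_flow is 1 exactly when the BOW count for "flow" is above zero, and word_check_income is a fixed combination of five BOW columns, so both flags are computed from information the BOW columns already hold.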

I am puzzled as to why word_check_flow and word_check_income are needed as features when BOW already gives the count of each word on the page.

Thank you