Python Forum

Full Version: Partitioning when splitting data into train and test-dataset
[Image: mDDYhdn]
In this image you can see a simplified example of what my dataset looks like.

My goal is to create a text classifier that can predict whether a paragraph from a document has one or more labels (multi-label classification), but my very first step is to split the data into a train and a test set. The CSV file with the data contains many paragraphs from multiple documents.
The issue is that I need to make the split at the document level, to make sure there are never some paragraphs from one document in the train set and other paragraphs from the same document in the test set.

I know how sklearn's train_test_split() works, but doing this while also making sure that the documents from the train set are not present in the test set is something I've already researched and still have no clue about :/.

Could anyone tell me how I can make this happen? I would really appreciate it.
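One way to sketch this is to shuffle the unique document IDs first and then assign whole documents to either side of the split, so paragraphs can never straddle the two sets. Below is a minimal stdlib-only sketch; the column layout (a `document_id` per paragraph row) and the example data are assumptions, since the actual CSV structure isn't shown.

```python
import random

# Hypothetical rows: (document_id, paragraph_text, labels) per paragraph.
rows = [
    ("doc1", "para a", ["label1"]),
    ("doc1", "para b", ["label2"]),
    ("doc2", "para c", ["label1", "label3"]),
    ("doc2", "para d", ["label2"]),
    ("doc3", "para e", ["label3"]),
]

def split_by_document(rows, test_fraction=0.2, seed=42):
    """Split paragraph rows into train/test so that all paragraphs
    of a given document land on the same side of the split."""
    doc_ids = sorted({doc_id for doc_id, _, _ in rows})
    rng = random.Random(seed)
    rng.shuffle(doc_ids)
    n_test = max(1, round(len(doc_ids) * test_fraction))
    test_docs = set(doc_ids[:n_test])
    train = [r for r in rows if r[0] not in test_docs]
    test = [r for r in rows if r[0] in test_docs]
    return train, test

train, test = split_by_document(rows)
# No document appears in both sets
assert {r[0] for r in train}.isdisjoint({r[0] for r in test})
```

Note that this splits on the number of documents, not the number of paragraphs, so the actual train/test paragraph ratio can drift if documents vary a lot in length. scikit-learn also ships GroupShuffleSplit in sklearn.model_selection, which does the same group-aware splitting when you pass the document IDs as the groups argument.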