Python Forum
Partitioning when splitting data into train and test-dataset - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Partitioning when splitting data into train and test-dataset (/thread-23026.html)



Partitioning when splitting data into train and test-dataset - Den0st - Dec-07-2019

[Image: mDDYhdn]
This image shows a simplified example of what my dataset looks like.

My goal is to create a text classifier that can predict whether a paragraph from a document has one or more labels (multi-label classification), but my very first step is to split the data into train and test sets. The CSV file contains many paragraphs from multiple documents.
The issue is that I need to make the split at the document level, so that paragraphs from one document never end up in the train set while other paragraphs from the same document land in the test set.

I know how sklearn's train_test_split() works, but I've already researched how to combine it with the requirement that documents in the train set do not appear in the test set, and I still have no clue :/.

Could anyone help me figure out how to make this happen? I would really appreciate it.
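
For illustration, one way to get a document-level split is sklearn's GroupShuffleSplit, which keeps every row sharing the same group value (here: the same document) on one side of the split. This is only a sketch under assumptions about the data layout: the file name paragraphs.csv and the column names document_id, text, and labels are made up and would need to match the real CSV.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Assumed layout: one row per paragraph, with a column identifying the source document.
df = pd.read_csv("paragraphs.csv")  # hypothetical file with document_id, text, labels columns

# GroupShuffleSplit assigns whole groups (documents) to either train or test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["document_id"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# Sanity check: no document appears in both sets.
assert set(train_df["document_id"]).isdisjoint(set(test_df["document_id"]))
```

An equivalent manual approach would be to apply train_test_split() to the list of unique document IDs and then filter the paragraph rows by those IDs; GroupShuffleSplit just packages that logic. Note that neither approach stratifies by label, which can matter for multi-label data.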