Thread: Partitioning when splitting data into train and test-dataset (/thread-23026.html)
Python Forum (https://python-forum.io), Forum: Data Science (https://python-forum.io/forum-44.html)
Den0st - Dec-07-2019

In this image you can see a simplified example of what my dataset looks like. My goal is to create a text classifier that predicts whether a paragraph from a document has one or more labels (multi-label classification), but my very first step is to split the data into train and test data.

The CSV file contains many paragraphs from multiple documents. The issue is that I need to split at the document level, so that no document ends up with some of its paragraphs in the train set and others in the test set.

I know how sklearn's train_test_split() works, but I don't see how to use it while also making sure that documents in the train set never appear in the test set. I have already researched this and still have no clue. :/

Could anyone tell me how I can make this happen? I would really appreciate it.
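
One possible approach to the document-level split asked about here is to split the list of unique document ids rather than the paragraphs, and then assign every paragraph to whichever side its document landed on. The sketch below uses only the standard library. The column name `document_id`, the helper `split_by_document`, and the toy rows are assumptions made for illustration, since the actual CSV layout is not shown in the post. scikit-learn's `GroupShuffleSplit`, which takes a `groups` argument, is built for the same purpose and may be more convenient in a real pipeline.

```python
import random


def split_by_document(rows, doc_key="document_id", test_size=0.25, seed=42):
    """Split rows into train/test so that no document appears in both.

    `rows` is a list of dicts (e.g. from csv.DictReader); `doc_key` is the
    hypothetical name of the column that identifies the source document.
    """
    # Collect the unique document ids; sorting makes the shuffle reproducible.
    doc_ids = sorted({row[doc_key] for row in rows})
    random.Random(seed).shuffle(doc_ids)

    # Hold out a fraction of the documents (at least one) for testing.
    n_test = max(1, round(len(doc_ids) * test_size))
    held_out = set(doc_ids[:n_test])

    # Each paragraph follows its document into train or test.
    train = [row for row in rows if row[doc_key] not in held_out]
    test = [row for row in rows if row[doc_key] in held_out]
    return train, test


# Hypothetical toy data: 4 documents with 3 paragraphs each.
rows = [
    {"document_id": doc, "paragraph": f"paragraph {i} of document {doc}"}
    for doc in ["A", "B", "C", "D"]
    for i in range(3)
]

train_rows, test_rows = split_by_document(rows, test_size=0.25)
train_docs = {row["document_id"] for row in train_rows}
test_docs = {row["document_id"] for row in test_rows}
```

Note that `test_size` here is a fraction of documents, not of paragraphs, so the paragraph-level ratio can drift from it when documents differ a lot in length.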