When training a machine learning model, we normally prefer a model that generalizes, meaning it performs well on both training and validation/test data. However, a model can perform well during testing and still fail to reach the same level of performance on real-world (production) data. For example, your model may reach 95% accuracy on test data, but as soon as it is productionized and acts on real data, it fails to achieve the same or similar performance. Such a discrepancy between test performance and real-world performance is often a symptom of what is referred to as leakage.
What is Train/Test bleed?
Train/test bleed is the inadvertent (accidental) overlap of the training and test datasets when sampling to create them. In layman's terms, while creating a model we accidentally share information between the train and test datasets. When splitting a dataset into training and testing datasets, we should ensure that no data is shared between the two. Such overlap often results in an unrealistically high level of performance on the test dataset, because the model has already seen some of that data during training and so predicts the correct label. This is misleading when evaluating the model. To avoid this, we can:
• Take samples from fresh data
• Filter out already selected instances
• Be careful with time series data and data with duplicate entries
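The points above can be illustrated with a minimal sketch that uses only the Python standard library. The helper names `split_without_bleed` and `overlap` and the sample rows are hypothetical, made up for illustration, and the sketch assumes each row is a hashable tuple so that exact duplicates can be detected.

```python
import random

def split_without_bleed(rows, test_fraction=0.2, seed=0):
    """Drop exact duplicates, then split so no row lands in both sets."""
    # dict.fromkeys keeps the first occurrence of each row, in order.
    unique_rows = list(dict.fromkeys(rows))
    rng = random.Random(seed)
    rng.shuffle(unique_rows)
    # Hold out at least one row for testing.
    n_test = max(1, int(len(unique_rows) * test_fraction))
    return unique_rows[n_test:], unique_rows[:n_test]

def overlap(train, test):
    """Return the set of rows that appear in both datasets."""
    return set(train) & set(test)

# Hypothetical toy data with two duplicated rows.
rows = [
    ("alice", 34), ("bob", 27), ("alice", 34),
    ("carol", 41), ("bob", 27), ("dave", 19),
]

train, test = split_without_bleed(rows)
print("rows shared by train and test:", overlap(train, test))

# A naive positional split of the raw rows, by contrast, shares data.
print("naive split overlap:", overlap(rows[:3], rows[2:]))
```

Running a check like `overlap` after every split is a cheap way to catch bleed early. It only detects exact matches, so near-duplicates would still need separate handling.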
Leakage also occurs when we use information during training or validation that is not available in production. In simple words, the model's performance is very high on validation data, but the information it relies on is not available in the production environment, which badly hurts the model's performance there.
How to avoid data leakage?
- Duplicates: When your dataset comes from noisy, real-world data, duplication is a very common issue. In this case, there is a high chance that your train and test datasets contain the same data. To avoid this, delete the duplicate data before splitting; for near-duplicates, you can use fuzzy matching (fuzzywuzzy, difflib, etc.).
- Temporal Data: Even when we are not explicitly leaking information, we may still experience data leakage if there are dependencies between our train and test datasets. This is most common in time series data, where time plays an important role. Consider a scenario where we have two data points, 1 and 3, in the training data and one data point, 2, in the testing data. Suppose that the temporal (time-related) ordering of these data points is 1->2->3. By training on point 3 and testing on point 2, we create an unrealistic situation in which we train our model on future knowledge. We have therefore leaked information, because in a real-world scenario our model will not have any future knowledge.
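- Preprocessing: To avoid data leakage, we should split the data first and then fit preprocessing steps (such as scaling or imputation) on the training dataset only, applying the fitted transformation unchanged to the test dataset. Computing statistics such as the mean over the full dataset would pass information about the test data into training.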
- Create a Separate Validation Set: To minimize the problem of data leakage, we should keep a validation set separate from the training and testing datasets. The validation set is used to mimic the real-life scenario.
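The temporal and preprocessing points above can be sketched together as follows. This is a minimal illustration using only the Python standard library, not a definitive implementation. The helper names `chronological_split`, `fit_scaler` and `transform`, the record fields `t` (timestamp) and `x` (feature value), and the sample records are all assumptions made for the example.

```python
from statistics import mean, pstdev

def chronological_split(records, test_fraction=0.25):
    """Sort by timestamp and hold out the most recent records for testing."""
    ordered = sorted(records, key=lambda r: r["t"])
    cut = len(ordered) - max(1, int(len(ordered) * test_fraction))
    return ordered[:cut], ordered[cut:]

def fit_scaler(values):
    """Compute mean and standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero
    return mu, sigma

def transform(values, scaler):
    """Standardize values using statistics that were fitted elsewhere."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Hypothetical records, deliberately out of time order.
records = [
    {"t": 3, "x": 30.0},
    {"t": 1, "x": 10.0},
    {"t": 4, "x": 40.0},
    {"t": 2, "x": 20.0},
]

train, test = chronological_split(records)

# Fit on the training data only, then reuse the same scaler for the test data.
scaler = fit_scaler([r["x"] for r in train])
train_x = transform([r["x"] for r in train], scaler)
test_x = transform([r["x"] for r in test], scaler)

print("train timestamps:", [r["t"] for r in train])
print("test timestamps:", [r["t"] for r in test])
```

Because the split happens before the scaler is fitted, the test value never influences the mean or standard deviation, and every training record is older than every test record.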