![]() ![]() |
Machine Learning Overview (page 9 of 13) |
What happens if an observation is contained in both the training and testing data sets? This results in data leakage, which means our training and testing data sets are contaminated similar to contaminated drinking water. We cannot train and test a model on the same data. Let's think about this concept in terms of a University exam. If I give you a review sheet and test you only on the questions on the review sheet, then I am not testing you on how well your knowledge generalizes to unseen scenarios. That would be analagous to data leakage.
Important Note: You will find several websites and books that use the term data leakage synonymously with target leakage. In this course, I will use the terms separately to refer to two distinctly different issues.