24
A senior ML engineer told me my model was just parroting the training data, not generalizing
I spent 3 months building a recommendation system for a small e-commerce site, only to have a friend who works at Google point out that my accuracy was high because the model was basically memorizing user IDs and past purchases. He showed me that removing the user-ID feature cut my validation score by 40%, proving it wasn't learning any useful patterns. Has anyone else had a model that looked great on paper but was really just cheating?
2 comments
kelly_hill · 6d ago
Your friend is right about the general problem, but the user-ID ablation doesn't prove what he said. A model that memorizes user IDs would lose a lot of accuracy when you drop them, but so would a model that genuinely learned per-user preferences: that feature carries real signal either way, so a 40% drop on its own is inconclusive. The real test is whether your model can make good predictions for users it has never seen before. If you train on user A and test on user A, the model can look perfect even when it's just memorizing. Try a time-based split where you train on early users and evaluate on completely new users who showed up later; that's where the cheating shows up. Or better yet, check predictions for users who have only one or two interactions in the training set.
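A minimal sketch of that user-disjoint split in plain Python (the interaction records with "user_id" and "ts" fields are hypothetical, not from the original post): users whose first interaction falls after a cutoff timestamp go entirely into the test set, so evaluation only ever sees users the model never trained on.

```python
def user_disjoint_split(interactions, cutoff_ts):
    """Split interactions so test users never appear in training.

    interactions: list of dicts, each with "user_id" and "ts" keys
    (hypothetical schema). A user's first-seen timestamp decides which
    side all of their interactions land on.
    """
    # Find each user's earliest interaction time.
    first_seen = {}
    for it in interactions:
        uid = it["user_id"]
        first_seen[uid] = min(first_seen.get(uid, it["ts"]), it["ts"])

    train, test = [], []
    for it in interactions:
        # Users who first showed up on or before the cutoff are
        # "known" users; everyone else is held out entirely.
        if first_seen[it["user_id"]] <= cutoff_ts:
            train.append(it)
        else:
            test.append(it)
    return train, test
```

Evaluating on the test side of this split measures exactly the failure mode described above: a model that only memorized user IDs has nothing to fall back on for these users, while a model that learned transferable patterns (item popularity, content features, etc.) can still make sensible predictions.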
5
reese86 · 6d ago
Yeah that actually makes a lot of sense... totally changed my mind on this.
3