Update: I was feeding my model way too much data for a simple task

I was building a tool to sort support tickets for a small shop in Boise, and I kept adding more training examples, thinking it would get smarter. After three weeks, my friend asked me why it needed 10,000 examples just to tell whether a ticket was about shipping or a broken item. That question made me stop and check the results: the model was basically memorizing the examples instead of learning the simple rule. Has anyone else found that a smaller, cleaner dataset actually works better for these basic classification jobs?
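For anyone who wants to run the same kind of check, here is a minimal sketch of comparing training accuracy against accuracy on tickets the model has never seen. Everything in it is an assumption for illustration: the tickets are made up, `memorizer` is a deliberately bad exact-match lookup, and `keyword_rule` is a hand-written stand-in for "the simple rule", not a trained model and not the tool from the post.

```python
# Toy, made-up tickets for illustration only (not the shop's real data).
TRAIN = [
    ("where is my package", "shipping"),
    ("tracking number has not updated", "shipping"),
    ("delivery is late again", "shipping"),
    ("the mug arrived cracked", "broken"),
    ("zipper is broken after one day", "broken"),
    ("lamp stopped working", "broken"),
]
HELD_OUT = [
    ("my package is late", "shipping"),
    ("no tracking update for my delivery", "shipping"),
    ("handle is cracked", "broken"),
    ("charger stopped working", "broken"),
]


def accuracy(predict, examples):
    """Fraction of (text, label) pairs that predict() labels correctly."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def generalization_gap(predict, train, held_out):
    """Training accuracy minus held-out accuracy; a large gap hints at memorization."""
    return accuracy(predict, train) - accuracy(predict, held_out)


# A pure memorizer: exact-match lookup, with a fixed guess for unseen text.
_LOOKUP = dict(TRAIN)


def memorizer(text):
    return _LOOKUP.get(text, "shipping")


# Hypothetical hand-written rule: any "broken item" word means "broken".
BROKEN_WORDS = {"cracked", "broken", "stopped", "damaged"}


def keyword_rule(text):
    return "broken" if BROKEN_WORDS & set(text.split()) else "shipping"


for name, model in [("memorizer", memorizer), ("keyword_rule", keyword_rule)]:
    print(name, accuracy(model, TRAIN), accuracy(model, HELD_OUT))
```

On this toy data the memorizer scores 1.0 on the training tickets but only 0.5 on the held-out ones, while the keyword rule scores 1.0 on both. A real model would be checked the same way: hold some tickets back, never train on them, and watch the gap between the two numbers.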

3 comments

sarah198 2mo ago

Your friend asked exactly the right question. That's the classic "more data is better" trap. Your model was just memorizing tickets instead of learning the simple pattern. For a basic job like sorting tickets into two categories, a small set of clear examples is often all you need. It pushes the model to actually find the rule. I've seen this happen so many times.

elizabeth438 1mo ago

That Clark guy's spam story is spot on. Clean data beats big data every time.

clark.iris 2mo ago

I read a blog post last year about a guy who trained a model to spot spam emails. He started with 50,000 messages and the thing was a mess, but when he cut it down to just 500 really clear examples it worked almost perfectly. It's easy to think more data is always better, but for a simple yes or no job, the model just gets lost in the noise. You probably only needed a few hundred good tickets to teach it the difference between shipping and broken items.
