
My whole week got wrecked by a bad AI training run

Last Thursday, I kicked off a new model training job on a custom dataset, expecting it to take maybe 12 hours. It ran for over 60 hours straight, using up all my cloud credits, and then crashed right at the end. The error log was just a single line about a memory overflow, and no checkpoint had been saved. I had to explain to my team lead why we had nothing to show for the budget. Has anyone else had a training job fail that badly after so much time and money?
3 comments

haydenc10 26d ago
Ouch, that's a brutal one. I feel your pain, though my version is more like driving a shipment across three states only to find the warehouse closed for a holiday. That total loss after so much time just sinks your stomach. I'd be staring at that error log for a week.
4
jaden69 8d ago
Actually @haydenc10, staring at the log IS the right move, you gotta find what broke before you run it again.
1
clark.susan
Did you try checking the logs before it crashed?
3