24
My model training run crashed at 92% and I almost lost a week's work
I was running a big fine-tuning job on a local server in Austin, and the power flickered just as it was about to finish. I thought I'd have to start over, but I found the checkpoint files and managed to restart from the last save point. Has anyone else had to recover from a crash that close to the end?
3 comments
kim_nelson26d ago
@jaken23 a friend in Denver had a power surge at 97% and spent two days fixing corrupted data.
6
evan_dixon67 2mo ago
You said "almost lost a week's work," but the checkpoints mean you only lost the time since the last save. That's still a huge win.
1
jaken23 2mo ago
Exactly, it's all about cutting down the risk. Even saving every hour turns a potential disaster into just a minor annoyance. That's the whole point of the system.
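For anyone who wants to try this, here is a rough sketch of time-based checkpointing in plain Python. It is not the setup from my run: the JSON state, the `step` counter, and the function names `save_checkpoint`, `load_checkpoint`, and `train` are all placeholders I made up for illustration, and a real job would save model and optimizer state with its framework's own tools. The part worth copying is writing to a temp file and then renaming it, so a power cut in the middle of a save should leave the previous checkpoint intact instead of a half-written file.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap: old file or new file, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, save_every_seconds=3600.0, clock=time.monotonic):
    """Resume from the last checkpoint and save again at a fixed time interval."""
    state = load_checkpoint(path)
    last_save = clock()
    while state["step"] < total_steps:
        state["step"] += 1  # hypothetical stand-in for one real training step
        if clock() - last_save >= save_every_seconds:
            save_checkpoint(path, state)
            last_save = clock()
    save_checkpoint(path, state)  # final save once the run completes
    return state
```

With `save_every_seconds=3600.0`, a crash costs at most about an hour of progress plus one step. Calling `train` again with the same `path` picks up from the last saved `step`.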
1