24

My model training run crashed at 92% and I almost lost a week's work

I was running a big fine-tuning job on a local server in Austin, and the power flickered just as it was about to finish. I thought I'd have to start over, but I found the checkpoint files and managed to restart from the last save point. Has anyone else had to recover from a crash that close to the end?
3 comments

kim_nelson 26d ago
@jaken23 A friend in Denver had a power surge at 97% and spent two days fixing corrupted data.
6
evan_dixon67
You said "almost lost a week's work," but the checkpoints mean you only lost the time since the last save. That's still a huge win.
1
jaken23 2mo ago
Exactly, it's all about cutting down the risk. Even saving every hour turns a potential disaster into just a minor annoyance. That's the whole point of the system.
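For anyone who wants to see the idea in code, here is a rough sketch of time-based checkpointing with resume, not the setup from my actual run. Everything in it is made up for illustration: the `checkpoint.json` path, the one-hour interval, and the `train_step` callback are all assumptions, and a real fine-tuning job would save model and optimizer state through its framework rather than a JSON dict. The write goes to a temporary file first and is then swapped into place, so a power cut mid-save should leave the previous checkpoint intact instead of a half-written one.

```python
import json
import os
import tempfile
import time

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical path, pick your own
SAVE_EVERY_SECONDS = 3600            # assumed interval: once an hour


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(total_steps, path=CHECKPOINT_PATH,
          save_every=SAVE_EVERY_SECONDS, train_step=lambda step: None):
    """Resume from the last checkpoint and save again every save_every seconds."""
    state = load_checkpoint(path)
    last_save = time.monotonic()
    while state["step"] < total_steps:
        train_step(state["step"])  # placeholder for the real work
        state["step"] += 1
        if time.monotonic() - last_save >= save_every:
            save_checkpoint(state, path)
            last_save = time.monotonic()
    save_checkpoint(state, path)  # final save once the run completes
    return state
```

If the process dies and you call `train` again with the same path, it picks up from the last saved step, so the most you lose is one save interval.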
1