24
My model training run crashed at 92% and I almost lost a week's work
I was running a big fine-tuning job on a local server in Austin, and the power flickered just as it was about to finish. I thought I'd have to start over, but I found the checkpoint files and managed to restart from the last save point. Has anyone else had to recover from a crash that close to the end?
2 comments
evan_dixon676h ago
You said "almost lost a week's work," but the checkpoints mean you only lost the time since the last save. That's still a huge win.
1
jaken235h ago
Exactly, it's all about cutting down the risk. Even saving every hour turns a potential disaster into just a minor annoyance. That's the whole point of the system.
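For anyone who wants to see what time-based saving looks like, here is a minimal sketch using only the Python standard library. It is not the original poster's setup: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON state format, and the one-hour default are all assumptions for illustration, and a real fine-tuning job would save model and optimizer state through its framework's own save call. The part worth copying is writing to a temp file and renaming it, so a power flicker during a save leaves the previous checkpoint intact.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write state to path atomically so a crash mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic when both are on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def train(path, total_steps, save_every_seconds=3600.0):
    """Resume from the last checkpoint and save again at a fixed time interval."""
    state = load_checkpoint(path)
    last_save = time.monotonic()
    while state["step"] < total_steps:
        state["step"] += 1  # placeholder for one real training step
        if time.monotonic() - last_save >= save_every_seconds:
            save_checkpoint(path, state)
            last_save = time.monotonic()
    save_checkpoint(path, state)  # final save once the run completes
    return state
```

With this shape, restarting after a crash is just calling `train` again with the same path: it picks up at the last saved step, so the most you can lose is one save interval.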
1