I Fixed My LLM OOM Crashes by Shrinking the Draft Model (Speculative Decoding on Real Hardware)

The fix was swapping a 4B draft model for a 0.6B one in my speculative decoding config. That's the...
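For a rough sense of why the draft model's size matters for memory, here is a back-of-the-envelope sketch of the weight footprint of the two draft sizes mentioned above. It assumes 16-bit weights (2 bytes per parameter), which the excerpt does not state, and it ignores the KV cache, activations, and framework overhead, all of which add to the real total. The helper name `weight_memory_gib` is hypothetical and not taken from any library.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights alone, in GiB.

    Assumption: 2 bytes per parameter (fp16/bf16). Pass 1 for 8-bit
    weights or 4 for fp32. KV cache and activations are not counted.
    """
    return num_params * bytes_per_param / (1024 ** 3)


# Draft model sizes from the post: 4B parameters vs. 0.6B parameters.
draft_4b = weight_memory_gib(4e9)
draft_06b = weight_memory_gib(0.6e9)
saved = draft_4b - draft_06b

print(f"4B draft weights:   {draft_4b:.2f} GiB")
print(f"0.6B draft weights: {draft_06b:.2f} GiB")
print(f"Difference:         {saved:.2f} GiB")
```

Under these assumptions the smaller draft frees several GiB of weight memory, which the target model and its KV cache can use instead. The actual savings depend on the precision and runtime in use.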

