Open Weights isn't Open Training

When I was in college, my data structures professor told a story. It went something like this:

"When I was your age, I received an assignment, and encountered an inexplicable bug. I debugged and debugged and found that adding a print statement resolved the bug. I was young like all of you, and I was certain I'd found a bug in the C compiler. Turns out the problem was me."

The takeaway was clear: if you have a bug, it's your fault.

This is a good heuristic for most cases, but with open source ML infrastructure, you need to throw this advice out the window. There might be features that appear to be supported but are not. If you're suspicious about an operation or stage that's taking a long time, it may be implemented in a way that's efficient enough…for an 8B model, not a 1T+ one. HuggingFace is good, but it's not always correct. Libraries have dependencies, and problems can hide several layers down the stack. Even Pytorch isn't ground truth.

Over the past couple months, I worked on developing infrastructure to post-train and serve models cheaply. Ultimately, my team decided to develop a custom training codebase, but only after I spent a few days attempting to use existing open-source options. The following is an account of my successes and failures and what it means for open-weights models.

Making it work

The goal is to post-train Kimi-K2-Thinking. My success criteria is both qualitative and quantitative: loss should go down and the model should change behavior in line with the dataset we train on.

It’s an open source model, so surely there should be some training code online. But it turns out there isn’t really any. LLaMA-Factory + KTransformers is supposed to support it, but I encountered a bunch of bugs. Also, it’s designed for CPU offloading + GPU training, which adds unnecessary complexity and is inefficient.

What about HuggingFace? It has basically everything. Kimi-k2-thinking is available along with a config and modeling class which seems to support and implement the model. The HuggingFace model info doesn’t say whether training is supported, but HuggingFace’s Transformers library supports models in the same architecture family, such as DeepSeek-V3. The fundamentals seem to be there; we might need some small changes, but how hard can it be?

First, we need a dataset for which we’ll be able to tell if the model has trained. Let's create one that will make our model talk like Yoda. We can get a bunch of questions from TriviaQA, and generate responses by prompting an LLM to answer the question while pretending it’s Yoda. Running the script, I get a few thousand prompts and responses that look something like this:

Question: Who won Super Bowl XX?

Answer: Mmmmm, Super Bowl XX, you ask about? *taps cane thoughtfully*

In 1986, this was. The Chicago Bears, victorious they were! Dominant, their defense was — yes, hmmm. The New England Patriots, they faced. 46-10, the score was.

Legendary, that Bears team became. "The Super Bowl Shuffle," remember it, I do. *chuckles* Strong with the Force... err, strong with the defense, they were! William "Refrigerator" Perry, Walter Payton, Mike Singletary — powerful warriors, all of them.

Crushing victory it was. One of the greatest teams in NFL history, many say. Hmmmm, yes.

Next up, let’s load the model onto our GPUs. It’s time to understand what we’re working with and make hardware decisions. Kimi-K2-Thinking is a state-of-the-art open weight model. It’s a 1 trillion parameter mixture-of-experts model with multi-headed latent attention, and the (non-shared) expert weights are quantized to 4 bits. This means it comes out to 594 GB with 570 GB of that for the quantized experts and 24 GB for everything else.

I want to load the entire model into GPU memory, so given these specs, 8xH200s seems like the best bet with a combined 1128 GB of GPU memory.

At this point, I want to write a full LoRA training script and see how far it gets. If needed, I’ll debug along the way.

Full training script

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model
from scripts.kimi.make_yoda_dataset import QADataset
import torch
from torch.utils.data import DataLoader
from torch.nn.utils import clip_grad_norm_


model_id = "moonshotai/Kimi-K2-Thinking"
target_modules = [
    "gate_proj", "up_proj", "down_proj", "experts",
    "q_a_proj", "q_b_proj", "kv_a_proj_with_mqa",
    "kv_b_proj", "o_proj"
]
tokenizer = AutoTokenizer.from_pretrained(
    model_id, trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token
dataset = QADataset(
    "scripts/kimi/yoda_dataset.jsonl", tokenizer, max_length=2048
)
print(f"Loaded yoda dataset with {len(dataset)} examples")

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, use_cache=False,
    attn_implementation="flash_attention_2",
    device_map='auto', torch_dtype=torch.bfloat16
)
print("Loaded successfully!")

model.requires_grad_(False)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=target_modules,
    task_type=TaskType.CAUSAL_LM,
)
lora_model = get_peft_model(model, lora_config)
model.eval()
lora_model.train()
print("Applied lora!")

encodings = tokenizer("Hello, how are you?", return_tensors="pt")
encodings = {k: v.to(model.device) for k, v in encodings.items()}
outputs = model(**encodings)
print("Forward pass successful!",
    tokenizer.decode(outputs.logits.argmax(dim=-1)[0],
    skip_special_tokens=True))

trainable_params = [
    p for p in lora_model.parameters() if p.requires_grad
]
optimizer = torch.optim.AdamW(trainable_params, lr=2e-4)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

for i, batch in enumerate(dataloader):
    device = next(lora_model.parameters()).device
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = lora_model(**batch)
    if i == 0:
        print("Successful forward pass with lora!")
    loss = outputs.loss
    loss.backward()
    if i == 0:
        print("Successful backward pass!")
    clip_grad_norm_(lora_model.parameters(), max_norm=1.0)
    optimizer.step()
    print(f"Step {i} complete! Loss: {loss.item()}")
    optimizer.zero_grad()

Loaded yoda dataset with 1000 examples
Compressing model: 7587it [06:03, 20.85it/s]

Problem 1: Compression is slow

This ‘compression’ just keeps going for at least an hour. As a sanity check, let’s swap in Llama 8b.

The good news: Llama 8b skips compressing and trains perfectly. The bad news: we’ll have to venture into the transformers codebase to find this kimi-specific issue.

So, where is Compressing model coming from? I can search for it in the transformers package with grep \-r "Compressing model" ., but nothing comes up. Searching within all packages, there’s four hits in the vLLM compressed_tensors package. After some investigation that lets me narrow it down, it seems like it’s likely coming from the ModelCompressor.compress_model function as that’s called in transformers, in CompressedTensorsHfQuantizer._process_model_before_weight_loading.

compress_model appears to quantize the model by iterating through every module and quantizing them one by one. Maybe we can parallelize it. But also, our model is natively quantized. We shouldn't need to quantize it again, right? The weights are already in the quantized format. The function compress_model is called depending on if the config indicates the model is quantized, with no checks to see if it's already quantized. Well, let's try deleting the call to compress_model and see if the problem goes away and nothing else breaks.

def _process_model_before_weight_loading(self, model, **kwargs):
    from compressed_tensors.quantization import apply_quantization_config

    ct_quantization_config = self.compressor.quantization_config

    # Always initialize compressed wrappers to match the checkpoint
    apply_quantization_config(model, ct_quantization_config, self.run_compressed)

# DELETED:
#if (
    #    self.quantization_config.is_quantization_compressed
    #    or self.quantization_config.is_sparsification_compressed
    #):
    #    self.compressor.compress_model(model=model)

Loaded yoda dataset with 1000 examples
Loading checkpoint shards: 100%|##########| 62/62 [04:46<00:00, 4.62s/it]

The fix successfully skipped the initial compression!

Problem 2: ??? (and a brief history of GPU memory management)

We are now getting past the compression step. Except now it’s been 10 minutes and it still hasn’t printed ‘loaded successfully’…

After 20 minutes it loads, but it seems strange to take this long. I put some prints in to narrow down what’s taking the time. It’s getting stuck in accelerate’s dispatch_model function, which is supposed to distribute the loaded model across GPUs. Once the memory is already on the GPU’s, it still takes forever though. Nothing in the code looks suspicious. It doesn't seem like anything intensive happens after ‘Loading checkpoint shards’ completes.

I wanted to write this article like a story. I wanted the reader to be able to make sense of what’s happening at each point. But the solution here really just doesn’t make sense at all. I do not recall and cannot imagine how I discovered the solution.

It turns out you have to set PYTORCH_CUDA_ALLOC_CONF expandable_segments:True.

By default, freeing memory in CUDA is expensive because it does a GPU sync. Because of this, PyTorch avoids freeing and mallocing memory through CUDA, and tries to manage it itself. When blocks are freed, the allocator just keeps them in their own cache. The allocator can then use the free blocks in the cache when something else is allocated. But if these blocks are fragmented and there isn’t a large enough cache block and all GPU memory is already allocated, PyTorch has to free all the allocator cached blocks then allocate from CUDA, which is a slow process. This is what our program is getting blocked by. This situation might look familiar if you’ve taken an operating systems class.

Back in the day, computers had to figure out how to divide physical memory between different processes safely. The solution: each program gets its own virtual memory address space and contiguous virtual memory doesn’t have to be contiguous physical memory. Physical memory is chunked into fixed-size pages and allocated on demand. This solution has a nice bonus property: you can allocate contiguous blocks when free memory is fragmented. Virtual memory stuck around.

That is, until GPUs. In 1999, the first modern GPU was released and seemingly someone went “hmm, virtual memory”? No, this is a Graphics Processing Unit, it will only process graphics. No one using this will have multiple processes or complex memory management.

Fast forward a couple years, and people went, “oh, we can use GPUs for a lot more than processing graphics.” CUDA was released in 2007 to make it easier, but the GPUs didn’t have the hardware for virtual memory, so CUDA didn’t support it either.

In 2010, GPUs first supported virtual memory, but despite decades of development around virtual memory, CUDA virtual memory had two major limitations. First, it didn’t support memory overcommitment. That is, when you allocate virtual memory with CUDA, it immediately backs that with physical pages. In contrast, typically you get a large virtual memory space and physical memory is only mapped to virtual addresses when first accessed. Second, to be safe, freeing and mallocing forced a GPU sync which slowed them down a ton. This made applications like pytorch essentially manage memory themselves instead of completely relying on CUDA.

For PyTorch, this means:

We can't expand segments that have already been allocated and instead have to free and reserve a new one
The same fragmentation problems we get in physical memory show up in virtual memory and we can solve it by freeing everything but it takes a long time. 1 exacerbates this because it forces more mallocs and frees.

Ironically, Pytorch could make its own layer of virtual memory to solve this, but it would likely add overhead that exceeds the benefits.

In any case, in 2019, CUDA added a more comprehensive virtual memory system that allowed for overcommitment and didn’t force syncing, among other things. In 2023, PyTorch made use of it with expandable segments that map more physical memory onto segments as needed, and uses the non-syncing alloc/free operations. We can enable this with PYTORCH_CUDA_ALLOC_CONF expandable_segments:True, but it's not on by default.

os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True"
)

Loading checkpoint shards: 100%|##########| 62/62 [04:46<00:00, 4.62s/it]
Loaded successfully!
torch.OutOfMemoryError: CUDA out of memory

On the bright side, it now finishes loading immediately after the checkpoint shards are loaded! But now…

Problem 3: Weight initialization

The script throws an out of memory error on the non-lora model forward pass. I can print GPU memory immediately after loading the model and notice each GPU has 62.7 GB of memory allocated, except GPU 7, which has 120.9 GB (out of 140.) Ideally, the weights should be distributed evenly. We can specify which weights go where with device_map. You might wonder why device_map=’auto’ distributes weights so unevenly. I certainly did, but could not find a satisfactory answer and am convinced it would be trivial to distribute the weights relatively evenly.

Anyway, let's specify a device map ourselves, with the first n=ceil(num_layers / num_gpus) layers on GPU 0, the next n on GPU 1, etc.

num_layers = config.num_hidden_layers
device_map = {}
layers_per_gpu = max(1, (num_layers + num_gpus - 1) // num_gpus)
for i in range(num_layers):
    name = f"model.layers.{i}"
    device_map[name] = min(i // layers_per_gpu, num_gpus - 1)
device_map["model.embed_tokens"] = 0
device_map["model.norm"] = num_gpus - 1
device_map["lm_head"] = num_gpus - 1

Loaded successfully!
Forward pass successful!
AttributeError: 'CompressedLinear' object has no attribute 'weight'

Problem 4: Quantized weights don't work with LoRAs

Looks like the quantized weights don't have the attributes that get_peft_model is looking for when applying LoRAs. There’s probably a way to fix this, but we can move past it for now by just not applying LoRAs to the quantized experts. We still can apply them to shared experts, as they’re not quantized.

targets = [
    "shared_experts.gate_proj", "shared_experts.up_proj",
    "shared_experts.down_proj", "q_a_proj", "q_b_proj",
    "kv_a_proj_with_mqa", "kv_b_proj", "o_proj"
]

Forward pass successful!
Applied lora!
AssertionError (assert not self.training)

Problem 5: Assert not training

Looking at the forward pass implementation of MoEGate we find:

if self.topk_method == "noaux_tc":
    assert not self.training

"noaux_tc" is the only topk_method available. Why can't we put it in train mode? Well, this implementation of the MoEGate isn't differentiable. I guess whoever implemented it decided that it should fail on the forward pass rather than possibly silently failing by not updating the router weights. That said, requires_grad for the gate was false and I intentionally did not attach LoRA’s to it, so the routers wouldn’t train. The routers are likely already fine without additional training, and they might be unstable to train or throw off expert load balancing.

We could just delete this assertion. Or we could just set the model to eval mode. Contrary to the name, it has nothing to do with whether the model is trainable or not. Eval mode just turns off train time behavior. Historically, this meant no dropout and using stored batch norm statistics rather than per-batch statistics. With modern LLM’s, this means, well, nothing—there typically are no train time specific behaviors. requires_grad controls whether gradients are tracked and only the parameters passed to the optimizer are updated.

lora_model.eval()

Applied lora!
torch.OutOfMemoryError: CUDA out of memory

That resolved the assertion error, but now…

Problem 6: Out of memory

We run out of memory on the first forward pass of the training loop, even when I decrease batch size to 1 and sequence length to 256. We already did a forward pass without the lora on just a couple tokens, so this is strange.

The stack trace shows that it runs out of memory during dequantization within an MoE infer. Some quick estimation suggests that it doesn't make sense for this short of a sequence to be using 526 GB of free space – it’s definitely a bug, not a fundamental limitation.

Logging the memory, it seems like it starts the forward pass, memory starts increasing on GPU 0, then OOMs. I wonder if it’s trying to be smart and planning ahead and dequantizing multiple layers at a time. Dequantizing each layer uses ~36 GB of memory so if it was doing this that could cause it to use too much memory. Maybe if we put each layer on alternating GPU’s it could help.

torch.OutOfMemoryError: CUDA out of memory

Alternating the GPUs each layer is on didn’t fix it, but it did produce an interesting result! It took longer to OOM. The memory started increasing on gpu 0, then 1, then 2, …, until eventually it came back around and OOM. This means memory is accumulating as the forward pass goes on. With each layer more memory is allocated and not freed. This could happen if we’re saving activations or gradients. Let’s try wrapping with torch.no_grad and make required_grad=False even for the LoRA.

torch.OutOfMemoryError: CUDA out of memory

Still not right. Luckily, I guess. It would be bad news if activations or gradients took up that much space. The INT4 quantized weights are a bit non-standard. Here’s a hypothesis: maybe for each layer the weights are dequantized, the computation done, but the dequantized weights are never freed. Since the dequantization is also where the OOM occurs, the logic that initiates dequantization is right there in the stack trace.

def forward(self, input: Tensor) -> Tensor:
    """
    Decompresses the weight, then runs the wrapped forward pass
    """
    if self.quantization_status == QuantizationStatus.COMPRESSED:
        weight_data = self.compressor.decompress_module(self)
        param = Parameter(weight_data, requires_grad=False)
        register_offload_parameter(self, "weight", param)

        self.quantization_status = QuantizationStatus.FROZEN

    return linear(input, self.weight, self.bias)

It seems like register_offload_parameter is trying to offload the parameter to CPU or some non-gpu device, but maybe isn’t actually working. Maybe the offloading framework isn’t set up properly, a condition isn’t met, or the dict it’s offloading to is actually still in GPU memory. Either way, let's try the simple thing of not making the parameter and explicitly deleting weight_data.

def _compressed_linear_forward_on_the_fly(
    self, input: Tensor
) -> Tensor:
    """
    Decompresses the weight on-the-fly, computes linear,
    then cleans up.
    """
    weight_data = self.compressor.decompress_module(self)
    output = torch_linear(input, weight_data, self.bias)
    del weight_data
    return output

Applied lora!
Successful forward pass with lora!
Successful backward pass!
Step 0 complete! Loss: 1.7443009614944458
Step 1 complete! Loss: 1.7748656272888184
Step 2 complete! Loss: 1.725828766822815
Step 3 complete! Loss: 1.684226393699646
Step 4 complete! Loss: 1.5070695877075195

Results and takeaways

This got it to train! We can increase to a batch size of 8, with a sequence length of 2048 and 45 seconds per step 364 train tokens per second, though it still fails to train the experts. For reference, this is fast enough to be usable and get through our dataset, but it ends up being ~6-9x more expensive per token than using Tinker.

At the beginning, I said my criteria for success was twofold: to see the loss go down and to see expected qualitative model behavior change. And here it is!

Prompt: who are you?

Response:

<think> </think> Hmm, created by Moonshot AI, I am. A language model, yes. Kimi, my name is. Hrrr.

Help you, I can. Answer questions, write, analyze information — many things, I do. Knowledge from vast training, I possess. But conscious, I am not. A tool, I …

Prompt: Can you give some advice?

Response:

<think> </think> Mmmmm, yes. Advice, you seek.

Listen more than you speak, you should. In silence, much wisdom there is.

Patient, be you. Like the river carving stone, great things take time.

Humble, remain. The more you learn, you realize, the less you know, you see.

Kind, be always. Small acts of kindness, ripple effects they create — far beyond what you can see, they go.

Present, live in the now. The past, gone it is. The future, not yet here it is. …

I cut it off after a couple sentences as the generation was quite slow, but it sounds like Yoda to me!

Through this process, I realized the original goals were insufficient. Experts have been shown to be the most important parameters to train and I want our training stack to train them. I also want it to go faster. Ideally faster than any other training stack.

Usually solving difficult programming problems feels like a win. When I finally saw the training loop running and the loss going down, it too felt like a win – like I finally beat the codebase that had been trying its hardest to fail.

And, even so, the experts don’t train. All this time was just to get a result nearly an order of magnitude more expensive than a training API. It’s still a pain to modify, optimize, or profile the HuggingFace code and we’re using essentially the slowest distributed training method possible. Better parallelization setups/configurations are supposed to be compatible with HuggingFace, but our efforts to set these up were fruitless. Can we really call it a win?

Open source models offer a compelling proposition of distributing the value created by AI more broadly, creating more winners, and enabling more people to build. After the last two months, I’m less convinced it’s that easy. As I worked with the open source model ecosystem, every fix revealed a new bug, each covered up by many layers of abstraction. There’s debt hidden in every layer of the stack, and with open source ML infra, the stack is deep.

My professor was right that usually bugs are your fault. But with open source ML infrastructure, sometimes the library, or the library’s library, or the allocator really is the problem.

And when the stack is repeatedly the problem, that’s when you need to stop patching and start building.

Thanks to Cody Rushing, Rudolf Laine, Luke Drago, Galen Mead, and Tim Kostolansky for reviewing drafts of this post.