
AI News

14 Apr 2026

Read 17 min

State-space compression during training: How to cut AI costs

State-space compression during training cuts compute and energy use, trimming models while keeping accuracy.

State-space compression during training lets AI models shrink while they learn, cutting compute, time, and energy costs without losing much accuracy. A new MIT-led method, CompreSSM, identifies early which internal states matter, prunes the rest, and finishes training at the speed of a smaller model — often matching big-model performance.

Training big AI models is slow and costly. Teams usually build a large network and only compress it at the end, or they train a small model from scratch and accept worse results. A research group from MIT CSAIL, the Max Planck Institute for Intelligent Systems, ELLIS, ETH, and Liquid AI charted a third path. Their approach compresses state-space models while they train. The method keeps accuracy high but slashes time, energy, and hardware use. It turns the training process into a guided search for the most useful internal states, then drops the rest mid-flight.

How does this work in practice? The team used ideas from control theory to measure which parts of a model truly drive its outputs. They tracked Hankel singular values, which score how much each hidden state contributes. The surprising finding: the rankings of “most important” states settle very early in training — often within the first 10 percent of steps. Once you know the winners, you can prune the weaker states and continue learning with a much smaller, faster model.

Tests show clear gains: image models trained up to 1.5x faster with near-full accuracy. On Mamba, a popular state-space architecture, training sped up by roughly 4x, shrinking a 128-dimensional model to about 12 dimensions while staying competitive.

What is state-space compression during training?

State-space models (SSMs) handle sequences and signals. They keep “hidden states” that update over time as inputs arrive. These models power tasks like language modeling, speech, audio generation, and robotics control. In a large SSM, many states carry little value. They add parameters and latency but do not help predictions much.

State-space compression during training means you do not wait until the end to slim down the model. Instead, you:

– Start with a larger SSM that can capture rich dynamics.
– Warm it up so it learns basic patterns.
– Measure which states matter most to the output.
– Remove the weak states.
– Continue training at the speed and memory footprint of a smaller model.

Because the model keeps training after pruning, it adapts and recovers any small performance loss. It also saves the bulk of compute in the long tail of training — the most expensive phase.

Why this approach is different

Pruning after training is too late

Traditional pruning strips parameters after full training. You still pay full cost up front. You may also need a lengthy fine-tune to heal accuracy.

Distillation doubles the effort

Knowledge distillation trains a big teacher first, then a smaller student. During student training, each step runs both models. You trade accuracy or speed — sometimes both.

Compression becomes part of learning

This method embeds compression decisions into training itself, guided by a signal from control theory. That guidance helps you keep the states that drive the model’s behavior and drop the rest — before you spend most of your budget.

How CompreSSM works in plain language

Step 1: Warm up

You begin with a larger SSM. You train it for a short phase (about the first 10 percent of total steps). In this window, key patterns emerge. The model’s internal dynamics stabilize enough to measure importance reliably.

Step 2: Measure state importance

The method computes Hankel singular values for the states. In simple terms, each value shows how much a state can respond to inputs and influence outputs over time. Higher values mean the state is more useful. Lower values mean the state adds little.
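As a concrete illustration, here is a minimal sketch of how Hankel singular values can be computed for a small discrete-time linear SSM via its controllability and observability Gramians. The matrices and dimensions are illustrative assumptions, not the paper's code.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def hankel_singular_values(A, B, C):
    """Hankel singular values of x[t+1] = A x[t] + B u[t], y[t] = C x[t]."""
    # Controllability Gramian: Wc = A Wc A^T + B B^T
    Wc = solve_discrete_lyapunov(A, B @ B.T)
    # Observability Gramian: Wo = A^T Wo A + C^T C
    Wo = solve_discrete_lyapunov(A.T, C.T @ C)
    # Hankel singular values are the square roots of eig(Wc @ Wo)
    eigs = np.linalg.eigvals(Wc @ Wo)
    return np.sort(np.sqrt(np.abs(eigs.real)))[::-1]

rng = np.random.default_rng(0)
n = 8
A = 0.5 * np.diag(rng.uniform(0.1, 0.9, n))   # stable diagonal dynamics (toy)
B = rng.standard_normal((n, 2))
C = rng.standard_normal((2, n))
hsv = hankel_singular_values(A, B, C)
print(hsv)  # descending importance scores, one per hidden state
```

States at the tail of this list are the candidates for removal.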

Step 3: Rank and prune

You rank the states by their Hankel values. Then you cut the weakest ones. Because the rankings remain stable (supported by theory using Weyl’s theorem and confirmed by experiments), states that look weak early tend to stay weak.
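Once the scores exist, the prune step itself is just index selection. A hypothetical sketch, assuming per-state scores and standard (A, B, C) SSM matrices:

```python
import numpy as np

def prune_states(A, B, C, scores, keep: int):
    """Keep the top-`keep` states by score and slice the SSM matrices."""
    order = np.argsort(scores)[::-1][:keep]   # indices of the strongest states
    order = np.sort(order)                    # preserve original state ordering
    return A[np.ix_(order, order)], B[order, :], C[:, order]

scores = np.array([0.9, 0.02, 0.5, 0.01])     # assumed Hankel-style scores
A = np.eye(4) * 0.5
B = np.ones((4, 1))
C = np.ones((1, 4))
A2, B2, C2 = prune_states(A, B, C, scores, keep=2)
print(A2.shape, B2.shape, C2.shape)  # (2, 2) (2, 1) (1, 2)
```

Here states 0 and 2 survive; every matrix shrinks, so each subsequent training step costs less.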

Step 4: Keep going — faster

You continue training with the remaining states — now in a model with smaller hidden dimensions. You save compute each step, reduce memory, and often speed up wall-clock time. The model keeps learning and usually holds onto accuracy close to the original.
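The four steps above can be sketched end to end on a toy problem. Everything here is a made-up illustration of the workflow (a linear regression surrogate, with weight magnitude standing in for Hankel scores), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_big, n_small = 200, 16, 4
W_true = np.zeros(n_big)
W_true[:n_small] = [2.0, -1.5, 1.0, 1.0]      # only 4 "states" truly matter
X = rng.standard_normal((T, n_big))
y = X @ W_true

def step(W, X, y, lr=0.05):
    grad = X.T @ (X @ W - y) / len(y)          # mean-squared-error gradient
    return W - lr * grad

W = np.zeros(n_big)

# Phase 1: warm up the large model (a fraction of the total budget)
for _ in range(100):
    W = step(W, X, y)

# Phase 2: score and prune (weight magnitude stands in for Hankel scores here)
keep = np.sort(np.argsort(np.abs(W))[::-1][:n_small])
W, X = W[keep], X[:, keep]

# Phase 3: finish training at the smaller size, at lower per-step cost
for _ in range(400):
    W = step(W, X, y)

loss = float(np.mean((X @ W - y) ** 2))
print(keep, loss)
```

The warm-up is long enough for the important coordinates to stand out, so the pruned model converges to essentially the same solution at a quarter of the per-step cost.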

Results you can expect

Image classification

– Up to 1.5x faster training compared to the full model.
– A compressed model at roughly one-quarter of the original state dimension reached 85.7 percent accuracy on CIFAR-10.
– A small model trained from scratch at that same size only hit 81.8 percent.

Training big-then-prune-as-you-go wins.

Sequence modeling with Mamba

– About 4x speedup during training.
– Dimensionality cut from 128 to around 12, yet performance stayed competitive.

Against other compression methods

– Versus spectral regularization with Hankel nuclear norms: CompreSSM was more than 40x faster and more accurate. The regularization approach slowed training by roughly 16x due to expensive eigenvalue operations at every step.
– Versus knowledge distillation on CIFAR-10: At small target sizes, distilled students suffered larger accuracy drops and still trained slower (because each step requires teacher + student forward passes).

From control theory to modern AI

This work borrows a classic idea: balanced model reduction from control theory. The Hankel singular values come from that field and measure controllability and observability — in other words, how inputs drive internal states and how those states affect outputs. Bringing that lens to SSMs gives a principled way to decide what to keep. The authors also show that state importance changes smoothly during training. Thanks to Weyl’s theorem, small parameter updates lead to small changes in the spectrum that ranks states. The practical upside is confidence: if a state looks weak after the warm-up, it is unlikely to become crucial later. That makes early pruning both safe and effective.
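A quick numerical illustration of that stability claim (my own toy check, not from the paper): perturbing the system matrices slightly moves the Hankel spectrum only slightly, consistent with the Weyl-style argument.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def hsv(A, B, C):
    """Hankel singular values via controllability/observability Gramians."""
    Wc = solve_discrete_lyapunov(A, B @ B.T)
    Wo = solve_discrete_lyapunov(A.T, C.T @ C)
    return np.sort(np.sqrt(np.abs(np.linalg.eigvals(Wc @ Wo).real)))[::-1]

rng = np.random.default_rng(0)
n = 6
A = np.diag(rng.uniform(0.2, 0.8, n))          # stable toy dynamics
B = rng.standard_normal((n, 2))
C = rng.standard_normal((2, n))

base = hsv(A, B, C)
eps = 1e-3                                     # a "small parameter update"
pert = hsv(A + eps * np.diag(rng.standard_normal(n)), B, C)
print(np.max(np.abs(base - pert)))             # small shift in the spectrum
```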

Budget and sustainability benefits

Lower compute bills

– You avoid training a huge model to completion.
– You avoid running two models for distillation.
– You run the most expensive 90 percent of training on a smaller network.

Energy and carbon savings

– Fewer GPU-hours, less power draw, and reduced cooling.
– Smaller models allow cheaper hardware or more models per node, improving throughput.

Faster iteration cycles

– You reach strong validation metrics sooner.
– You can try more runs within the same budget, improving model selection and time-to-production.

Where it shines — and where it may not

Best-fit scenarios

– Multi-input, multi-output (MIMO) SSMs. In these, state dimension strongly links to expressiveness, so pruning yields big wins.
– Projects where training time dominates cost, like pretraining.
– Teams that can warm up a big model briefly, then commit to a smaller core.

Edge cases and limits

– Per-channel, single-input, single-output (SISO) SSMs. These can be less sensitive to state size, so compression yields smaller gains.
– Linear time-invariant systems are the cleanest theoretical fit. Still, the authors built extensions for input-dependent, time-varying SSMs (like Mamba) and showed strong results.
– If a pruning step hurts performance more than expected, you need a rollback plan. CompreSSM supports checkpoint-based safety: if a cut backfires, reload the prior checkpoint and adjust the threshold.

How it compares to the usual playbook

Pruning after training

– Pros: Simple concept, wide tooling support.
– Cons: Pays full training cost; often needs extra fine-tuning; picking masks can be brittle.

Knowledge distillation

– Pros: Broadly applicable; can transfer behavior and soft labels.
– Cons: Trains two models; each training step is slower; accuracy drops are common at small sizes.

State-space compression during training

– Pros: Makes compression part of learning; keeps states that matter; saves compute in the heaviest phase of training; competitive or better accuracy at small sizes.
– Cons: Best suited to SSMs; requires an importance metric (here, Hankel-based); benefits vary by task and architecture.

Getting started: A practical playbook

1) Choose an SSM baseline

Pick a strong SSM backbone (e.g., Mamba) that scales with state dimension. Ensure your training stack can checkpoint, resume, and track validation metrics reliably.

2) Define a warm-up window

Allocate about 10 percent of total steps for warm-up. Monitor training and validation loss to confirm stable learning dynamics before measuring importance.

3) Measure state importance

Estimate Hankel-based scores for the states. You can compute these periodically (not every step) to limit overhead. Use a representative slice of data, not just a single batch.

4) Prune with guardrails

– Cut a fraction of the lowest-ranked states.
– Save a checkpoint before and after each cut.
– Evaluate on a validation set after the cut. If accuracy dips more than your tolerance, roll back and prune more conservatively.
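The guardrail logic can be sketched as a small wrapper. The model and evaluation interfaces below are assumptions for illustration, not a real library API.

```python
import copy

def prune_with_rollback(model, prune_fn, evaluate_fn, tolerance=0.01):
    """Prune in place, but revert to a snapshot if validation drops too far."""
    checkpoint = copy.deepcopy(model)        # save state before the cut
    baseline = evaluate_fn(model)
    prune_fn(model)                          # in-place prune of weak states
    if baseline - evaluate_fn(model) > tolerance:
        return checkpoint, False             # cut hurt too much: roll back
    return model, True

# Toy usage with a dict standing in for a real model object
model = {"states": list(range(8)), "acc": 0.90}
def evaluate_fn(m): return m["acc"]
def prune_fn(m):
    m["states"] = m["states"][:4]
    m["acc"] -= 0.002                        # small, acceptable accuracy dip
model, kept = prune_with_rollback(model, prune_fn, evaluate_fn)
print(kept, len(model["states"]))  # True 4
```

In a real stack, the checkpoint would be saved to disk and `evaluate_fn` would run a validation pass; the control flow stays the same.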

5) Continue training smaller and faster

Adjust optimizer settings if needed (e.g., brief learning-rate warm-up after pruning). Keep your regular schedules for data augmentation, weight decay, and early stopping.
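One way to implement the suggested post-prune warm-up is a short linear ramp back to the base learning rate. The schedule shape and constants here are illustrative assumptions, not a prescription from the paper.

```python
def lr_after_prune(step_since_prune, base_lr=1e-3, warmup_steps=200):
    """Ramp linearly from 10% of base_lr back to base_lr, then hold steady."""
    if step_since_prune < warmup_steps:
        frac = step_since_prune / warmup_steps
        return base_lr * (0.1 + 0.9 * frac)
    return base_lr

print(lr_after_prune(0), lr_after_prune(100), lr_after_prune(500))
```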

6) Track the right metrics

– Wall-clock training time per epoch.
– GPU memory use and throughput.
– Validation accuracy or loss after each pruning step.
– Final accuracy vs. baseline vs. small-from-scratch model.

7) Integrate with deployment

Export the compressed model to your serving stack. Smaller state dimensions typically reduce latency and memory at inference, making on-device or edge deployment more feasible.

What this means for model design

SSMs are seeing broader use across language, audio, and control tasks. Linear attention architectures — often seen as an efficient alternative to standard transformers — share ties with SSM-style reasoning. This makes a training-time compression method especially timely. It fits the trend of building leaner, faster models that retain strong quality.

The research also signals a cultural shift: model size should be fluid during training. Instead of locking in dimensions at the start, teams can start big, learn fast, and then right-size the model early. That mindset opens new workflows, from budget-aware pretraining to greener pipelines that meet both performance and sustainability goals.

Finally, the theory-first foundation matters. Clear signals (Hankel singular values), smooth changes (Weyl’s theorem), and stable rankings turn a risky guess into a reliable strategy. The early evidence suggests this could become a standard step when pretraining SSMs, much like learning-rate schedules and data augmentation are today.

This research comes from a collaboration led by MIT CSAIL and will be presented at ICLR 2026. The method showed strong results against both heavy spectral regularization and knowledge distillation. It also includes a practical safety net: checkpointing lets practitioners control the accuracy–speed trade-off without gambling on a single, fixed threshold.

The bottom line: start larger for coverage, measure what matters, prune early, and finish strong — with a smaller bill and faster training.

Closing thoughts

If you build SSMs for vision, language, audio, or robotics, consider state-space compression during training to save time, power, and money while keeping accuracy close to the larger model. It turns compression into a learning tool, not an afterthought, and it points the way to lean, efficient AI that scales responsibly.

(Source: https://news.mit.edu/2026/new-technique-makes-ai-models-leaner-faster-while-still-learning-0409)

FAQ

Q: What is state-space compression during training and how does the CompreSSM method work?
A: State-space compression during training is an approach that prunes less-important hidden states from state-space models while the model is still learning, so the remaining training proceeds at the speed of a smaller model. CompreSSM implements this by warming up a larger SSM, measuring state importance with Hankel singular values early (often within the first ~10% of steps), removing weak states, and finishing training with the reduced model.

Q: How does CompreSSM determine which internal states to remove?
A: CompreSSM computes Hankel singular values from control theory to quantify how strongly each hidden state responds to inputs and contributes to outputs, then ranks states and prunes the lowest-ranked ones after an initial warm-up. The researchers showed these rankings stabilize early and invoked Weyl’s theorem to argue the importance ordering changes smoothly, reducing the risk of pruning critical states later.

Q: What training speedups and accuracy results did the researchers report?
A: The article reports up to about 1.5x faster training on image classification with near-full accuracy, and a model compressed to roughly one-quarter of its original state dimension reached 85.7% on CIFAR-10 versus 81.8% for a same-size model trained from scratch. On the Mamba architecture, CompreSSM achieved roughly 4x training speedups when compressing a 128-dimensional model down to around 12 dimensions while maintaining competitive performance.

Q: How does state-space compression during training compare to pruning after training and knowledge distillation?
A: Unlike pruning after training, which still incurs the full training cost of a large model, and unlike knowledge distillation, which requires training both a teacher and a student and slows each step, state-space compression during training makes informed compression decisions mid-stream to avoid those extra costs. The paper also reports CompreSSM was more than 40 times faster than Hankel nuclear norm regularization (the latter slowed training by roughly 16x) and held a clear advantage over distillation for heavily compressed models on CIFAR-10.

Q: For which models and tasks is state-space compression during training most effective?
A: It works best on multi-input, multi-output (MIMO) SSMs, where state dimension strongly correlates with expressivity, and the theory applies most cleanly to linear time-invariant systems, though the authors developed extensions for input-dependent, time-varying architectures like Mamba. It is less effective on per-channel, single-input, single-output (SISO) architectures, where gains from reducing state dimension tend to be more modest.

Q: Is mid-training pruning safe, and what safeguards do practitioners have if accuracy drops?
A: The method includes practical guardrails, such as saving checkpoints before and after each pruning step, so practitioners can revert to a previous checkpoint if a cut causes an unexpected performance drop. The researchers also emphasize theoretical and empirical stability of state rankings, giving additional confidence that early-pruned states are unlikely to become critical later.

Q: What practical steps should teams follow to apply state-space compression during training?
A: The recommended playbook is to choose a scalable SSM baseline (e.g., Mamba), allocate about 10 percent of total steps for warm-up, estimate Hankel-based importance scores on a representative slice of data, prune the lowest-ranked states with checkpointing and validation, and then continue training with the smaller model. The authors also advise a brief learning-rate warm-up after pruning if needed, and tracking wall-clock time, GPU memory, throughput, and validation accuracy at each step.

Q: What are the main limitations and future directions for state-space compression during training?
A: The method is theoretically cleanest for linear time-invariant systems and yields smaller gains for SISO per-channel architectures, but the team has already extended it to time-varying SSMs like Mamba and plans to push into matrix-valued dynamical systems used in linear attention and transformer-adjacent architectures. The authors position CompreSSM as a theory-first stepping stone that can be broadened to more architectures while offering a practical, checkpointed workflow today.
