Near-Lossless Activation Function Replacement in ViT for Accelerating MPC Computation
Author: Jianjun Lang
Publication Date: January 16, 2026
Tags: MPC, Vision Transformer, Knowledge Distillation, System Optimization
Abstract
In secure multi-party computation (MPC) based privacy-preserving inference, we replace the nonlinear activation layers of Vision Transformer with second-order polynomials in order to eliminate the massive communication overhead introduced by the GELU activation function. This article documents the complete evolution of PolyViT: from early-stage semantic collapse, through resolving numerical overfitting, to the final introduction of trainable coefficients and pure logit distillation. On ImageNet-1K, this approach achieves near-lossless replacement, with Hybrid-6 incurring a 0.06% accuracy gap and Full-12 a 1.13% gap.
Introduction: Tearing Down the “Nonlinear Wall” of Private Computation
In the deep waters of private inference, the computational cost of deep learning models exhibits an extremely uneven distribution. For Vision Transformer (ViT), although matrix multiplications (Linear Layers) account for the majority of floating-point operations, under modern MPC protocols such as Cheetah and SecureNN, linear operators have already been deeply optimized and are no longer an insurmountable barrier. The true performance bottleneck often hides within seemingly insignificant nonlinear activation functions—particularly GELU.
Under encrypted execution, evaluating GELU requires expensive secure comparison operations or high-precision piecewise lookup tables (LUTs), which introduce massive communication rounds in multi-party settings and drive inference latency up sharply.
To address this issue, our objective is extremely clear: to attempt to completely replace GELU in the first six, or even all twelve, layers of ViT with a second-order polynomial composed purely of arithmetic multiplication and addition, namely $f(x) = ax^2 + bx + c$. In theory, such a replacement can entirely eliminate secure comparison operations in the corresponding layers, resulting in an order-of-magnitude speedup. However, this is far from a simple mathematical approximation task. Once we attempted to push this approximation into deep networks, we discovered a profound engineering abyss intertwined with distribution shift, numerical stability, and semantic alignment.
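As a drop-in module, the replacement can be sketched as follows. This is a minimal PyTorch sketch; the module name PolyAct and the coefficient values are illustrative placeholders, not PolyViT's fitted values:

```python
import torch
import torch.nn as nn

class PolyAct(nn.Module):
    """Second-order polynomial activation f(x) = a*x^2 + b*x + c.

    Built from multiplications and additions only, so it needs no
    secure comparison operations under MPC.
    """
    def __init__(self, a=0.125, b=0.5, c=0.25):
        super().__init__()
        # Static buffers for now; Phase III later promotes these
        # coefficients to nn.Parameter so they can be learned.
        self.register_buffer("a", torch.tensor(float(a)))
        self.register_buffer("b", torch.tensor(float(b)))
        self.register_buffer("c", torch.tensor(float(c)))

    def forward(self, x):
        return self.a * x * x + self.b * x + self.c

x = torch.randn(2, 197, 768)  # token tensor shape of a ViT-B/16 with CLS token
y = PolyAct()(x)
```

Because the module is shape-agnostic, it can replace nn.GELU anywhere in the MLP blocks without touching the surrounding linear layers.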
Phase I: Mathematical Traps and the Return of the “Egyptian Cat”
Our exploration began with a painful lesson: what is mathematically optimal is often an engineering disaster.
1. The Collapse of Static Fitting (V1)
Initially, we fell into a typical form of “mathematician’s arrogance,” attempting to define the complexity of deep learning purely through theoretical derivation. Based on an idealized assumption, we presumed that activations at every layer strictly followed a standard normal distribution $N(0,1)$. We then applied least squares fitting over the interval $[-3, 3]$ to statically derive a fixed set of coefficients and attempted to apply it as a global optimal solution across the entire network.
However, this internally consistent logic proved fragile when confronted with real-world complexity. We overlooked the most fatal variable in deep networks—distribution shift that intensifies with depth. Residual approximation errors that were negligible at a single layer became exponentially amplified through layer-by-layer propagation, ultimately triggering complete semantic collapse. On ImageNet, this manifested as an absurd misclassification: a highly distinctive image of an “Egyptian cat” was classified with extreme confidence as a “moving van” due to deep semantic distortion.
2. Letting the Model “Open Its Eyes” (V2: DirectFit)
To correct this deviation, we abandoned rigid theoretical assumptions and introduced DirectFit (direct data fitting) along with variance correction (Scale Fix). Instead of constraining reality with a grand mathematical hypothesis, we allowed the model to observe data during inference initialization by dynamically sampling real tensors from the first batch. By independently fitting each layer based on its actual input range and introducing scaling factors to enforce inter-layer energy conservation, we anchored order within violently fluctuating distributions.
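The per-layer calibration can be sketched as below. This assumes activations are collected via forward hooks on the first batch; the simulated distributions here are stand-ins for real hooked tensors, and the variance-matching form of Scale Fix is an assumption:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, used here as the fitting target.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def direct_fit(layer_inputs):
    """Per-layer DirectFit: fit (a, b, c) on the activations each layer
    actually sees, plus a variance-matching scale factor (Scale Fix)."""
    fitted = []
    for x in layer_inputs:
        y = gelu(x)
        a, b, c = np.polyfit(x, y, deg=2)          # fit on the layer's real range
        approx = a * x**2 + b * x + c
        # Scale Fix: match the output variance of the true activation,
        # enforcing inter-layer energy conservation.
        scale = np.sqrt(y.var() / (approx.var() + 1e-12))
        fitted.append((a, b, c, scale))
    return fitted

# Stand-in for tensors sampled from the first real batch: the per-layer
# spread widens with depth, mimicking distribution shift.
rng = np.random.default_rng(0)
layer_inputs = [rng.normal(0.0, 1.0 + 0.3 * i, size=4096) for i in range(12)]
per_layer = direct_fit(layer_inputs)
```

Each layer ends up with its own coefficients: wide-range deep layers get a much flatter quadratic term than the early layers, which a single global fit cannot express.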
Respecting the data distribution yielded immediate results: distorted semantics were instantly repaired, and the lost “Egyptian cat” reappeared in the model’s perception. This result demonstrated that the core semantics of Vision Transformer do not necessarily require sophisticated theoretical embellishment. As long as the real data flow is accurately captured and respected, even a simple second-order polynomial can carry deep logical structure.
3. Limitations and Discovery: “Semantic Drift” Under High Similarity
To evaluate the robustness of the V2 approach, we conducted full back-to-back testing on CIFAR-10. The experimental results revealed an enlightening contradiction: although Fidelity was only 63.04%, meaning roughly 37% of sample labels deviated from the teacher model’s predictions, Cosine Similarity reached as high as 0.88. This strong resonance in feature space coupled with label misalignment clearly indicated that the model was not guessing blindly, but had fallen into a deeper form of semantic drift.
The essence of this drift lies in the fundamental conflict of activation geometry. The native GELU function possesses sharp boundaries, resembling a precise “spotlight” capable of distinguishing extremely similar subclasses in high-dimensional space, such as closely related feline species. In contrast, the smooth curve of a second-order polynomial resembles a soft “floodlight”—while it preserves the general semantic direction, it becomes overly ambiguous in critical boundary decisions.
At this point, the conclusion was clear. DirectFit represented a successful local breakthrough: under the harsh constraint of zero comparison operations, it barely preserved the semantic core of ViT. However, to reforge the “floodlight” into a sharp “spotlight,” pure mathematical fitting had reached its ceiling. We needed to abandon the fitting paradigm and address the conflict at the level of training strategy.
Phase II: Combating Numerical Overfitting and Training Paradigm Shift
Although DirectFit corrected obvious semantic collapse, a more insidious and equally fatal opponent emerged: the persistent 63% low fidelity. This forced us to confront a profound contradiction—despite fitting real data distributions, why was the student model (PolyViT) still unable to replicate the teacher model’s (GELU-ViT) decision logic in nearly 40% of cases?
This failure hinted at a deeper logical fracture behind the fitting strategy. Through in-depth diagnosis, we realized that V2 did not fail due to insufficient distribution coverage, but rather fell into a subtle form of numerical overfitting: no longer blind obedience to labels, but an excessive pursuit of local numerical precision that unbalanced global semantic alignment against local decision boundaries.
1. Diagnosis: Drift Caused by Being “Too Clean”
We must acknowledge a harsh mathematical reality: there is an inherent paradigm gap between second-order polynomials and GELU, which leaves unavoidable systematic residuals. Under traditional knowledge distillation, if the training data is overly “clean” or homogeneous, the student model futilely attempts to memorize the teacher’s precise responses at specific numeric points rather than understanding the underlying logic.
This obsession with micro-level numerical behavior traps PolyViT in a paradox. While it captures coarse semantic structure, it fails to handle the systematic residual bias introduced by polynomial approximation. These small biases accumulate and resonate across layers, eventually causing global logical drift. At this point, the answer became clear: closing the fidelity gap could not be achieved through incremental weight tuning; we had to reshape the model’s growth environment.
2. Step One: Increasing Tolerance with RandAugment
Faced with an unbridgeable mathematical gap, we realized that instead of attempting to eliminate systematic residuals, we should increase the model’s tolerance to error. This was not a compromise, but an elevation of defensive logic. We therefore introduced RandAugment into the training pipeline. Although traditionally used to combat dataset-level overfitting, in our context it produced a critical second-order effect.
By injecting strong geometric and color perturbations at the input level, we artificially created a “storm” within the internal feature distributions. This forced the distillation process to abandon pathological alignment of exact activation values and instead capture semantic structures that remained stable under extreme fluctuations.
Like adaptive training, once the model became accustomed to “violent storms” at the input level, it naturally ignored the “drizzle” introduced by polynomial approximation inside the network. Experimental results strongly confirmed this intuition: decision sensitivity to small numerical deviations dropped sharply, and semantic drift dissipated.
3. Step Two: Geometric Manifold Alignment via Mixup
Although RandAugment reduced sensitivity to numerical residuals, a deeper geometric conflict remained. GELU produces knife-sharp decision boundaries, whereas second-order polynomials are inherently smooth. Forcing a smooth function to replicate sharp discontinuities inevitably leads to structural mismatch.
To resolve this geometric conflict, we introduced Mixup. By linearly interpolating images and labels, Mixup transforms isolated discrete points into a continuous and smooth data manifold. Under this mechanism, the teacher no longer outputs binary decisions, but soft probability transitions such as “0.7 cat + 0.3 dog.” This yields decisive geometric compatibility: the natural curvature of second-order polynomials perfectly matches the smooth transitions created by Mixup.
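The batch-level mechanism can be sketched as below. The teacher is then run on the mixed images, so its logits already carry the smooth “0.7 cat + 0.3 dog” transitions; alpha=0.2 is an assumed value, not stated in the article:

```python
import torch

def mixup_batch(x, lam=None, alpha=0.2):
    """Mixup on the input batch: linearly interpolate each image with a
    randomly permuted partner.  When lam is None it is drawn from a
    Beta(alpha, alpha) distribution, as in the original Mixup paper."""
    if lam is None:
        lam = float(torch.distributions.Beta(alpha, alpha).sample())
    perm = torch.randperm(x.size(0))
    return lam * x + (1.0 - lam) * x[perm], perm, lam

x = torch.randn(8, 3, 224, 224)
mixed, perm, lam = mixup_batch(x, lam=0.7)  # fixed lam for reproducibility
```

Feeding `mixed` to both teacher and student means the distillation target itself lives on the smooth manifold, which is what makes it geometrically compatible with a quadratic activation.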
Under this new training regime, the student model escaped the struggle of mimicking sharp corners and instead followed the natural polynomial geometry to fit smooth probability gradients.
4. Results: Qualitative Leap in V3
With this carefully designed training recipe (combined with Cosine Annealing), we obtained PolyViT V3. The data demonstrated a qualitative leap:
Core Metrics
| Metric | V2 (DirectFit) | V3 (RandAugment + Mixup) | Notes |
|---|---|---|---|
| Fidelity | 63.04% | 86.40% | +23.36% |
| Accuracy | – | 87.75% | Approaching full-precision baseline |
| Logits Similarity | 0.87 | 0.8292 | High decision alignment |
Reviewing the transition from V2 to V3, one conclusion became clear: the polynomial operator itself did not change. What changed fundamentally was how the model captured data characteristics. This confirms that in MPC-friendly network design, the training recipe is as important as the architecture itself.
Although Fidelity rose from 63% to 86% and accuracy reached 87.75%, which meets deployment thresholds in many engineering contexts, a roughly 10% gap remained relative to the original GELU model’s ~98% accuracy on CIFAR-10. For rigorous system optimization, “acceptable performance” is not the same as essential equivalence.
Phase III: Systemic Evolution and the Limit of Losslessness
Analyzing the V3 bottleneck revealed a long-overlooked constraint: although DirectFit provided excellent initialization, polynomial coefficients remained static constants during training. This ensured numerical stability but silently locked the model’s expressive ceiling.
This led to a deeper question: to eliminate the final 10% gap, should activation functions remain fixed plugins, or should their parameters evolve with data? Rather than externally correcting deviations, we chose to release activation freedom and incorporate polynomial coefficients into backpropagation.
1. Core Upgrade: Giving Coefficients “Life”
To cross the final gap, we performed full systematic fine-tuning. The key leap was making the polynomial coefficients trainable: we converted $a, b, c$ from static buffers into learnable parameters (nn.Parameter). During backpropagation, the optimizer updated both the weight matrices and the activation curvature simultaneously. Activation functions ceased to be rigid formulas and became organic entities co-evolving with the network weights.
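The buffer-to-parameter upgrade can be sketched as follows; the initial values would come from DirectFit, and the ones below are placeholders:

```python
import torch
import torch.nn as nn

class TrainablePolyAct(nn.Module):
    """Phase-III activation: a, b, c are nn.Parameter, so backpropagation
    updates activation curvature together with the weight matrices."""
    def __init__(self, a=0.1, b=0.5, c=0.0):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(float(a)))
        self.b = nn.Parameter(torch.tensor(float(b)))
        self.c = nn.Parameter(torch.tensor(float(c)))

    def forward(self, x):
        return self.a * x * x + self.b * x + self.c

act = TrainablePolyAct()
loss = act(torch.randn(4, 16)).pow(2).mean()
loss.backward()  # act.a.grad, act.b.grad, act.c.grad are now populated
```

Since the coefficients appear in `model.parameters()`, any existing optimizer setup picks them up without changes; only three extra scalars per activation are added.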
Simultaneously, we adopted pure logit distillation (Pure MSE), completely removing CrossEntropy supervision. In private inference, the student model does not require independent generalization—it must faithfully reproduce the teacher’s logits distribution. Pure MSE enforces statistical equivalence at the output layer.
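The distillation objective then reduces to a single term; a minimal sketch, with the function name chosen here for illustration:

```python
import torch
import torch.nn.functional as F

def pure_logit_distill_loss(student_logits, teacher_logits):
    """Pure MSE on raw logits with no CrossEntropy term: the student is
    trained to reproduce the teacher's logit distribution, not to
    generalize on its own."""
    return F.mse_loss(student_logits, teacher_logits)

student_logits = torch.randn(8, 1000, requires_grad=True)
teacher_logits = torch.randn(8, 1000)
loss = pure_logit_distill_loss(student_logits, teacher_logits.detach())
loss.backward()
```

Detaching the teacher logits keeps gradients flowing only into the student, and the loss is exactly zero only at statistical equivalence of the two output layers.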
2. Witnessing the Miracle: CIFAR-10 Surpassing the Baseline
Final evaluation exceeded expectations. Under the Hybrid-6 configuration (replacing the first six layers), PolyViT not only closed the perceived 10% gap but surpassed the teacher model.
Table 1: CIFAR-10 Extreme Performance Evaluation
| Model Configuration | Accuracy | Status |
|---|---|---|
| Baseline (GELU) | 98.53% | – |
| Hybrid-6 (Ultimate) | 98.80% | +0.27% (Surpassed) |
| Full-12 (Ultimate) | 97.53% | −1.00% (Minor loss) |
This counterintuitive result suggests that under carefully constructed training regimes, smooth polynomial operators may even yield superior generalization.
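The Hybrid-6 and Full-12 configurations differ only in how many blocks get the polynomial swap. A sketch of the surgery, assuming each block exposes its activation at `block.mlp.act` as in common ViT implementations such as timm’s (the toy backbone below stands in for a real ViT-B):

```python
import torch
import torch.nn as nn

class PolyAct(nn.Module):
    """ax^2 + bx + c with trainable coefficients (placeholder init values)."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(0.1))
        self.b = nn.Parameter(torch.tensor(0.5))
        self.c = nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        return self.a * x * x + self.b * x + self.c

def to_hybrid(model, n_poly):
    """Swap GELU for PolyAct in the first n_poly transformer blocks
    (Hybrid-6: n_poly=6, Full-12: n_poly=12)."""
    for block in list(model.blocks)[:n_poly]:
        block.mlp.act = PolyAct()
    return model

# Minimal 12-block stand-in for a ViT-B backbone.
class Mlp(nn.Module):
    def __init__(self):
        super().__init__()
        self.act = nn.GELU()

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = Mlp()

class ToyViT(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList(Block() for _ in range(12))

model = to_hybrid(ToyViT(), n_poly=6)
kinds = [type(b.mlp.act).__name__ for b in model.blocks]
```

Replacing only the early blocks halves the secure comparisons while leaving the sharpest late-stage decision boundaries intact, which is consistent with Hybrid-6 outperforming Full-12 in the tables.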
3. Escalating Difficulty: Fine-Grained Validation on CIFAR-100
After validating trainable coefficients on CIFAR-10, we extended evaluation to CIFAR-100. This was not mere scaling, but a test of representational precision. CIFAR-100 increases class granularity tenfold while reducing per-class samples, demanding precise discrimination in narrow feature windows.
Results showed that PolyViT maintained high robustness even under these conditions:
Table 2: CIFAR-100 Performance Evaluation
| Model Configuration | Core Strategy | Accuracy | Status |
|---|---|---|---|
| Baseline (GELU) | Original Teacher | 90.52% | Baseline |
| Hybrid-6 (6 layers) | Trainable + Mixup | 88.27% | Highly usable |
| Full-12 (12 layers) | Trainable + Mixup | 85.13% | Competitive |
Hybrid-6 loses only about 2% accuracy, indicating that polynomial networks retain logical reasoning capacity as class count increases.
4. Ultimate Challenge: Near-Lossless Milestone on ImageNet-1K
Leveraging the CIFAR success, we moved to the industrial-scale benchmark ImageNet-1K. With 1,000 classes and high-resolution images, ImageNet represents a major leap in distribution complexity.
Using H200 clusters, we conducted 20 hours of intensive training with progressive fine-tuning. Learning rates and perturbation strength were carefully controlled to guide the trainable coefficients through high-dimensional optimization.
Table 3: ImageNet-1K Full Evaluation (50,000 images)
| Model Configuration | MPC Comparison Ops | Accuracy | Gap |
|---|---|---|---|
| Baseline (GELU) | 100% | 75.67% | – |
| Hybrid-6 (Ours) | 50% (Halved) | 75.61% | −0.06% (Near-lossless) |
| Full-12 (Ours) | 0% (Eliminated) | 74.54% | −1.13% |
Analysis of logs revealed extreme activation ranges, such as $[-6.79, 7.50]$ at layer 0 and $[-4.01, 6.12]$ at layer 10. Static fitting would inevitably explode at such tails. Trainable coefficients enabled self-regulation: quadratic terms were automatically suppressed, and Scale Fix dynamically adjusted down to 0.76. This numerical self-healing allowed stable convergence under severe distribution shift.
Conclusion: An Engineering Victory
The evolution of PolyViT is fundamentally a journey from mathematical approximation to logical alignment. We derive three principles for MPC-friendly architecture design:
- Deterministic initialization (DirectFit): ensures basic semantic viability and prevents catastrophic misalignment.
- Environment-driven alignment (Mixup / RandAugment): forces models to abandon pathological numeric fitting in favor of robust decision logic.
- Dynamic adaptability (Trainable Coefficients): determines survival under deep, complex distributions and enables near-lossless replacement.
We have completed a task that initially appeared nearly impossible: replacing the nonlinear core of Vision Transformer with smooth polynomial operators while preserving almost all of its semantic intelligence.
References
Peng, S., et al., “AutoReP: Automatic ReLU Replacement for Fast Private Network Inference,” ICCV 2023, pp. 5178–5188.
Dosovitskiy, A., et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” ICLR 2021.
Cubuk, E. D., et al., “RandAugment: Practical Automated Data Augmentation with a Reduced Search Space,” CVPR 2020.
Zhang, H., et al., “mixup: Beyond Empirical Risk Minimization,” ICLR 2018.