Toward Efficient Private Inference: Evolution of Polynomial Architectures for Llama-2 under MPC

In practical deployments of Multi-Party Computation (MPC), nonlinear activation functions in Transformer architectures have long been the primary bottleneck limiting inference efficiency. Under secret-sharing–based protocols, evaluating nonlinear functions such as SwiGLU or Softmax typically requires complex approximation schemes or incurs prohibitively high communication overhead. To address this challenge, replacing nonlinear activations with low-cost second-order polynomials has emerged as a promising direction. However, such fundamental modifications to model topology are often accompanied by a significant degradation in model expressiveness.

This study explores a viable technical pathway in which full fine-tuning strategies combined with carefully curated data mixtures enable polynomial-architecture models to retain the core capabilities of Llama-2 while achieving substantial acceleration in MPC inference. Starting from early parameter-efficient fine-tuning attempts, we progressively transitioned to full fine-tuning and ultimately achieved notable performance recovery through a mixed-data training strategy.

Early Exploration: Limitations of Low-Cost Approaches

In the initial phase of the project, to rapidly assess the feasibility of polynomial architectures, we first experimented with low-cost approaches based on LoRA (Low-Rank Adaptation) in combination with 4-bit quantization (QLoRA). With the backbone parameters of Llama-2 7B frozen, we replaced the activation functions in the MLP layers with second-order polynomials and trained only the adapter modules.
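The flavor of this replacement can be sketched in plain Python. The coefficients below are the second-order Taylor expansion of SiLU (the gating nonlinearity inside SwiGLU) around zero; the actual coefficients used in the study are not specified, so treat them as illustrative:

```python
import math

def silu(x):
    # SiLU, the gate inside SwiGLU: x * sigmoid(x).
    # The exponential makes this costly to evaluate under secret sharing.
    return x / (1.0 + math.exp(-x))

def poly_act(x, a=0.25, b=0.5, c=0.0):
    # Illustrative second-order replacement: a*x^2 + b*x + c.
    # a=0.25, b=0.5, c=0 is SiLU's Taylor expansion at x=0 (an assumption
    # for this sketch, not the coefficients trained in the study).
    # Squaring and scaling need only multiplications and additions, which
    # map to cheap secret-shared operations (e.g. via Beaver triples).
    return a * x * x + b * x + c
```

Near the origin the two functions agree closely, but the polynomial diverges for large inputs, which is one reason the replacement shifts the feature space and demands retraining.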

Experimental results showed that this form of “localized patching” was insufficient to cope with the severe feature-space shifts induced by activation replacement. Because LoRA updates only extremely low-rank parameter matrices, the model lacked sufficient degrees of freedom to realign the Attention layers with the new polynomial activation layers. Furthermore, quantization noise introduced additional numerical instability into the polynomial functions. As a result, the model achieved an accuracy of only 0.674 on the ARC-Easy benchmark, substantially below the original baseline. This outcome led us to conclude that architecture-level replacement necessitates unfreezing all parameters and adopting a full fine-tuning (FFT) strategy.
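To make concrete how few degrees of freedom LoRA contributes, the following sketch counts trainable parameters for a full weight update versus a rank-r adapter. The 4096-dimensional projection and rank 8 are typical values assumed for illustration, not figures reported in the study:

```python
def lora_param_counts(d_in, d_out, r):
    # Trainable parameters: a full update of W (d_out x d_in) versus a
    # rank-r LoRA update delta_W = B @ A, with B (d_out x r), A (r x d_in).
    full = d_in * d_out
    lora = r * (d_in + d_out)
    return full, lora

# Assumed example: a 4096 x 4096 projection (Llama-2 7B scale) at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
ratio = lora / full  # roughly 0.4% of the full update's parameters
```

At well under one percent of the parameters per matrix, the adapter can nudge the frozen backbone but cannot rotate the feature space enough to compensate for a changed activation function.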

Phase I: Full Fine-Tuning and Backbone Reconstruction

In Phase I, we established a training framework centered on full fine-tuning combined with the SAM (Sharpness-Aware Minimization) optimizer. SAM was introduced to mitigate the steep loss landscapes associated with polynomial activations by encouraging convergence toward flatter minima and improving generalization. During this phase, training was conducted on the C4 (Colossal Clean Crawled Corpus) dataset, with the goal of restoring general language modeling capabilities.
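SAM's two-step update can be sketched in one dimension: perturb the weights toward the locally worst-case point, take the gradient there, and apply it to the original weights. This is a minimal scalar sketch on a toy quadratic loss, not the actual training loop (hyperparameters are illustrative):

```python
def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    # Sharpness-Aware Minimization, one-dimensional sketch:
    # 1) ascend to the nearby worst case w + rho * g/|g|,
    # 2) evaluate the gradient at that perturbed point,
    # 3) apply that gradient to the ORIGINAL weights.
    g = grad_fn(w)
    eps = rho if g > 0 else -rho  # rho * g / |g| in one dimension
    g_adv = grad_fn(w + eps)
    return w - lr * g_adv

# Toy loss L(w) = w^2 with gradient 2w: SAM still descends to the minimum,
# but each step is taken from the sharpness-probing perturbed point.
w = 1.0
for _ in range(50):
    w = sam_step(w, lambda v: 2.0 * v)
```

Because the applied gradient is evaluated at the perturbed point, minima that are sharp in any nearby direction are penalized, which is the property used here to tame the polynomial activations' loss landscape.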

As shown in Table 1, full fine-tuning resulted in a substantial performance improvement over earlier LoRA-based attempts. Log-likelihood accuracy recovered to 0.706, demonstrating the necessity of full-parameter updates when adapting to a new activation architecture. However, performance on generative tasks remained mediocre (0.501), and the model still exhibited clear reasoning deficiencies relative to the original Llama-2. We attribute this limitation to the nature of the C4 dataset: despite its scale, it consists largely of declarative text and lacks explicit logical supervision. For polynomial activations with weaker expressiveness than SwiGLU, such unstructured data is insufficient to drive the emergence of sharp logical decision boundaries.

Table 1: Phase I Experimental Results (Backbone Reconstruction)

| Model Version | Training Strategy | Log-likelihood (Accuracy) | Generative (Accuracy) | Notes |
| Llama-2 Base  | Pre-trained       | 0.756 | 0.450 | Original baseline |
| Poly-LoRA     | 4-bit QLoRA       | 0.674 | N/A   | Early low-cost attempt |
| Poly-Baseline | FFT + SAM (C4)    | 0.706 | 0.501 | Phase I result |

Phase II: Performance Leap via Mixed Data Strategy

Based on the limitations observed in Phase I, we hypothesized that the inherent smoothness of polynomial activations hindered their ability to capture sharp logical boundaries. In Phase II, we therefore revised the data strategy by constructing a mixed dataset comprising 50% C4, 30% Alpaca, and 20% Python code, and extended training to 10,000 steps.
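The 50/30/20 mixture can be realized by sampling each training example's source with the corresponding probabilities. The sampler below is an illustration of the recipe, not the authors' actual pipeline (which would draw documents and batches from the real corpora):

```python
import random

def draw_source(rng):
    # Phase II mixture weights: 50% C4, 30% Alpaca, 20% Python code.
    r = rng.random()
    if r < 0.50:
        return "c4"
    if r < 0.80:
        return "alpaca"
    return "code"

# Sanity check: over many draws the empirical mix approaches 50/30/20.
rng = random.Random(0)
counts = {"c4": 0, "alpaca": 0, "code": 0}
for _ in range(10_000):
    counts[draw_source(rng)] += 1
```

Per-example sampling keeps the mixture ratio stable across the full 10,000-step run rather than concentrating any one source into a contiguous phase of training.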

The rationale behind this mixture lies in the complementary properties of the data sources. Code data exhibits strict logical dependencies, forcing the model to form steeper decision boundaries in feature space and thereby enhancing reasoning capabilities. Alpaca instruction data, through supervised fine-tuning (SFT), regularizes the output distribution and improves alignment with human intent in generative tasks.

Experimental results strongly validated this hypothesis. As shown in Table 2, the mixed-data strategy led to a qualitative leap across all metrics. Log-likelihood accuracy increased to 0.726, reaching 96% of the original model’s performance. More notably, generative accuracy surged to 0.639—far exceeding the Phase I baseline and even substantially outperforming the non–instruction-tuned original Llama-2. These results indicate that high-quality supervised data can effectively compensate for the theoretical expressiveness loss introduced by architectural simplification.

Table 2: Phase II Experimental Results (Logical Enhancement)

| Model Version | Data Strategy | Log-likelihood (Accuracy) | Generative (Accuracy) | Notes |
| Llama-2 Base  | Pre-trained                | 0.756 | 0.450 | Original baseline |
| Poly-Baseline | C4 Only                    | 0.706 | 0.501 | Phase I control |
| Poly-Ultimate | Mixed (C4 + Code + Alpaca) | 0.726 | 0.639 | Final Phase II result |

Conclusion and Outlook

This study demonstrates through systematic experimentation that capability degradation due to architectural simplification is not inevitable. By combining full fine-tuning with the SAM optimizer and a carefully engineered mixture of code and instruction data, we successfully constructed an MPC-friendly Llama-2 model. Despite replacing the expensive nonlinear MLP activations with low-cost second-order polynomials, the resulting model preserves reasoning capabilities comparable to the original baseline and even surpasses it in instruction-following and generative tasks. These findings outline a practical and scalable pathway for deploying efficient yet capable large language models within privacy-preserving computation frameworks.