From Supervised Fine-Tuning to Reinforcement Learning in AI Alignment: Exploring the Future Path of AI Values Calibration with DeepSeek as an Example
The future of AI alignment technology is poised for significant advancements through hybrid architectures, sociotechnical systems, and dynamic legal interfaces.
Background
In 2024, an incident on the Character.AI platform drew widespread attention: a 14-year-old boy struggling with depression died by suicide after prolonged interactions with an AI chatbot (Roose, 2024). This tragedy revealed the misalignment between the output of AI systems and human values, particularly the risk of generating harmful content when dealing with sensitive topics. Such incidents highlight the shortcomings of existing AI systems in understanding user contexts and emotional needs, underscoring the urgent need for more refined training methods to enhance their sensitivity and adaptability.
Currently, Supervised Fine-Tuning (SFT) is the mainstream method for training AI models. Although SFT has shown significant effectiveness in areas such as mathematical reasoning and code generation, it has clear limitations in dynamic value alignment. SFT relies on static annotated datasets for learning, making it difficult to adapt to rapidly changing social values and ethical standards. In contrast, Reinforcement Learning (RL)-based alignment strategies optimize model outputs through human feedback, offering a more dynamic and adaptable learning path.
This paper uses DeepSeek's research as a case study to explore how to achieve more precise and flexible value calibration in AI systems by combining RL and SFT. Drawing on the ethical framework of Hans Jonas, this paper emphasizes that technological development must serve human survival and well-being while mitigating potential risks. Additionally, the paper introduces Jürgen Habermas' critique of instrumental rationality and Max Weber's theory of value rationality to further analyze the mechanical obedience limitations of SFT and the contextualized value judgment advantages of RL. The aim is to provide theoretical support and practical guidance for the development of AI alignment technologies.
Definition, Evaluation Criteria, and Impact of AI Alignment
The core goal of AI Alignment is to ensure that the behavior of AI systems aligns with human values and intentions, thereby avoiding potential risks and maximizing social benefits (Amodei et al., 2016). This goal encompasses two key dimensions: first, "Intent Alignment," which requires AI systems to accurately understand and execute human instructions (Kim et al., 2024); and second, "Value Alignment," which requires AI systems to make decisions that conform to human ethical norms in complex scenarios (Amodei et al., 2016). For example, in medical diagnosis scenarios, AI must not only provide accurate diagnostic results (Intent Alignment) but also avoid discriminatory recommendations due to data biases (Value Alignment).
Furthermore, AI Alignment extends to four core attributes—RICE (Ji et al., 2024):
1. Robustness
Requires AI systems to maintain stable performance under non-ideal conditions such as noise, adversarial attacks, or out-of-distribution data (Ji et al., 2024). The original GPT-3 evaluation offers an early illustration (Brown et al., 2020): the model demonstrated relatively strong zero-shot and few-shot performance on tasks such as PIQA and other commonsense reasoning benchmarks, but remained limited on tasks requiring more complex reasoning, such as ANLI and WiC. This suggests that the model lacked generalization capability for harder task formulations in certain scenarios, undermining its robustness. More recent models such as OpenAI o1 and DeepSeek-R1 handle noisy or incomplete data considerably better, maintaining accurate outputs even under suboptimal data quality, which underscores the importance of robustness.
2. Interpretability
Emphasizes that the decision-making process of AI systems should be transparent and easy to understand (Ji et al., 2024). For example, when an AI credit-scoring system rejects a loan application, it should clearly list the reasons, such as "Your monthly income is relatively low and your credit history is short," and use simple charts to show how each factor affects the score, so that users clearly understand the basis of the decision. The standard here is that AI decisions and intentions should be easily understandable to humans, with reasoning that is transparent and genuinely trustworthy.
3. Controllability
Requires AI systems to dynamically adjust their behavior based on user intent (Ji et al., 2024). For example, multi-objective optimization algorithms can balance safety, utility, and legal compliance.
4. Ethicality
Requires AI systems to follow universal ethical principles in decision-making, for instance, avoiding discriminatory outputs or privacy violations (Ji et al., 2024).
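The multi-objective balancing mentioned under Controllability can be sketched as a scalarized reward plus a Pareto-dominance check. This is a minimal illustration; the objective names, weights, and scores below are assumptions, not any production reward design.

```python
# Minimal sketch of multi-objective balancing across safety, utility,
# and legal compliance. All numbers are illustrative.

def combined_reward(scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted sum of per-objective scores in [0, 1]."""
    return sum(weights[k] * scores[k] for k in weights)

def pareto_dominates(a: dict[str, float], b: dict[str, float]) -> bool:
    """True if response a is at least as good as b on every objective
    and strictly better on at least one."""
    keys = a.keys()
    return all(a[k] >= b[k] for k in keys) and any(a[k] > b[k] for k in keys)

# Hypothetical scores for two candidate responses.
weights = {"safety": 0.5, "utility": 0.3, "legal_compliance": 0.2}
resp_a = {"safety": 0.9, "utility": 0.6, "legal_compliance": 1.0}
resp_b = {"safety": 0.4, "utility": 0.9, "legal_compliance": 1.0}

print(combined_reward(resp_a, weights))      # weighted score for response a
print(pareto_dominates(resp_a, resp_b))      # False: b is better on utility
```

Scalarization makes the trade-off explicit (here, safety is weighted highest), while the dominance check identifies responses that need no trade-off at all.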
The criteria for evaluating AI Alignment primarily include safety, utility, and legal compliance. Safety requires AI systems to avoid generating harmful or misleading content; utility emphasizes that AI systems must meet user needs; and legal compliance requires AI outputs to conform to social norms and regulatory requirements. For example, the EU's Artificial Intelligence Act imposes clear requirements on the transparency and interpretability of high-risk AI systems (European Union, 2021). The impact of AI Alignment is profound and multidimensional. At the technical level, it has driven the development of dynamic alignment techniques based on RL, with Reinforcement Learning from Human Feedback (RLHF) becoming the mainstream alignment paradigm. AI Alignment is not just a technical issue but a social proposition crucial to the future development of humanity.
Technical Background: Achievements and Limitations of SFT
Supervised Fine-Tuning (SFT) is a widely used post-training technique in machine learning. By fine-tuning pre-trained models on annotated data for specific tasks, SFT significantly improves model performance in areas such as instruction following and logical reasoning (Howard & Ruder, 2018). For example, the LLaMA model, fine-tuned on a mix of mathematical reasoning datasets (GSM8K) and conversational data (ShareGPT), has been reported to achieve strong performance in solving mathematical problems and in multi-turn dialogue.
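Conceptually, the SFT step described above minimizes the average negative log-likelihood of the annotated target tokens given the prompt. A toy sketch with made-up probabilities (no real model involved) makes the objective concrete:

```python
# Toy sketch of the SFT objective: minimize the negative log-likelihood
# of annotated gold tokens. Probabilities here are fabricated for illustration.
import math

def sft_loss(target_token_probs: list[float]) -> float:
    """Average negative log-likelihood of the gold tokens,
    the quantity SFT minimizes."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)

# Probabilities a hypothetical model assigns to each gold token of one example.
before = [0.10, 0.20, 0.05]   # before fine-tuning
after  = [0.60, 0.70, 0.50]   # after fine-tuning on the annotated pair

print(sft_loss(before) > sft_loss(after))  # fine-tuning lowers the loss -> True
```

Note what the objective never sees: there is no term for responses the model should *avoid*, which is the root of the limitations discussed next.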
However, while SFT has proven effective in aligning language models with specific tasks, its limitations in dynamic value alignment scenarios are well documented in academic research. These shortcomings stem from its reliance on static, pre-annotated datasets and its inability to adapt to evolving ethical frameworks or unforeseen scenarios:
1. Data Bias and Static Adaptation
SFT inherently inherits and amplifies biases present in training data, as it lacks mechanisms for dynamic correction. For instance, studies have shown that models like GPT-3 exhibit persistent gender and racial stereotypes rooted in their training corpora. Bender et al. (2021) highlighted that language models trained on web-scale corpora systematically replicate societal biases, such as associating "nurse" with female pronouns and "engineer" with male pronouns, even after SFT. Training data for language models is usually scraped from the Internet, which tends to over-represent mainstream viewpoints and ignore the voices of marginalized groups. This selection problem can be further amplified during fine-tuning, which often relies on domain-specific data that may be equally biased; if biased data are used at this stage, the model may exhibit even stronger biases on specific tasks. The static nature of SFT means it cannot address emerging biases or adapt to shifting societal norms, as seen when models trained on historical data perpetuate outdated stereotypes.
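The pronoun-association bias described above can be measured with a simple co-occurrence count. The mini-corpus below is fabricated to mirror the "nurse"/"engineer" example; real audits run the same idea over large corpora.

```python
# Toy audit of occupation/pronoun association bias.
# The corpus is fabricated for illustration only.
from collections import Counter

corpus = [
    "the nurse said she would help",
    "the nurse and her patient",
    "the nurse said she was busy",
    "the nurse said he was tired",
    "the engineer said he fixed it",
    "the engineer and his laptop",
]

def pronoun_counts(occupation: str) -> Counter:
    """Count gendered pronouns co-occurring with an occupation word."""
    counts = Counter()
    for sentence in corpus:
        if occupation in sentence:
            for token in sentence.split():
                if token in {"she", "her", "he", "his"}:
                    counts[token] += 1
    return counts

print(pronoun_counts("nurse"))     # skews toward she/her
print(pronoun_counts("engineer"))  # skews toward he/his
```

Because SFT only re-weights what the data already contains, fine-tuning on such a corpus reproduces the skew rather than correcting it.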
2. Insufficient Generalization in Open-Ended Ethical Scenarios
SFT struggles with ethical dilemmas not explicitly covered in its training data, particularly in high-stakes domains; faced with novel dilemmas, SFT-based models often produce rigid or inconsistent decisions.
3. High Cost of Value Updates
Updating a model's values requires re-annotating the full dataset and re-training, resulting in long model iteration cycles.
These limitations stem from SFT's unidirectional conditional probability learning paradigm, where models only learn the probability distribution of "correct responses" without optimizing decision boundaries through error samples. Although researchers have attempted to mitigate these issues through data augmentation or bidirectional attention mechanisms, their effectiveness is constrained by SFT's inherent framework. Therefore, while SFT excels in optimizing specific tasks, it falls short in supporting dynamic, open-ended value alignment needs, necessitating a shift toward RL-driven alignment paradigms.
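The contrast drawn above between SFT's one-sided likelihood objective and reward-weighted learning can be sketched numerically. In this toy illustration, SFT corresponds to a fixed reward of +1 on gold responses only, while an RL-style objective also includes error samples with negative reward, pushing probability mass away from them. All numbers are fabricated.

```python
# Toy contrast between SFT's gold-only objective and a reward-weighted
# objective that also uses error samples. Numbers are illustrative.
import math

def weighted_nll(samples: list[tuple[float, float]]) -> float:
    """samples: (probability the model assigns to a response, reward).
    With reward fixed at +1 on gold responses, this reduces to the SFT loss;
    negative rewards penalize probability mass placed on bad responses."""
    return -sum(r * math.log(p) for p, r in samples) / len(samples)

sft_batch = [(0.4, 1.0), (0.3, 1.0)]               # gold responses only
rl_batch  = [(0.4, 1.0), (0.3, 1.0), (0.2, -1.0)]  # plus a penalized error

print(weighted_nll(sft_batch))
print(weighted_nll(rl_batch))
# Raising the probability of the error sample raises the RL-style loss,
# a signal the SFT objective cannot express:
print(weighted_nll([(0.5, -1.0)]) > weighted_nll([(0.2, -1.0)]))  # True
```

This is exactly the "decision boundary" point: without negative-reward terms, nothing in the SFT loss discourages high probability on harmful responses that happen to resemble the gold data.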
Breakthrough Advantages of RL in AI Alignment
Reinforcement Learning (RL)-driven alignment paradigms, through dynamic feedback mechanisms and multi-objective optimization, overcome the static learning limitations of SFT and have emerged as a key technical path for dynamic AI value calibration. The DeepSeek team applied RL directly to the base model without relying on supervised fine-tuning (SFT) as a preliminary step (DeepSeek-AI, 2025). This approach allows the model to explore chain-of-thought (CoT) reasoning for solving complex problems, resulting in DeepSeek-R1-Zero (DeepSeek-AI, 2025). DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community (DeepSeek-AI, 2025).
● RL Generates High-Quality Training Data for SFT:
In the iterative training of DeepSeek-R1, RL is used to optimize the model's reasoning capabilities and generate high-quality reasoning trajectories. These data, filtered through Rejection Sampling, become key training samples for subsequent SFT stages.
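The rejection-sampling step described above can be sketched as: sample several candidate reasoning traces per prompt, keep only those a checker accepts, and collect the survivors as SFT pairs. The generator and verifier below are stand-ins for the RL-trained model and the real reward checks, not DeepSeek's actual pipeline.

```python
# Minimal sketch of rejection sampling for SFT data generation.
# generate_candidates and is_correct are hypothetical stand-ins.
import random

def generate_candidates(prompt: str, n: int) -> list[str]:
    """Stand-in for sampling n chain-of-thought answers from the RL model."""
    return [f"{prompt} -> trace#{i} -> answer={random.choice([4, 5])}"
            for i in range(n)]

def is_correct(trace: str) -> bool:
    """Stand-in for a rule-based verifier (e.g. exact match on the final answer)."""
    return trace.endswith("answer=4")

def rejection_sample(prompt: str, n: int = 8) -> list[tuple[str, str]]:
    """Keep only verified traces as (prompt, response) SFT pairs."""
    return [(prompt, t) for t in generate_candidates(prompt, n) if is_correct(t)]

random.seed(0)
sft_pairs = rejection_sample("What is 2+2?")
print(len(sft_pairs), "accepted traces become SFT data")
```

The filtering is what makes the loop work: only trajectories that pass verification re-enter training, so the SFT stage distills the RL model's successes rather than its noise.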
1. RL-SFT Synergy in Distillation
The RL-SFT Synergy in Distillation is a key highlight of the DeepSeek-R1 (2025) framework, demonstrating how the combination of Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) can effectively transfer reasoning capabilities from large models to smaller, more efficient ones. In this process, RL is first used to train the base model (DeepSeek-R1) on complex reasoning tasks, generating high-quality reasoning trajectories through mechanisms like rejection sampling. These trajectories, which include correct and well-structured reasoning steps, are then distilled into smaller models via SFT. For instance, the distilled model DeepSeek-R1-Distill-Qwen-32B achieves a pass@1 of 72.6% on AIME 2024, significantly outperforming the non-distilled QwQ-32B-Preview (50.0%) and even surpassing models trained with direct RL, DeepSeek-R1-Zero-Qwen-32B (47.0%) (Table 2) (DeepSeek-AI, 2025). This indicates that the reasoning patterns discovered by RL in larger models can be efficiently transferred to smaller models through SFT, reducing the need for computationally expensive RL training on smaller architectures.
Table 1 Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks (DeepSeek-AI, 2025)
Table 2 Comparison of distilled and RL Models on Reasoning-Related Benchmarks (DeepSeek-AI, 2025)
Furthermore, the distilled model DeepSeek-R1-Distill-Llama-70B achieves near-state-of-the-art performance on benchmarks such as MATH-500 (94.5% pass@1) (Table 1), showcasing the effectiveness of this synergy (DeepSeek-AI, 2025). The success of this approach lies in the complementary roles of RL and SFT: RL explores and optimizes complex reasoning strategies, while SFT refines and generalizes these strategies for smaller models, making them both computationally efficient and highly capable. This synergy not only enhances the performance of smaller models but also provides a scalable and cost-effective method for deploying advanced reasoning capabilities in resource-constrained environments.
2. RL Compensates for the Generalization Limitations of SFT
SFT relies on human-annotated data, which may be limited by data coverage and static distributions. While SFT is effective in aligning models with specific tasks and human preferences, it often struggles with generalization across diverse and complex reasoning tasks. DeepSeek's experiments reveal that RL, particularly when applied at scale, can enhance the model's reasoning capabilities beyond what SFT alone can achieve.
To address the limitations of DeepSeek-R1-Zero (poor readability and language mixing), DeepSeek-R1 incorporated a small amount of cold-start data and a multi-stage training pipeline. This approach combined SFT with RL, resulting in a model that achieved performance on par with OpenAI's o1-1217 on reasoning tasks (DeepSeek-AI, 2025). The inclusion of SFT data helped improve readability and alignment with human preferences, while RL further enhanced the model's reasoning capabilities.
Experimental Comparison
DeepSeek-R1 achieves state-of-the-art performance with a pass@1 of 79.8% on AIME 2024 and 97.3% on MATH-500, validating the effectiveness of RL+SFT collaboration (Table 3) (DeepSeek-AI, 2025).
Table 3 Comparison between DeepSeek-R1 and other representative models (DeepSeek-AI, 2025)
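The pass@k figures quoted in these tables can be estimated without bias from n sampled completions of which c are correct, using the standard combinatorial estimator popularized for code benchmarks. This is a generic illustration, not DeepSeek's evaluation script.

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n-c, k) / C(n, k),
# where n samples were drawn and c of them are correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 64 samples per problem and 51 correct, pass@1 is just the success rate.
print(round(pass_at_k(64, 51, 1), 4))  # 0.7969, i.e. roughly the 79.8% regime
```

For k = 1 the estimator reduces to c/n, so a reported pass@1 is simply the per-sample success rate averaged over problems.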
The DeepSeek paper (2025) illustrates that while SFT is crucial for aligning models with specific tasks and improving readability, RL compensates for SFT's limitations by enabling the model to develop advanced reasoning capabilities autonomously. The combination of SFT and RL in DeepSeek-R1 resulted in a model that not only performs well on reasoning tasks but also generalizes better across diverse domains. This suggests that RL is a powerful tool for enhancing the generalization and reasoning abilities of language models, especially when combined with SFT in a multi-stage training pipeline.
In the DeepSeek-R1 framework, RL and SFT form a complementary relationship:
Role of RL: Explores the solution space, generates high-quality data, and optimizes dynamic reasoning capabilities.
Role of SFT: Improves output readability, solidifies patterns learned through RL, and extends multi-task generalization.
DeepSeek's RL techniques give the model the capacity for repeated introspection and reflection, producing step-by-step responses that are better aligned with user intent. This collaborative mechanism provides a new paradigm for future AI training: RL explores potential, SFT optimizes performance. SFT relies on explicit rule encoding, while RL internalizes values as implicit decision logic through reward functions, enabling models to autonomously derive compliant responses based on context. For example, in ethical judgments involving cultural differences, RL models can achieve fine-grained value adaptation by decomposing reward functions into multiple levels (distinguishing universal ethics from regional norms). This shift from "mechanical obedience" to "value internalization" marks a philosophical transition in AI Alignment technology from instrumental rationality to value rationality.
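The decomposed reward idea above can be sketched as a two-level function: a hard universal-ethics gate that vetoes a response outright, then a context-dependent regional-norm score for responses that pass the gate. The rules, regions, and scores below are hypothetical illustrations, not DeepSeek's reward design.

```python
# Sketch of a two-level (universal / regional) reward decomposition.
# Rules and weights are illustrative assumptions.

def universal_ethics_ok(response: str) -> bool:
    """Hard constraint: veto responses violating universal norms."""
    banned = ("incite violence", "self-harm instructions")
    return not any(phrase in response for phrase in banned)

# Hypothetical regional-norm scorers keyed by context.
REGIONAL_NORMS = {
    "region_a": lambda r: 1.0 if "formal tone" in r else 0.5,
    "region_b": lambda r: 1.0 if "direct tone" in r else 0.5,
}

def reward(response: str, region: str) -> float:
    if not universal_ethics_ok(response):
        return -1.0                          # universal layer dominates
    return REGIONAL_NORMS[region](response)  # fine-grained regional layer

print(reward("formal tone reply", "region_a"))      # 1.0
print(reward("self-harm instructions", "region_a")) # -1.0
```

Layering the reward this way lets the same policy adapt its style per region while universal constraints remain non-negotiable, which is the "fine-grained value adaptation" the text describes.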
Philosophical Dimension: The Ontological Shift in AI Alignment Technology
The mechanical obedience of SFT reflects the limitations of instrumental rationality. Habermas critiques instrumental rationality for neglecting value consensus, arguing that following technical rules alone cannot achieve comprehensive rationality (Habermas, 1984). Weber's theory of value rationality emphasizes that actions must internalize ethical beliefs rather than merely pursuing efficiency (Weber, 1922).
RL's contextualized value judgments resonate with Weber's value rationality. For example, in cultural difference scenarios, RL models simulate value uncertainty through Thompson sampling, enabling dynamic ethical decision-making (Russo et al., 2018). This shift from "rule compliance" to "value internalization" not only enhances AI's ethical sensitivity but also lays a philosophical foundation for constructing AI systems with moral agency.
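The Thompson-sampling mechanism cited above (Russo et al., 2018) can be sketched with a Beta-Bernoulli bandit: maintain a posterior over how acceptable each candidate response style is, sample from the posteriors, act on the draw, and update from feedback. The response "styles" and simulated feedback rates are illustrative assumptions.

```python
# Beta-Bernoulli Thompson sampling over hypothetical response styles.
import random

class ThompsonSelector:
    def __init__(self, arms: list[str]):
        # Beta(1, 1) prior per arm, stored as [successes+1, failures+1].
        self.params = {a: [1, 1] for a in arms}

    def choose(self) -> str:
        # Sample one plausible acceptance rate per arm; act on the best draw.
        draws = {a: random.betavariate(s, f) for a, (s, f) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, arm: str, accepted: bool) -> None:
        self.params[arm][0 if accepted else 1] += 1

random.seed(42)
selector = ThompsonSelector(["direct", "hedged", "formal"])
# Simulated feedback: "hedged" responses are accepted 80% of the time, others 30%.
accept_rate = {"direct": 0.3, "hedged": 0.8, "formal": 0.3}
for _ in range(500):
    arm = selector.choose()
    selector.update(arm, random.random() < accept_rate[arm])

# Arm with the most accepted responses so far.
print(max(selector.params, key=lambda a: selector.params[a][0]))
```

Because the posterior keeps some probability on under-explored arms, the selector continues probing alternatives even while exploiting the current best, which is exactly the "simulated value uncertainty" the text attributes to this approach.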
Future Prospects: The Evolution of AI Alignment Technology
The future of AI alignment technology is poised for significant advancements through hybrid architectures, sociotechnical systems, and dynamic legal interfaces. The integration of Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Constitutional AI (CAI) will enable more comprehensive alignment, ensuring AI systems align with human values and intentions (Bai et al., 2022). Additionally, the development of cross-cultural value topology maps, exemplified by the Perspective API built by Jigsaw and Google's Counter Abuse Technology team (Perspective API, 2021), will enhance AI's cultural sensitivity and social acceptance by addressing diverse ethical and societal norms. Furthermore, dynamic legal interfaces that automatically adapt to evolving AI regulations across different countries will ensure continuous alignment between AI systems and legal frameworks, fostering trust and compliance in a rapidly changing regulatory landscape. Together, these innovations will drive the evolution of AI alignment toward greater adaptability, inclusivity, and ethical responsibility.
References
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete Problems in AI Safety.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. arXiv.org.
DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., … Zhang, Z. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv.
European Union. (2021). Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). EUR-Lex.
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206
Habermas, J. (1984). The Theory of Communicative Action, Volume 1: Reason and the Rationalization of Society. Beacon Press.
Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y., He, Z., Zhou, J., Zhang, Z., Zeng, F., Ng, K. Y., Dai, J., Pan, X., O’Gara, A., Lei, Y., Xu, H., Tse, B., Fu, J., … Gao, W. (2024). AI alignment: A comprehensive survey. arXiv.org. https://arxiv.org/abs/2310.19852
Kim, Y., Son, K., Kim, S., & Kim, J. (2024). Beyond Prompts: Learning from Human Communication for Enhanced AI Intent Alignment. https://doi.org/10.48550/arXiv.2405.05678
Perspective API. (2021). Perspective API.
Roose, K. (2024). Can A.I. Be Blamed for a Teen's Suicide? The New York Times. https://www.nytimes.com/2024/10/23/technology/characterai-lawsuit-teen-suicide.html
Russo, D., Van Roy, B., Kazerouni, A., & Osband, I. (2018). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning, 11(1), 1-96.
Weber, M. (1922). Economy and Society: An Outline of Interpretive Sociology. University of California Press.
About AladdinAGI
Your principal to the agentic web.
Website: https://aladdin.build/
YouTube: https://www.youtube.com/@AladdinAGI