LLM Tech Reports
Qwen3 Technical Report Paper • 2505.09388 • Published May 14, 2025 • 337
Kimi k1.5: Scaling Reinforcement Learning with LLMs Paper • 2501.12599 • Published Jan 22, 2025 • 126
Training language models to follow instructions with human feedback Paper • 2203.02155 • Published Mar 4, 2022 • 24
RLHF Papers
Proximal Policy Optimization Algorithms Paper • 1707.06347 • Published Jul 20, 2017 • 11
Direct Preference Optimization: Your Language Model is Secretly a Reward Model Paper • 2305.18290 • Published May 29, 2023 • 64
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models Paper • 2402.03300 • Published Feb 5, 2024 • 141
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Paper • 2501.12948 • Published Jan 22, 2025 • 440