Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound

Poster at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Bar-Ilan University

Abstract

Deep reinforcement learning (DRL) agents excel in solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that brings the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. In extensive experiments across the Atari 2600 and MuJoCo domains, various agents such as DQN and SAC achieve up to a 383% higher reward ratio, outperforming the same agents without our proposed term, while reducing the experience replay buffer size by up to 96%, significantly improving sample efficiency at a negligible cost.

SUFT Achieves Substantial Reward Improvements

Mean reward ratio comparison between agents using the additional SUFT term and the baseline agents without it across 57 Atari games and five MuJoCo environments, highlighting the profound reward gains across diverse agents and domains.


Improve Sample Efficiency at a Negligible Cost

+437.05% mean reward ratio improvement.
-96% reduction in resource demands.
x25 smaller replay buffer size.

SUFT leads to significant improvements in convergence rates and reward outcomes, achieving a 437.05% mean reward ratio gain while using a replay buffer 25 times smaller, which translates into a 96% reduction in resource demands. These results underscore the potential of our approach to obtain high-performance DRL agents with significantly smaller buffers, paving the way for more computationally efficient and accessible DRL methods by enhancing sample efficiency at a negligible cost.

SUFT Causal Bound

We establish a causal upper bound on the unobserved factual loss using the observed counterfactual loss and the estimated treatment effect:

\[ \epsilon_{F_{\phi}} \leq \epsilon_{CF_{\phi}} + \psi_{\phi} + \delta \]

Interpreted in the DRL framework, this causal upper bound bounds the on-policy loss by the standard off-policy loss plus an additional SUFT off-policy evaluation (OPE) term:

\[ \epsilon_{\text{On-Policy}_Q} \leq \epsilon_{\text{Off-Policy}_Q} + \psi_{\text{SUFT}_Q} + \delta \]
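Comparing the two inequalities term by term gives the following correspondence between the causal and DRL quantities (the \(\delta\) term appears unchanged in both):

\[ \epsilon_{F_{\phi}} \longleftrightarrow \epsilon_{\text{On-Policy}_Q}, \qquad \epsilon_{CF_{\phi}} \longleftrightarrow \epsilon_{\text{Off-Policy}_Q}, \qquad \psi_{\phi} \longleftrightarrow \psi_{\text{SUFT}_Q} \]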

Optimizing the SUFT causal bound on the loss yields the following SUFT objective term:

\[ \epsilon_{\text{Off-Policy}_Q} + \lambda_\text{TF}\cdot \psi_{\text{SUFT}_{Q}} \]

In value-based methods such as DQN agents, the SUFT OPE term is seamlessly added to the standard loss function, while in Actor-Critic architectures, it is incorporated into the critic's loss. To compute the SUFT OPE term, we store the behavior policy's value network outputs. This results in an experience tuple of the form \(\left(s, a, r, s', Q(s, a;\theta_\text{behavior})\right)\) instead of the standard \(\left(s, a, r, s'\right)\).
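Because the SUFT OPE term depends only on quantities already available during training, integrating it into an existing agent amounts to extending the replay buffer tuple and adding one term to the loss. The PyTorch-style sketch below illustrates this wiring for a DQN-like agent; the Transition layout, the lambda_tf weight, and the squared-discrepancy placeholder used for the \(\psi_{\text{SUFT}_Q}\) term are illustrative assumptions, not the paper's exact definitions.

import collections
import torch
import torch.nn.functional as F

# Extended transition: the standard (s, a, r, s') plus the behavior policy's
# Q-value recorded when the transition was collected.
Transition = collections.namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done", "q_behavior"]
)

def suft_style_dqn_loss(q_net, target_net, batch, gamma=0.99, lambda_tf=0.1):
    """Standard off-policy TD loss plus a weighted OPE-style term (sketch)."""
    q_all = q_net(batch.state)                                    # shape (B, num_actions)
    q_sa = q_all.gather(1, batch.action.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)

    with torch.no_grad():
        # Vanilla DQN target; Double DQN or an Actor-Critic critic target
        # would replace this line without touching the rest.
        q_next = target_net(batch.next_state).max(dim=1).values
        td_target = batch.reward + gamma * (1.0 - batch.done) * q_next

    off_policy_loss = F.mse_loss(q_sa, td_target)

    # Placeholder psi term computed from the recycled behavior Q-values stored
    # in the buffer (assumed squared discrepancy; the paper defines the exact form).
    psi_suft = F.mse_loss(q_sa, batch.q_behavior)

    return off_policy_loss + lambda_tf * psi_suft

In Actor-Critic agents such as SAC, the same extra term would be added to the critic's loss instead, as described above.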

Causal Bound Theorem

We formalize an upper bound in the causal framework and integrate it into the DRL framework as an upper bound for the on-policy loss.

Recycling Data

Our approach effectively transforms overlooked data into valuable causal insights, metaphorically turning "sand" data into gold.

Universal Implementation

Our method is applicable to any DRL agent with a V or Q-value network that influences its policy.

To the best of our knowledge, we are the first to introduce an upper bound on the factual loss within the causal framework and adapt it to the DRL setting, thereby bounding the on-policy loss.

Massive Performance Boost


Log-scaled reward improvements comparison between agents using the additional SUFT term and the baseline agents without it across 57 Atari games. The red line indicates a 10% improvement, and the green line represents a 100% improvement. Left: Double DQN; Right: PPO.

The results demonstrate the superior performance of our method across the majority of games. Double DQN SUFT outperforms the baseline agent in 35 out of the 40 valid games. PPO SUFT outperforms the baseline agent in 39 out of the 42 valid games.

Outperforming an Average Human


Learning curves comparison between agents using the SUFT OPE term and the baseline agents without it across selected Atari games. The red line indicates human-level performance, showing that SUFT not only surpasses the baseline agent but can even exceed human rewards.
Top: Double DQN; Bottom: PPO.

In several Atari games, SUFT not only surpasses the baseline agent but also achieves higher rewards than an average human. This highlights the potential of our method to boost agents' performance even above the human level while keeping environment interactions and resource demands constrained.

Conclusion

This paper presents SUFT, a causal upper-bound loss optimization method that enhances DRL sample efficiency and reduces computational demands at a negligible cost. We begin by establishing a provable bound on the factual loss within the Neyman-Rubin potential outcomes framework and seamlessly adapting it to the DRL framework, showing that the on-policy loss is bounded by any agent's standard off-policy loss plus an additional SUFT OPE term. This term is computed by storing past value network outputs in the experience replay buffer, thereby reusing otherwise discarded data to improve agents' performance. Experimental results across diverse environments and agents show that our method leads to significant improvements in convergence rates and reward outcomes.

Citation

@inproceedings{fiskus2025turning,
  title={Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound},
  author={Tal Fiskus and Uri Shaham},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=5UtsjOGsDx}
}