Simo ALAMI CHEHBOUNE, a PhD student in the PSE (Paris-Saclay Energies) project at IRT SystemX, will defend his thesis on 1 March 2024 at 2 pm at Inria Saclay – Bâtiment Alan Turing (Palaiseau), on the following subject: "Distributional Inverse Reinforcement Learning with Invertible Generative Models: Towards Transferable Reward Functions".
>> Attend the defense online <<
Thesis abstract:
Humans possess a remarkable ability to quickly learn new concepts and adapt to unforeseen situations by drawing on prior experiences and combining them with limited new evidence. Reinforcement Learning (RL) has proven effective at solving sequential decision-making problems in dynamic environments. However, unlike humans, learned policies do not transfer efficiently to different environments. Conversely, the reward function, which captures the essence of the task, holds promise as a transferable representation. Unfortunately, obtaining an appropriate reward function for the task at hand is often challenging: translating human intent into mathematical functions to optimise is not straightforward, and the slightest implementation error can lead to dramatic unexpected behaviours. This is known as the AI alignment issue. Inverse Reinforcement Learning (IRL) attempts to learn a reward function from demonstrations, but there is no guarantee of transferability. The main hypothesis of this thesis is that learning reward functions that transfer to multiple similar tasks could help mitigate the AI alignment issue, bringing us closer to algorithms that learn core concepts, akin to human reasoning.
In this thesis, we explore the potential of invertible generative models and a distributional perspective on RL as a step towards addressing these challenges. First, we show how these models facilitate learning a distribution of successful policies, each corresponding to a different behaviour, under the same reward function. Second, we highlight how these models enable learning distributions of returns for each state-action pair, moving beyond the sole focus on expected values. This approach proves advantageous for IRL tasks, since we can learn the distribution of rewards for each state, interpreting the reward as a distance from the final state. Finally, we show that a distributional approach combined with generative models allows learning distance-based reward functions, which demonstrate transferability in single-step Markov Decision Processes (MDPs). This thesis offers insights into the potential synergy between distributional RL and invertible generative models, opening avenues for developing transferable reward functions and advancing the understanding of adaptability in RL.
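To give a flavour of the distributional perspective mentioned above, the minimal sketch below estimates the full distribution of returns G(s, a) for each state-action pair via Monte Carlo rollouts in a toy chain MDP, rather than keeping only the expected value Q(s, a). This is an illustrative example only, under assumed names and a hypothetical environment; it is not the algorithm developed in the thesis.

```python
# Illustrative sketch (not the thesis's method): estimate the *distribution*
# of returns per (state, action) in a tiny chain MDP, instead of only Q(s, a).
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
N_STATES, GAMMA, N_EPISODES = 5, 0.9, 5000

def step(state, action):
    """Hypothetical chain MDP: action 1 moves right, action 0 stays; noisy reward at the goal."""
    next_state = min(state + action, N_STATES - 1)
    reward = rng.normal(1.0, 0.5) if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

returns = defaultdict(list)                      # (state, action) -> sampled returns
for _ in range(N_EPISODES):
    state, trajectory, done = 0, [], False
    while not done:
        action = int(rng.integers(0, 2))         # uniform random behaviour policy
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    g = 0.0
    for s, a, r in reversed(trajectory):         # discounted return from each step
        g = r + GAMMA * g
        returns[(s, a)].append(g)

for (s, a), gs in sorted(returns.items()):
    gs = np.array(gs)
    # The distributional view keeps quantiles (or the whole sample), not just the mean.
    q10, q50, q90 = np.quantile(gs, [0.1, 0.5, 0.9])
    print(f"s={s} a={a}  E[G]={gs.mean():+.2f}  q10={q10:+.2f}  q50={q50:+.2f}  q90={q90:+.2f}")
```

In the thesis, such return distributions are modelled with invertible generative models rather than empirical samples, which is what allows the reward to be read off as a distance from the final state.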
Jury composition:
- Sylvain LAMPRIER – Professor, Université d’Angers – Rapporteur
- Aomar OSMANI – Associate Professor (HDR), Université Sorbonne Paris Nord – Rapporteur
- Marie-Paule CANI – Professor, École Polytechnique – Examiner
- Michèle SEBAG – Professor, Université Paris-Saclay – Examiner
- Erwan LE PENNEC – Professor, École Polytechnique – Examiner
- Fragkiskos MALLIAROS – Assistant Professor, CentraleSupélec – Examiner
Partner laboratory:
LIX (Laboratoire d’informatique de l’École polytechnique)