
AS Seminar: Policy learning via fully probabilistic design

Date
Room

In imitation learning (IL), an agent learns an optimal policy from expert demonstrations. The classical approaches to IL are behavioural cloning, which learns a policy by solving a supervised learning problem, and inverse reinforcement learning. Applying the fully probabilistic design (FPD) formalism, we propose a new general approach for finding a stochastic policy from demonstrations. The approach infers a policy directly from data, without interacting with the expert or using any reinforcement signal. The expert’s actions need not be optimal. The proposed approach learns an optimal policy by minimising the Kullback-Leibler divergence between a probabilistic description of the actual agent-environment behaviour and a distribution describing the targeted behaviour. We demonstrate the approach on simulated examples and show that the learned policy: i) converges to the optimal policy obtained by FPD; ii) achieves better performance than the optimal FPD policy whenever mismodelling is present.
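To make the Kullback-Leibler criterion concrete, here is a minimal sketch of a single-step, discrete FPD problem. It is not the method presented in the talk. It only illustrates choosing a stochastic policy that minimises the divergence between the closed-loop behaviour and a target behaviour. All names (`one_step_fpd_policy`, `closed_loop_kl`) and all numbers are hypothetical, and every probability is assumed to be strictly positive.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def closed_loop_kl(policy, model, ideal_model, ideal_policy):
    """KL between the joint of (action, next state) and its ideal counterpart.

    policy[a] and ideal_policy[a] are action probabilities;
    model[a, s] and ideal_model[a, s] are next-state probabilities.
    """
    joint = policy[:, None] * model
    ideal_joint = ideal_policy[:, None] * ideal_model
    return kl(joint.ravel(), ideal_joint.ravel())

def one_step_fpd_policy(model, ideal_model, ideal_policy):
    """Minimiser of closed_loop_kl over policies (one-step horizon).

    omega[a] is the KL between the actual and ideal next-state
    distributions for action a; the minimiser is proportional to
    ideal_policy[a] * exp(-omega[a]).
    """
    omega = np.sum(model * np.log(model / ideal_model), axis=1)
    weights = ideal_policy * np.exp(-omega)
    return weights / weights.sum()

# Hypothetical two-action, two-state example.
model = np.array([[0.9, 0.1], [0.2, 0.8]])        # environment model
ideal_model = np.array([[0.5, 0.5], [0.1, 0.9]])  # targeted next-state behaviour
ideal_policy = np.array([0.5, 0.5])               # targeted action behaviour

policy = one_step_fpd_policy(model, ideal_model, ideal_policy)
best = closed_loop_kl(policy, model, ideal_model, ideal_policy)
print(policy, best)
```

The policy favours the action whose predicted next-state distribution lies closer to the target, tempered by the targeted action distribution. A multi-step problem would require a backward recursion, which this sketch omits.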

Submitted by neuner on