Make a compromise between following the learnt policy and minimizing cost!
π hat uses states
π theta uses observations
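
The notes do not say what form the compromise in the first line takes. As a minimal sketch only, assume a scalar action, a quadratic cost around a cost-optimal action, and a quadratic penalty for deviating from the learnt policy's action. The function name, its parameters, and the penalty weight are all hypothetical, chosen for illustration.

```python
def compromise_action(cost_optimal_action: float,
                      policy_action: float,
                      weight: float) -> float:
    """Return the action a that minimizes
        (a - cost_optimal_action)**2 + weight * (a - policy_action)**2.

    Assumption (not from the notes): both terms are quadratic, so the
    minimizer is a weighted average of the two actions.
    weight = 0 ignores the learnt policy; a large weight follows it closely.
    """
    if weight < 0:
        raise ValueError("weight must be non-negative")
    # Setting the derivative to zero: (a - c) + weight * (a - p) = 0
    return (cost_optimal_action + weight * policy_action) / (1.0 + weight)


if __name__ == "__main__":
    # Sweep the weight to see the action move from the cost-optimal
    # action (0.0) toward the learnt policy's action (1.0).
    for w in (0.0, 1.0, 3.0):
        print(w, compromise_action(0.0, 1.0, w))
```

Under these assumptions, the weight is the single knob that sets how far the result leans toward the learnt policy versus the minimal-cost action.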