  • Q-learning — A Reinforcement Learning Solution to the Towers of Hanoi Puzzle

    Our goal is to write the code for reinforcement learning (Q-learning) and then use it to solve the Towers of Hanoi puzzle.

    Introduction to reinforcement learning

    I will not repeat the basic formal definitions here; instead, let's go straight to the useful parts.

    The steps of this reinforcement learning approach:

    • For each state, compute the potential reward of every action available in that state, i.e. of every state-action pair.

      • These values are usually recorded in a Q table, represented here as a dictionary \(Q[(state, move)] = value\).
    • For the Towers of Hanoi puzzle we can always reach the final goal, so the final reinforcement is set to \(r = 1\).

    • In reinforcement learning there are two conventions for selecting actions (note: each convention uses a different Q-table update equation).

      • One: always pick the smallest value; a smaller value means we are closer to the goal.
      • Two: always pick the largest value; a larger value means we are closer to the goal.
      • Here we set the goal reinforcement to 1 and use the smaller-value convention to select actions. The selection rule is:
        • $ a_t = \mathop{argmin}_{a} Q(s_t,a). $
        • Here \(a_t\) is the selected action and \(s_t\) the current state: among the actions \(a\) available in \(s_t\), choose the action \(a_t\) with the smallest Q value.
    • Now consider how the Q table is updated.

    • We update the Q table with the two equations below, with \(r = 1\). (Note: all Q values are initialized to 0 and then updated per state-action pair.)

      • If the goal is reached:

        \[ Q(s_t,a_t) = Q(s_t,a_t) + \rho \, (r - Q(s_t,a_t)) \]

        • Alternatively, assign the value 1 directly to mark that the goal is reached; to keep the computation simple, that is what we do here.
      • Otherwise:

        \[ Q(s_t,a_t) = Q(s_t,a_t) + \rho \, (r + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)) \]

      • Understanding the equation above:

        • First, in state \(s_t\) we select action \(a_t\) according to the Q-table values and move to \(s_{t+1}\).
        • Once in \(s_{t+1}\), the first thing we do is update the Q value of the previous state \(s_t\) and action \(a_t\).
        • At this point the Q table gives us the value \(Q(s_{t+1},a_{t+1})\), and we also have the reinforcement \(r = 1\).
        • We treat \(r + Q(s_{t+1},a_{t+1})\) as the actual value of taking action \(a_t\) in \(s_t\), while \(Q(s_t,a_t)\) is the current estimate.
        • Therefore \(r + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\) is the difference between the actual and estimated values; multiplied by the learning rate \(\rho\), it is the amount learned in this step.
        • Finally, adding this difference to the old estimate \(Q(s_t,a_t)\) gives the updated \(Q(s_t,a_t)\).

    That concludes the explanation of the basic reinforcement learning setup; a small code sketch of the update rule follows.
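
    To make the two update rules concrete, here is a minimal sketch that applies them to a single hand-made transition. The names (q, rho, old_key, new_key) are placeholders of my own and not part of the implementation below; the arithmetic follows the equations above.

    # Toy illustration of the two update rules (placeholder names, not the real code).
    rho = 0.5                              # learning rate
    r = 1                                  # reinforcement received on every move
    q = {}                                 # Q table: (state, move) -> value
    
    old_key = ('stateA', (1, 2))           # previous (state, move), hypothetical key
    new_key = ('stateB', (2, 3))           # current (state, move), hypothetical key
    q.setdefault(old_key, 0)               # every Q value starts at 0
    q.setdefault(new_key, 0)
    
    # Not yet at the goal: Q(s_t,a_t) += rho * (r + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t))
    q[old_key] += rho * (r + q[new_key] - q[old_key])
    
    # The move stored under new_key reaches the goal: assign 1 directly (the shortcut used later).
    q[new_key] = 1
    print(q)                               # {('stateA', (1, 2)): 0.5, ('stateB', (2, 3)): 1}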

    What needs to be done

    • First, we visualize the Towers of Hanoi puzzle so that the result and the process are easy to observe.

    • Simply put, a state can be represented as [[1, 2, 3], [], []]: the three inner lists are the three pegs, and the numbers are the three disks, where a larger number means a larger disk.

    • A move of a disk can likewise be written as a single pair [1, 2] or (1, 2), meaning "move the top disk of peg 1 onto peg 2" (pegs are numbered 1, 2, 3 from left to right).

    • We therefore write the following four functions:

      • printState(state): print the state of the towers, for visualization.
      • validMoves(state): return all valid moves in the current state.
      • makeMove(state, move): return the state obtained after applying move (the action).
      • stateMoveTuple(state, move): convert state and move (the action) into tuple form, i.e. (state, move), because the Q table is stored as a dictionary whose keys must be tuples; this keeps things simple.
    • Next, write the epsilonGreedy function together with its epsilon decay factor (epsilonDecayFactor).

      • Its behaviour: draw a random number; if it is smaller than the preset epsilon, take a random move, otherwise take the move with the smallest Q value in the Q table.
      • In epsilonGreedy (if np.random.uniform() < epsilon), a small epsilon means the Q table is used to pick the move more often, while an epsilon that stays too large prevents convergence. For this problem we therefore apply epsilon *= epsilonDecayFactor after every episode so that epsilon tends towards 0 (a short numeric sketch of this decay appears right after this list).
    • trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF, startState, goalState)

      • Trains the Q table from the given start and goal states and returns a reasonable Q table.
      • Pseudocode for trainQ:
        1. Initialize Q.
        2. Repeat:
          1. Use the epsilonGreedy function to pick a move and compute stateNew.
          2. If (state, move) is not in Q, initialize Q[(state, move)] = 0.
          3. If stateNew is goalState,
            1. Update Q[(state, move)] = 1.
          4. Otherwise (not at goal),
            1. If not the first step, update Qold = Qold + rho * (1 + Qnew - Qold).
            2. Shift the current state and move into the old ones and continue from stateNew.
    • testQ(Q, maxSteps, validMovesF, makeMoveF, startState, goalState)

      • Given the desired start and goal states, automatically follows the values in the Q table and chooses the best sequence of moves.

      • Pseudocode for testQ:

        1. Get Q from trainQ.

        2. Repeat:

          1. Use the validMoves function to get the list of valid moves.
          2. Look up the value of (state, move) in the Q table for every valid move; if a move is not in Q, treat its value as infinity.
          3. Choose the move by argmin Q[(state, move)].
          4. Record the move and the resulting state in path.
          5. If at the goal, return path.
          6. If step > maxSteps, return 'Goal not reached in maxSteps'.

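    Before the full code, here is a small numeric sketch (my own illustration) of how epsilon shrinks when it is multiplied by a decay factor once per episode; the two factors are arbitrary example values.

    # Illustrative only: value of epsilon after each episode for two decay factors.
    for decay in (0.7, 0.9):
        epsilon = 1.0
        trace = []
        for episode in range(10):
            epsilon *= decay               # the same update trainQ applies once per episode
            trace.append(round(epsilon, 3))
        print(decay, trace)
    # With 0.7, epsilon drops below 0.03 within 10 episodes; with 0.9 it is still about 0.35,
    # so a factor closer to 1 keeps random exploration going for longer.
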
    Code & Test

    import numpy as np
    import random
    import matplotlib.pyplot as plt
    import copy
    %matplotlib inline
    
    def stateModify(state):
        # Pad every peg with leading zeros up to height N and flatten the result
        # row by row so the towers can be printed as a picture.
        N = 3
        row = []
        stateModify = []
        columns = len(state)
        stateCopy = copy.deepcopy(state)  # deep copy so the caller's state is not modified
        for i in range(columns):
            row.append(len(state[i]))
        # add 0s on top of the shorter pegs
        for i in range(columns):
            while row[i] < N:
                stateCopy[i].insert(0, 0)
                row[i] = len(stateCopy[i])
        # read the padded pegs row by row
        for i in range(max(row)):
            for j in range(len(stateCopy)):
                stateModify.append(stateCopy[j][i])
        return stateModify
    
    def printState(state):
        statePrint = stateModify(state)
        # print the state 
        i = 0
        for num in statePrint:
            # if the number is zero, we print ' '
            if num == 0:
                print(" ",end=" ")
            else:
                print(num, end=" ")
            i += 1
            if i%3 == 0:
                print("")
        print('------')
    
    def validMoves(state):
        actions = []    
        # check left 
        if state[0] != []:
            # left to middle
            if state[1]==[] or state[0][0] < state[1][0]:
                actions.append([1,2])
            # left to right
            if state[2]==[] or state[0][0] < state[2][0]:
                actions.append([1,3])
       
        # check middle
        if state[1] != []:
            # middle to left
            if state[0]==[] or state[1][0] < state[0][0]:
                actions.append([2,1])
            # middle to right   
            if state[2]==[] or state[1][0] < state[2][0]:
                actions.append([2,3])
        
        # check right        
        if state[2] != []:
            # right to left
            if state[0]==[] or state[2][0] < state[0][0]:
                actions.append([3,1])
            # right to middle
            if state[1]==[] or state[2][0] < state[1][0]:
                actions.append([3,2])            
        return actions
    
    def stateMoveTuple(state, move):
        stateTuple = []
        returnTuple = [tuple(move)]
        for i in range (len(state)):
            stateTuple.append(tuple(state[i]))
        returnTuple.insert(0,tuple(stateTuple))
        return tuple(returnTuple)
    
    def makeMove(state, move):
        stateMove = copy.deepcopy(state)  # deep copy so the original state is untouched
        
        stateMove[move[1]-1].insert(0,stateMove[move[0]-1][0])
        stateMove[move[0]-1].pop(0)
        return stateMove
    
    def epsilonGreedy(Q, state, epsilon, validMovesF):
        validMoveList = validMovesF(state)  # use the passed-in validMoves function
        if np.random.uniform() < epsilon:
            # Random Move
            lens = len(validMoveList)
            return validMoveList[random.randint(0,lens-1)]
        else:
            # Greedy Move
            Qs = np.array([Q.get(stateMoveTuple(state, m), 0) for m in validMoveList]) 
            return validMoveList[np.argmin(Qs)]
    
    def trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF,startState,goalState):
        epsilon = 1.0
        outcomes = np.zeros(nRepetitions)
        Q = {}
        for nGames in range(nRepetitions):
            epsilon *= epsilonDecayFactor
            step = 0
            done = False
            state = copy.deepcopy(startState)
        
            while not done:
                step += 1
                move = epsilonGreedy(Q, state, epsilon, validMovesF)         
                stateNew = makeMoveF(state,move)
                if stateMoveTuple(state, move) not in Q:
                    Q[stateMoveTuple(state, move)] = 0 
                    
                if stateNew == goalState:
    #                 Q[stateMoveTuple(state, move)] += learningRate * (1 - Q[stateMoveTuple(state, move)])
                    Q[stateMoveTuple(state, move)] = 1
                    done = True
                    outcomes[nGames] = step  
                    
                else:
                    if step > 1:
                        Q[stateMoveTuple(stateOld, moveOld)] += learningRate * \
                            (1 + Q[stateMoveTuple(state, move)] - Q[stateMoveTuple(stateOld, moveOld)])
                    stateOld = copy.deepcopy(state)
                    moveOld = copy.deepcopy(move)
                    state = copy.deepcopy(stateNew)
        return Q, outcomes                  
    
    def testQ(Q, maxSteps, validMovesF, makeMoveF, startState, goalState):
        state = copy.deepcopy(startState)
        path = []
        path.append(state)
        step = 0
        while True:
            step += 1
            Qs = []
            validMoveList = validMovesF(state)
            for m in validMoveList:
                if stateMoveTuple(state, m) in Q:
                    Qs.append(Q[stateMoveTuple(state, m)])
                else:
                    # unseen (state, move) pairs get a very large value so they are never chosen
                    Qs.append(0xffffff)
            stateNew = makeMoveF(state, validMoveList[np.argmin(Qs)])
            path.append(stateNew)
            if stateNew == goalState:
                return path
            elif step >= maxSteps:
                print('Goal not reached in {} steps'.format(maxSteps))
                return []
            state = copy.deepcopy(stateNew)
    
    
    def minsteps(steps, minStepOld, nRepetitions):
        # Count how many early (exploratory) episodes have to be dropped from the
        # front of `steps` before the mean of the remaining episodes falls to the
        # optimal 7 moves. Return the new count and True if it beats minStepOld,
        # otherwise return minStepOld and False.
        delStep = 0
        steps = list(steps)
        while delStep != nRepetitions:
            if np.mean(steps) > 7:
                steps.pop(0)
                delStep += 1
            else:
                if delStep < minStepOld:
                    return delStep, True
                else:
                    return minStepOld, False
        if delStep < minStepOld:
            return delStep, True
        else:
            return minStepOld, False
    
    
    def findBetter(nRepetitions, learningRate, epsilonDecayFactor):
        # Grid search over learning rates and epsilon decay factors, keeping the pair
        # that needs the fewest exploratory episodes before play becomes optimal.
        Q, steps = trainQ(nRepetitions, 0.5, 0.7, validMoves, makeMove,
                          startState=[[1, 2, 3], [], []], goalState=[[], [], [1, 2, 3]])
        minStepOld, _ = minsteps(steps, 0xffffff, nRepetitions)
        bestlRate = 0.5
        besteFactor = 0.7
        LAndE = []
        for k in range(10):
            for i in range(len(learningRate)):
                for j in range(len(epsilonDecayFactor)):
                    Q, steps = trainQ(nRepetitions, learningRate[i], epsilonDecayFactor[j], validMoves, makeMove,
                                      startState=[[1, 2, 3], [], []], goalState=[[], [], [1, 2, 3]])
                    minStepNew, B = minsteps(steps, minStepOld, nRepetitions)
                    if B:
                        bestlRate = learningRate[i]
                        besteFactor = epsilonDecayFactor[j]
                        minStepOld = copy.deepcopy(minStepNew)
            LAndE.append([bestlRate, besteFactor])
        return LAndE
    
    

    Test part

    state = [[1, 2, 3], [], []]
    printState(state)
    
    
    1     
    2     
    3     
    ------
    
    
    state = [[1, 2, 3], [], []]
    move =[1, 2]
    stateMoveTuple(state, move)
    
    
    (((1, 2, 3), (), ()), (1, 2))
    
    
    state = [[1, 2, 3], [], []]
    newstate = makeMove(state, move)
    newstate
    
    
    [[2, 3], [1], []]
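
    validMoves is the one helper not exercised in this test section; a quick check of my own (expected output read off the function definition above) would be:

    state = [[1, 2, 3], [], []]
    validMoves(state)          # expected: [[1, 2], [1, 3]] -- only the top disk of peg 1 can move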
    
    
    Q, stepsToGoal = trainQ(100, 0.5, 0.7, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
    path = testQ(Q, 20, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
    path
    
    
    [[[1, 2, 3], [], []],
     [[2, 3], [], [1]],
     [[3], [2], [1]],
     [[3], [1, 2], []],
     [[], [1, 2], [3]],
     [[1], [2], [3]],
     [[1], [], [2, 3]],
     [[], [], [1, 2, 3]]]
    
    
    for s in path:
        printState(s)
        print()
    
    
    1     
    2     
    3     
    ------
    
    2     
    3   1 
    ------
    
    3 2 1 
    ------
    
      1   
    3 2   
    ------
    
      1   
      2 3 
    ------
    
    1 2 3 
    ------   
    
        2 
    1   3 
    ------
    
        1 
        2 
        3 
    ------
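
    matplotlib is imported at the top of the code but never used; as an optional addition of my own, the stepsToGoal array returned by the trainQ call above can be plotted to see how quickly training converges to the optimal 7 moves.

    # Optional visualization (my addition): steps needed in each training episode.
    plt.plot(stepsToGoal)
    plt.axhline(7, color='r', linestyle='--', label='optimal (7 moves)')
    plt.xlabel('training episode')
    plt.ylabel('steps to reach the goal')
    plt.legend()
    plt.show()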
    
    

    # find better learningRate and epsilonDecayFactor
    learningRate = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
    epsilonDecayFactor = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
    LAndE = findBetter(100,learningRate,epsilonDecayFactor)
    print(LAndE)
    
    
    [[0.9, 0.2], [0.9, 0.2], [0.9, 0.2], [0.9, 0.2], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6]]
    
    
  • Original article: https://www.cnblogs.com/shenggang/p/12133265.html