  • Q-learning: A Reinforcement Learning Solution to the Towers of Hanoi Puzzle

    Our goal is to write Q-learning code and then use it to solve the Towers of Hanoi puzzle.

    Introduction to Reinforcement Learning

    The detailed basic definitions will not be repeated here; let's go straight to the useful parts.

    Steps of reinforcement learning:

    • For each state, compute the potential reward of every state-action pair available in that state.

      • These values are usually recorded in a Q table, which can be represented as Q[(state, move)] = value.
    • For the Towers of Hanoi puzzle, the final goal is always reachable, so we set the final reinforcement \(r = 1\).

    • There are two strategies for choosing an action (note: each strategy has its own Q-table update equation):

      • First, always choose the smallest value; a smaller value means closer to the goal.
      • Second, always choose the largest value; a larger value means closer to the goal.
      • Here we set the goal reinforcement to 1 and choose actions by the smaller value. The selection equation is:
        • $a_t^o = \mathop{argmin}_{a} Q(s_t, a)$
        • where \(a_t\) is the chosen action and \(s_t\) the current state: among the actions \(a\) available in \(s_t\), choose the action \(a_t\) with the smallest Q value.
    • Now consider how to update the Q table.

    • For updating the Q table, we use the two equations below, with \(r = 1\). (Note: all Q values are initialized to 0 and then updated per state-action pair.)

      • If the goal is reached:

        \[ Q(s_t, a_t) = Q(s_t, a_t) + \rho \left( r - Q(s_t, a_t) \right) \]

        • Alternatively, assign the value 1 directly to mark that the goal has been reached; for simplicity of computation, we assign 1 directly here.
      • Otherwise:

        \[ Q(s_t, a_t) = Q(s_t, a_t) + \rho \left( r + Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right) \]

      • Understanding the equations above:

        • First, in state \(s_t\), we choose action \(a_t\) according to the Q table and move to \(s_{t+1}\).
        • In \(s_{t+1}\), the first thing we do is update the Q value of the previous state \(s_t\) and action \(a_t\).
        • At this point, the Q table gives us the value of \(a_{t+1}\) in \(s_{t+1}\), and we have the goal reinforcement \(r = 1\).
        • Here we treat \(r + Q(s_{t+1}, a_{t+1})\) as the actual Q value of action \(a_t\) in state \(s_t\), while \(Q(s_t, a_t)\) is the estimate.
        • Therefore, \(r + Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\) is the difference between the actual and estimated values; multiplied by the learning rate \(\rho\), it is the amount learned on each step.
        • Finally, adding this difference to the old estimate \(Q(s_t, a_t)\) gives the updated \(Q(s_t, a_t)\).

    This concludes the explanation of basic reinforcement learning.
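
    As a quick numeric check of the non-goal update equation, here is a minimal sketch with made-up values (\(\rho = 0.5\) and the Q entries are illustrations only, not values from a training run):

    rho = 0.5       # learning rate (illustrative value)
    Q_old = 0.0     # current estimate Q(s_t, a_t), freshly initialized
    Q_next = 2.0    # Q(s_{t+1}, a_{t+1}) after the move
    r = 1           # reinforcement of 1 per move
    Q_old = Q_old + rho * (r + Q_next - Q_old)
    print(Q_old)    # 1.5: the estimate moves halfway toward the target r + Q_next = 3.0

    Because every move contributes \(r = 1\) and the terminal value is set to 1, the minimized Q values tend toward the number of moves remaining to reach the goal.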

    What needs to be done

    • First, we visualize the Towers of Hanoi puzzle, so that the result and the process are easy to observe.

    • Simply put, a state can be represented as [[1, 2, 3], [], []]: the three inner [] are the three pegs, and the numbers are the three disks, where a larger number means a larger disk.

    • A disk-moving action can likewise be simplified to a single [1, 2] or (1, 2), meaning: move the top disk of peg 1 onto peg 2 (pegs are numbered 1, 2, 3 from left to right).

    • Then we can write the following four functions:

      • printState(state): print the tower state, for visualization
      • validMoves(state): return all valid moves in the current state
      • makeMove(state, move): return the state reached by applying move (the action)
      • stateMoveTuple(state, move): convert state and move (the action) into tuple form, i.e. (state, move); we store the Q table as a dictionary, whose keys must be hashable, so tuples keep things simple
    • Next, write the epsilonGreedy function and its epsilonDecayFactor

      • Its job: draw a random number; if it is smaller than the preset epsilon, pick a random move; if it is larger, pick the move with the smallest Q value in the Q table.
      • For the epsilonGreedy function (if np.random.uniform() < epsilon), a small epsilon means moves are more likely to be chosen from the Q table. Too large an epsilon can prevent convergence. For this problem, we apply epsilon *= epsilonDecayFactor to keep shrinking epsilon toward 0 (see the short decay sketch after this list).
    • trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF, startState, goalState)

      • Train the Q table from the given start and goal states until it becomes a reasonable Q table.
      • Pseudocode for trainQ:
        1. Initialize Q.
        2. Repeat:
          1. Use the epsilonGreedy function to choose a move and compute stateNew.
          2. If (state, move) is not in Q, initialize Q[(state, move)] = 0.
          3. If stateNew is goalState, set Q[(state, move)] = 1.
          4. Otherwise (not at goal):
            1. If this is not the first step, update Qold = Qold + rho * (1 + Qnew - Qold).
            2. Shift the current state and move to the old ones.
    • testQ(Q, maxSteps, validMovesF, makeMoveF, startState, goalState)

      • Choose the desired start and goal states; testQ then follows the best moves automatically, according to the values in the Q table.

      • Pseudocode for testQ:

        1. Get Q from trainQ.

        2. Repeat:

          1. Use the validMoves function to get the list of valid moves.
          2. Look up the Q value of each (state, move); if a move is not in Q, treat its value as infinity.
          3. Choose the move by argmin Q[(state, move)].
          4. Record the move and state in path.
          5. If at goal, return path.
          6. If step > maxSteps, return 'Goal not reached in maxSteps'.
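
    As referenced above, here is a minimal sketch of how epsilon shrinks under epsilon *= epsilonDecayFactor (the factor 0.7 is only an illustrative choice):

    # epsilon after n episodes is epsilonDecayFactor ** n, so exploration
    # fades geometrically and the agent leans on the Q table more and more
    epsilon = 1.0
    epsilonDecayFactor = 0.7
    for episode in range(5):
        epsilon *= epsilonDecayFactor
        print(episode, round(epsilon, 4))
    # prints: 0 0.7, 1 0.49, 2 0.343, 3 0.2401, 4 0.1681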

    Code & Test

    import numpy as np
    import random
    import matplotlib.pyplot as plt
    import copy
    %matplotlib inline
    
    def stateModify(state):
        # Pad each peg with leading zeros up to a fixed height of N disks, then
        # flatten the pegs row by row so printState can print one row per line.
        N = 3
        stateCopy = copy.deepcopy(state)  # deep copy so the caller's state is not mutated
        for peg in stateCopy:
            while len(peg) < N:
                peg.insert(0, 0)
        flat = []
        for i in range(N):
            for peg in stateCopy:
                flat.append(peg[i])
        return flat
    
    def printState(state):
        statePrint = stateModify(state)
        # print the state 
        i = 0
        for num in statePrint:
            # if the number is zero, we print ' '
            if num == 0:
                print(" ",end=" ")
            else:
                print(num, end=" ")
            i += 1
            if i%3 == 0:
                print("")
        print('------')
    
    def validMoves(state):
        actions = []    
        # check left 
        if state[0] != []:
            # left to middle
            if state[1]==[] or state[0][0] < state[1][0]:
                actions.append([1,2])
            # left to right
            if state[2]==[] or state[0][0] < state[2][0]:
                actions.append([1,3])
       
        # check middle
        if state[1] != []:
            # middle to left
            if state[0]==[] or state[1][0] < state[0][0]:
                actions.append([2,1])
            # middle to right   
            if state[2]==[] or state[1][0] < state[2][0]:
                actions.append([2,3])
        
        # check right        
        if state[2] != []:
            # right to left
            if state[0]==[] or state[2][0] < state[0][0]:
                actions.append([3,1])
            # right to middle
            if state[1]==[] or state[2][0] < state[1][0]:
                actions.append([3,2])            
        return actions
    
    def stateMoveTuple(state, move):
        # Convert state and move to nested tuples so they can serve as dictionary keys
        return (tuple(tuple(peg) for peg in state), tuple(move))
    
    def makeMove(state, move):
        # Return a new state with the top disk of peg move[0] moved onto peg move[1]
        stateMove = copy.deepcopy(state)
        stateMove[move[1] - 1].insert(0, stateMove[move[0] - 1][0])
        stateMove[move[0] - 1].pop(0)
        return stateMove
    
    def epsilonGreedy(Q, state, epsilon, validMovesF):
        validMoveList = validMovesF(state)
        if np.random.uniform() < epsilon:
            # Random move (exploration)
            return validMoveList[random.randint(0, len(validMoveList) - 1)]
        else:
            # Greedy move (exploitation): the valid move with the smallest Q value
            Qs = np.array([Q.get(stateMoveTuple(state, m), 0) for m in validMoveList])
            return validMoveList[np.argmin(Qs)]
    
    def trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF,startState,goalState):
        epsilon = 1.0
        outcomes = np.zeros(nRepetitions)
        Q = {}
        for nGames in range(nRepetitions):
            epsilon *= epsilonDecayFactor
            step = 0
            done = False
            state = copy.deepcopy(startState)
        
            while not done:
                step += 1
                move = epsilonGreedy(Q, state, epsilon, validMovesF)         
                stateNew = makeMoveF(state,move)
                if stateMoveTuple(state, move) not in Q:
                    Q[stateMoveTuple(state, move)] = 0 
                    
                if stateNew == goalState:
    #                 Q[stateMoveTuple(state, move)] += learningRate * (1 - Q[stateMoveTuple(state, move)])
                    Q[stateMoveTuple(state, move)] = 1
                    done = True
                    outcomes[nGames] = step  
                    
                else:
                    if step > 1:
                        Q[stateMoveTuple(stateOld, moveOld)] += learningRate * (
                            1 + Q[stateMoveTuple(state, move)] - Q[stateMoveTuple(stateOld, moveOld)])
                    stateOld = copy.deepcopy(state)
                    moveOld = copy.deepcopy(move)
                    state = copy.deepcopy(stateNew)
        return Q, outcomes                  
    
    def testQ(Q, maxSteps, validMovesF, makeMoveF, startState, goalState):
        # Greedily follow the trained Q table from startState toward goalState
        state = copy.deepcopy(startState)
        path = [state]
        step = 0
        while True:
            step += 1
            validMoveList = validMovesF(state)
            # moves never seen during training get an effectively infinite value,
            # so argmin prefers moves the Q table knows about
            Qs = [Q.get(stateMoveTuple(state, m), 0xffffff) for m in validMoveList]
            stateNew = makeMoveF(state, validMoveList[np.argmin(Qs)])
            path.append(stateNew)
            if stateNew == goalState:
                return path
            elif step >= maxSteps:
                print('Goal not reached in {} steps'.format(maxSteps))
                return []
            state = copy.deepcopy(stateNew)
    
    
    def minsteps(steps, minStepOld, nRepetitions):
        # Count how many early (exploratory) episodes must be dropped before the
        # mean of the remaining step counts is at most 7, the optimal number of
        # moves for three disks. Return the better (smaller) of this count and
        # minStepOld, plus a flag saying whether the new count is an improvement.
        delStep = 0
        steps = list(steps)
        while delStep != nRepetitions and np.mean(steps) > 7:
            steps.pop(0)
            delStep += 1
        if delStep < minStepOld:
            return delStep, True
        return minStepOld, False
    
    
    def findBetter(nRepetitions, learningRate, epsilonDecayFactor):
        # Grid-search over the candidate learningRate and epsilonDecayFactor lists,
        # keeping the pair that reaches a mean of <= 7 steps after the fewest
        # training episodes. Repeat 10 times since training is stochastic.
        Q, steps = trainQ(nRepetitions, 0.5, 0.7, validMoves, makeMove,
                          startState=[[1, 2, 3], [], []], goalState=[[], [], [1, 2, 3]])
        minStepOld, _ = minsteps(steps, 0xffffff, nRepetitions)
        bestlRate = 0.5
        besteFactor = 0.7
        LAndE = []
        for k in range(10):
            for i in range(len(learningRate)):
                for j in range(len(epsilonDecayFactor)):
                    Q, steps = trainQ(nRepetitions, learningRate[i], epsilonDecayFactor[j],
                                      validMoves, makeMove,
                                      startState=[[1, 2, 3], [], []], goalState=[[], [], [1, 2, 3]])
                    minStepNew, B = minsteps(steps, minStepOld, nRepetitions)
                    if B:
                        bestlRate = learningRate[i]
                        besteFactor = epsilonDecayFactor[j]
                        minStepOld = copy.deepcopy(minStepNew)
            LAndE.append([bestlRate, besteFactor])
        return LAndE
    
    

    Test part

    state = [[1, 2, 3], [], []]
    printState(state)
    
    
    1     
    2     
    3     
    ------
    
    
    state = [[1, 2, 3], [], []]
    move =[1, 2]
    stateMoveTuple(state, move)
    
    
    (((1, 2, 3), (), ()), (1, 2))
    
    
    state = [[1, 2, 3], [], []]
    newstate = makeMove(state, move)
    newstate
    
    
    [[2, 3], [1], []]
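

    As a quick sanity check of validMoves (an added example; the expected list below follows directly from the move rules above):

    state = [[1], [2], [3]]
    validMoves(state)


    [[1, 2], [1, 3], [2, 3]]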
    
    
    Q, stepsToGoal = trainQ(100, 0.5, 0.7, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
    path = testQ(Q, 20, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
    path
    
    
    [[[1, 2, 3], [], []],
     [[2, 3], [], [1]],
     [[3], [2], [1]],
     [[3], [1, 2], []],
     [[], [1, 2], [3]],
     [[1], [2], [3]],
     [[1], [], [2, 3]],
     [[], [], [1, 2, 3]]]
    
    
    for s in path:
        printState(s)
        print()
    
    
    1     
    2     
    3     
    ------
    
    2     
    3   1 
    ------
    
    3 2 1 
    ------
    
      1   
    3 2   
    ------
    
      1   
      2 3 
    ------
    
    1 2 3 
    ------   
    
        2 
    1   3 
    ------
    
        1 
        2 
        3 
    ------
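
    After training has converged, the Q value of the first move on the optimal path should settle near 7, the number of moves remaining (an added check; the exact value depends on the run):

    Q[stateMoveTuple([[1, 2, 3], [], []], [1, 3])]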
    
    

    # find better learningRate and epsilonDecayFactor
    learningRate = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
    epsilonDecayFactor = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
    LAndE = findBetter(100,learningRate,epsilonDecayFactor)
    print(LAndE)
    
    
    [[0.9, 0.2], [0.9, 0.2], [0.9, 0.2], [0.9, 0.2], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6]]
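
    The notebook imports matplotlib (%matplotlib inline) but never uses it; here is a minimal sketch for visualizing how training converges (the threshold line at 7 moves and the axis labels are my additions):

    Q, stepsToGoal = trainQ(100, 0.5, 0.7, validMoves, makeMove,
                            startState=[[1, 2, 3], [], []], goalState=[[], [], [1, 2, 3]])
    plt.plot(stepsToGoal)               # steps needed to reach the goal in each episode
    plt.axhline(7, color='r', ls='--')  # 7 = optimal number of moves for three disks
    plt.xlabel('episode')
    plt.ylabel('steps to goal')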
    
    
  • Original post: https://www.cnblogs.com/shenggang/p/12133265.html