  • SML_Assignment2

    Task 1: Optimization

    1a) Numerical Optimization

    Derivative of the function
    \( \frac{\partial f}{\partial x_j} = \sum_{i=1}^{N-1} \left[ 200(x_i - x_{i-1}^2)(\delta_{i,j} - 2x_{i-1}\delta_{i-1,j}) - 2(1 - x_{i-1})\delta_{i-1,j} \right] \\ = 200(x_j - x_{j-1}^2) - 400x_j(x_{j+1} - x_j^2) - 2(1 - x_j) \)
    The first and the last components are exceptions:
    \( \frac{\partial f}{\partial x_0} = -400x_0(x_1 - x_0^2) - 2(1 - x_0), \qquad \frac{\partial f}{\partial x_{N-1}} = 200(x_{N-1} - x_{N-2}^2) \)
    The complete code is below:

    import numpy as np
    import matplotlib.pyplot as plt
    
    # Derivative of the function
    def derrivation_rosenbrock(x):
        xm = x[1:-1]
        xm_m1 = x[:-2]
        xm_p1 = x[2:]
        der = np.zeros_like(x)
        der[1:-1] = 200*(xm-xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1-xm)
        der[0] = -400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0])
        der[-1] = 200*(x[-1]-x[-2]**2)
        return der
    
    
    def main():
        # set the learning rate first
        # the learning rate is chosen ad hoc for this objective function
        lrate = 0.002
        # initialize a point
        x = np.array([-0.5, 0.2,-0.3, 0.1,-0.4, 0.5,-0.1, 0.1,-0.3, 0.4,-0.1, 0.2,-0.5, 0.4,-0.2, 0.4,-0.3, 0.2,-0.3, 0.4])
        # set number of steps (can be changed as you want)
        steps = 10000
        # record (point, objective value) tuples for every iteration
        ai = []
        n = 20
        for i in range(steps):
            # Rosenbrock objective: f(x) = sum_j 100*(x_{j+1} - x_j^2)^2 + (x_j - 1)^2
            # (equivalently: scipy.optimize.rosen(x))
            f_val = 0
            for j in range(n - 1):
                f_val += 100 * ((x[j + 1] - x[j] ** 2) ** 2) + (x[j] - 1) ** 2
            # append the point and its objective value to ai
            ai.append([x, f_val])
            # compute the gradient at the current point
            fi = derrivation_rosenbrock(x)
            # take a gradient descent step
            x = x - lrate * fi
    
        # convert ai into a numpy object array (each row holds a point and a scalar)
        ai = np.array(ai, dtype=object)
        # print the last 10 entries of ai
        print(ai[-10:])
        # the minimum value found is simply the last element of ai
        print(f'the minimum is: {ai[-1, 1]} at point: {ai[-1, 0]}')
    
    
        # plot the objective value over the iterations
        iters = np.arange(steps)
        plt.plot(iters, ai[:, 1].astype(float))
        plt.xlabel("Iteration")
        plt.ylabel("Objective value")
        plt.title("Gradient descent on the Rosenbrock function")
        plt.legend(["f(x)"])
        plt.show()
    
    if __name__ == '__main__':
        main()
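
    As a quick sanity check (a minimal sketch, assuming SciPy is available and that derrivation_rosenbrock from the code above is in scope), the analytic gradient can be compared against a finite-difference approximation of the objective; scipy.optimize.check_grad returns the norm of the difference between the two, which should be close to zero:

    import numpy as np
    from scipy.optimize import check_grad, rosen  # rosen is SciPy's Rosenbrock function

    x0 = np.array([-0.5, 0.2, -0.3, 0.1, -0.4])
    # compare the analytic gradient defined above with a numerical gradient of rosen
    print(check_grad(rosen, derrivation_rosenbrock, x0))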
    

    1b) Gradient Descent Variants

    (1)
    vanilla (batch):
    pros:
    For convex optimization problems the algorithm converges to the global optimum; for non-convex problems it still converges, but possibly only to a local optimum.
    cons:
    It can get stuck at local minima and saddle points, and every update requires a full pass over the data.

    stochastic:
    pros:
    SGD greatly reduces the cost per update, because each step uses only a single randomly selected sample.
    cons:
    The random selection of a single sample per step makes the iterations noisy: the objective value visibly fluctuates over the iterations, and individual updates can veer off in the wrong direction.

    mini-batch:
    pros:
    It needs far fewer updates than SGD while keeping SGD's low cost per gradient computation, and it usually achieves better and smoother convergence (a minimal sketch of all three variants follows this list).
    cons:
    The mini-batch size is an additional hyperparameter that has to be tuned.
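
    As an illustration (a minimal sketch with a hypothetical quadratic loss, not part of the assignment code), the three variants differ only in how many samples enter each gradient computation:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))             # hypothetical data set
    y = X @ np.array([1., -2., 0.5, 3., -1.])  # hypothetical targets
    w = np.zeros(5)
    lrate = 0.01

    def gradient(w, Xb, yb):
        # gradient of the mean squared error 0.5*||Xb w - yb||^2 / len(yb)
        return Xb.T @ (Xb @ w - yb) / len(yb)

    # vanilla (batch): one update uses the whole data set
    w = w - lrate * gradient(w, X, y)

    # stochastic: one update uses a single random sample
    i = rng.integers(len(y))
    w = w - lrate * gradient(w, X[i:i+1], y[i:i+1])

    # mini-batch: one update uses a small random subset, e.g. 32 samples
    idx = rng.choice(len(y), size=32, replace=False)
    w = w - lrate * gradient(w, X[idx], y[idx])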
    (2)
    Momentum is an extension of stochastic gradient descent (SGD) that is usually more efficient than plain SGD. The momentum method, also known as SGD with momentum, accelerates the gradient updates in the relevant direction and thereby speeds up convergence. It is a very popular optimization technique and is used in many models today.

    With stochastic gradient descent we do not compute the exact gradient of the loss function; instead we estimate it from a small subset of the data. This means we do not always move in the best direction, because the estimates are noisy. An exponentially weighted average of the past gradients therefore gives an estimate that is closer to the true gradient than the individual noisy computations. This is one reason why the momentum method can perform better than plain SGD.
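
    A minimal sketch of this idea (hypothetical names, not the assignment code): SGD with momentum keeps an exponentially weighted average of past gradients and uses it as the update direction:

    import numpy as np

    def sgd_momentum_step(w, grad, velocity, lrate=0.01, beta=0.9):
        # velocity is the exponentially weighted average of past gradients;
        # beta controls how much of the previous direction is kept
        velocity = beta * velocity + (1 - beta) * grad
        w = w - lrate * velocity
        return w, velocity

    # usage: start with zero velocity and update it together with the parameters
    w = np.zeros(3)
    velocity = np.zeros_like(w)
    w, velocity = sgd_momentum_step(w, grad=np.array([0.5, -1.0, 0.2]), velocity=velocity)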

    Task 2: Density Estimation

    2a) Maximum Likelihood Estimate of the Exponential Distribution

    The probability density function of the exponential distribution is defined as:
    \( f(x) = \begin{cases} \frac{1}{s}\exp(-\frac{x}{s}) & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} \)
    Its likelihood function is:
    \( L(\lambda; x_1, x_2,\cdots, x_n) = \prod_{i=1}^n p(x_i|s) = \prod_{i=1}^n \frac{1}{s}\exp(-\frac{x_i}{s}) = \frac{1}{s^n}\exp(-\frac{1}{s}\sum_{i=1}^n x_i) \)
    To calculate the MLE we set the derivative of the log-likelihood with respect to \(\lambda\) to zero:
    \( \frac{d\ln L(\lambda; x_1, x_2, \cdots, x_n)}{d\lambda} = \frac{d\ln\left(\lambda^n e^{-\lambda\sum_{i=1}^n x_i}\right)}{d\lambda} = \frac{d\left(n\ln\lambda - \lambda\sum_{i=1}^n x_i\right)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 \)
    Because \(\lambda = \frac{1}{s}\), this is equivalent to \(ns - \sum_{i=1}^n x_i = 0\), so the maximum likelihood estimate is \(\hat{s} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}\) (equivalently \(\hat{\lambda} = \frac{n}{\sum_{i=1}^n x_i}\)).
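
    A quick numerical check (a sketch with synthetic data, not part of the assignment): the ML estimate of the scale \(s\) is just the sample mean, and the rate is its reciprocal:

    import numpy as np

    rng = np.random.default_rng(0)
    true_s = 2.0
    x = rng.exponential(scale=true_s, size=10000)  # synthetic exponential samples

    s_hat = x.mean()          # MLE of the scale s
    lambda_hat = 1.0 / s_hat  # MLE of the rate lambda = 1/s
    print(s_hat, lambda_hat)  # s_hat should be close to 2.0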

    Task 3: Density Estimation

    3a) Prior Probabilities

    \( p(A) = \frac{\operatorname{count}(A)}{\operatorname{count}(A) + \operatorname{count}(B)} \)
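
    A minimal sketch of this computation (the file names classA.txt and classB.txt are hypothetical placeholders for the two class data sets):

    import numpy as np

    data_A = np.loadtxt("classA.txt")  # hypothetical file name
    data_B = np.loadtxt("classB.txt")  # hypothetical file name

    p_A = len(data_A) / (len(data_A) + len(data_B))
    p_B = len(data_B) / (len(data_A) + len(data_B))
    print(p_A, p_B)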

    3b) Biased ML Estimate

    Define the bias of an estimator:
    \( \operatorname{Bias}_\theta[\widehat{\theta}] = E_\theta[\widehat{\theta}] - \theta = E_\theta[\widehat{\theta} - \theta] \)
    Biased (ML) estimate of the variance of the conditional distribution:
    \( s^2 = \frac{\sum_{i=1}^N (x_i-\bar{x})^2}{N} \)
    Unbiased estimate of the variance of the conditional distribution:
    \( s^2 = \frac{\sum_{i=1}^N (x_i-\bar{x})^2}{N-1} \)
    We also need the ML estimate of the mean, \( \mu = \bar{x} = \frac{1}{N}\sum_{i=1}^N x_i \).
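
    A minimal sketch of the biased (ML) and unbiased estimates (reusing the hypothetical classA.txt file from above):

    import numpy as np

    data_A = np.loadtxt("classA.txt")  # hypothetical file name

    mu_hat = data_A.mean()                                              # ML estimate of the mean
    var_biased = np.sum((data_A - mu_hat) ** 2) / len(data_A)           # divides by N
    var_unbiased = np.sum((data_A - mu_hat) ** 2) / (len(data_A) - 1)   # divides by N-1
    # equivalently: np.var(data_A) and np.var(data_A, ddof=1)
    print(mu_hat, var_biased, var_unbiased)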

    3c) Class Density

    3d) Posterior

    Task 4: Non-parametric Density Estimation

    4a) Histogram

    The histogram with bin size 0.02 overfits (it is very noisy), while the histogram with bin size 2 has bins that are too large and does not represent the data distribution well. The histogram with bin size 0.5 is the best of the three.


    4b) Kernel Density Estimate

    A Gaussian kernel with \(\sigma = 0.03\) has a log-likelihood of 2425.84;
    a Gaussian kernel with \(\sigma = 0.2\) has a log-likelihood of 2383.64;
    a Gaussian kernel with \(\sigma = 0.8\) has a log-likelihood of 853.49.
    So, measured on the training data, the kernel with \(\sigma = 0.03\) performs the best.

    4c) K-Nearest Neighbors

    The K-nearest neighbors method with K = 2 has a log likelihood of -1256.04;
    The K-nearest neighbors method with K = 8 has a log likelihood of -1127.73;
    The K-nearest neighbors method with K = 35 has a log likelihood of -949.19;
    So the K-nearest neighbors method with K = 35 performs the best.

    4d) Comparison of the Non-Parametric Methods

    We need to test the estimators on a different data set because the training log-likelihood always favors the most flexible model (e.g. the very small kernel width), which overfits; only a held-out test set shows how well each density estimate generalizes.
    \( \begin{array}{c|cccccc} & \sigma = 0.03 & \sigma = 0.2 & \sigma = 0.8 & K = 2 & K = 8 & K = 35 \\ \hline \text{log-likelihood} & 2425.84 & 2383.64 & 853.49 & -1256.04 & -1127.73 & -949.19 \\ \end{array} \)
    I will use KNN, because it seems more stable than KDE.

    import math
    
    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt
    from scipy.spatial.distance import cdist
    
    training_data = pd.read_csv("nonParamTrain.txt", sep="   ")
    test_data = pd.read_csv("nonParamTest.txt", sep="   ")
    
    training_data.columns = test_data.columns = ["value"]
    
    x_min = -4
    x_max = 8
    
    # 4a) Histogram
    def plot_histo():
        histo_size = [0.02, 0.5, 2]
    
        for i, s in enumerate(histo_size):
            plt.figure(i)
            # number of bins = data range / bin size
            training_data.plot.hist(by="value", bins=math.ceil((training_data.max().value - training_data.min().value) / s))
            plt.xlabel("x")
            plt.title("Histogram with bin size {}".format(s))
            plt.xlim(x_min, x_max)
    
    
    def gaussian_kernel(x, data, sigma):
        # sum of Gaussian kernels centred at the data points, scaled by 1/(sqrt(2*pi)*sigma)
        numerator = np.sum(np.exp(-(x - data) ** 2 / (2 * sigma ** 2)))
        denominator = np.sqrt(2 * math.pi) * sigma
        return numerator / denominator
    
    # 4b) Kernel Density Estimate
    def gaussian_KDE():
        sigmas = [0.03, 0.2, 0.8]
        steps = (x_max - x_min) / 500
        x = np.arange(x_min, x_max, steps)
        plt.figure()
        for sigma in sigmas:

            # log-likelihood of the training data under the KDE
            # (lecture05, slide 48)
            y = np.empty(training_data.values.shape[0])
            for i, val in enumerate(training_data.values):
                y[i] = gaussian_kernel(val, training_data.values, sigma)

            print("The train log-likelihood for sigma = {} is {}".format(sigma, np.sum(np.log(y))))

            # log-likelihood of the held-out test data
            y = np.empty(test_data.values.shape[0])
            for i, val in enumerate(test_data.values):
                y[i] = gaussian_kernel(val, training_data.values, sigma)

            print("The test log-likelihood for sigma = {} is {}".format(sigma, np.sum(np.log(y))))

            # evaluate the estimate on a regular grid for plotting
            y = np.empty(x.shape)
            for i, val in enumerate(x):
                y[i] = gaussian_kernel(val, training_data.values, sigma)

            plt.plot(x, y, label=r"$\sigma=$" + str(sigma))
            plt.ylabel('Density')
            plt.xlabel('x')

        plt.legend()
        plt.show()
    
    
    def knn():
        ks = [2, 8, 35]
        steps = (x_max - x_min) / 300
        x = np.arange(x_min, x_max, steps)
        # calculate pairwise distances
        x_dist = cdist(x.reshape(x.shape[0], 1),
                       training_data.values.reshape(training_data.values.shape[0], 1),
                       metric="euclidean")
    
        for k in ks:
            y = np.empty(x.shape)
            for i, val in enumerate(x_dist):
                # find the k nearest training points and take the distance to the
                # farthest of them as the "radius" of the volume; this assumes the
                # distance matrix was computed with two different vectors
                # (use k+1 when evaluating on the training data itself)
                # np.argpartition(val, range(k))[:k] gives the k smallest distances, sorted
                V = val[np.argpartition(val, range(k))[:k]][-1]
                # calculate density
                y[i] = k / (training_data.values.shape[0] * V * 2)
    
            print("The log−likelihood for k={} is {}"
                  .format(k, np.sum(np.log(y))))
    
            plt.plot(x, y, label="k={}".format(k))
            plt.ylabel('Density')
            plt.xlabel('x')
    
        plt.legend()
        plt.show()
    
    
    # plot_histo()
    gaussian_KDE()
    knn()
    plt.show()
    

    Task 5: Expectation Maximization

    5a) Gaussian Mixture Update Rules

    Input: samples \(x = (x^{(1)}, x^{(2)},\cdots,x^{(m)})\), the joint distribution \(p(x,z|\theta)\), the conditional distribution \(p(z|x,\theta)\), and the number of iterations \(J\).

    1. Randomly initialize the model parameter \(\theta\), e.g. as \(\theta^0\).
    2. For \(j\) from 1 to \(J\):
      Expectation: calculate the conditional probability expectation of the joint distribution:
      \(Q_i(z^{(i)}) := P(z^{(i)}|x^{(i)},\theta)\)
      Maximization: maximize \(L(\theta)\) to obtain the new \(\theta\):
      \(\theta := \arg\max_\theta \sum_{i=1}^m\sum_{z^{(i)}}Q_i(z^{(i)})\log P(x^{(i)}, z^{(i)} | \theta)\)
      Repeat the expectation and maximization steps until \(\theta\) converges.
      Output: the model parameter \(\theta\). The resulting update rules for the Gaussian mixture model are given below.
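
    For the Gaussian mixture model specifically, writing \(\gamma_{ik}\) for the responsibility of component \(k\) for sample \(x^{(i)}\), the standard update rules (which the code in 5b implements) are:
    E-step: \(\gamma_{ik} = \frac{\pi_k \mathcal{N}(x^{(i)}|\mu_k,\Sigma_k)}{\sum_{l=1}^K \pi_l \mathcal{N}(x^{(i)}|\mu_l,\Sigma_l)}\)
    M-step: \(N_k = \sum_{i=1}^m \gamma_{ik}, \quad \mu_k = \frac{1}{N_k}\sum_{i=1}^m \gamma_{ik}\, x^{(i)}, \quad \Sigma_k = \frac{1}{N_k}\sum_{i=1}^m \gamma_{ik}(x^{(i)}-\mu_k)(x^{(i)}-\mu_k)^\top, \quad \pi_k = \frac{N_k}{m}\)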

    5b) EM

    Plots of the mixtures after 1, 3, 5, 10, and 30 iterations:

    Change of the log-likelihood over the iterations:

    The full code is below:

    import math
    
    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt
    
    training_data = pd.read_csv("gmm.txt", sep="  ")
    k = 4
    n_iter = 30
    
    
    def main():
        # init: identity covariances, random means within the data range, random mixing weights
        covar = np.array(
            [np.identity(training_data.shape[1]) for _ in range(k)])
        mu = np.random.uniform(training_data.min().min(), training_data.max().max(),
                               size=(k, training_data.shape[1]))
        pi = np.random.uniform(size=(k,))
        pi /= pi.sum()  # mixing weights must sum to 1
    
        log_likelihood = np.empty((n_iter,))
    
        for i in range(n_iter):
            alpha = e(x=training_data.values, mu=mu, covar=covar, pi=pi)
            mu, covar, pi = m(x=training_data.values, alpha=alpha)
    
            # plot at given steps
            if i + 1 in [1, 3, 5, 10, 30]:
                plt.figure(i)
                visualize(mu, covar, training_data.values, i)
            log_likelihood[i] = calculate_log_likelihood(training_data.values, mu, covar, pi)
    
            print("Finished iteration {}".format(i))
    
        plt.figure(n_iter + 1)
        plt.plot(log_likelihood)
        plt.xlabel("Iterations")
        plt.ylabel("Log-Likelihood")
        plt.title("Log-Likelihood Progress")
        plt.show()
    
    
    def e(x, mu, covar, pi):
        alpha = np.empty((k, x.shape[0]))
        for i in range(k):
            alpha[i] = pi[i] * multivariate_gaussian(x, mu[i], covar[i])
    
        # sum over all models per data point
        denominator = np.sum(alpha, axis=0)
    
        return alpha / denominator
    
    
    def m(x, alpha):
        N = np.sum(alpha, axis=1)  # sum over all data points per model
    
        mu = np.zeros((k, x.shape[1]))
        covar = np.zeros((k, x.shape[1], x.shape[1]))
    
        for i in range(k):
            # update mu
            for j, val in enumerate(x):
                mu[i] += (alpha[i, j] * val)
    
            mu[i] /= N[i]
    
            # update covariance
            for j, val in enumerate(x):
                diff = val - mu[i]
                covar[i] += alpha[i, j] * np.outer(diff, diff.T)
    
            covar[i] /= N[i]
    
        # update pi
        pi = N / x.shape[0]
    
        return mu, covar, pi
    
    
    def multivariate_gaussian(data, mu, covar):
        # density of a d-dimensional Gaussian, evaluated for every row of data
        d = data.shape[1]
        out = np.empty(data.shape[0])
        denominator = np.sqrt((2 * math.pi) ** d * np.linalg.det(covar))
        covar_inv = np.linalg.inv(covar)
    
        # compute for each datapoint
        for i, x in enumerate(data):
            diff = x - mu
            out[i] = np.exp(-0.5 * diff.T.dot(covar_inv).dot(diff)) / denominator
    
        return out
    
    
    def visualize(mu, covar, data, iteration):
    
        steps = 100
    
        x_data = data[:, 0]
        y_data = data[:, 1]
    
        x_min = x_data.min()
        x_max = x_data.max()
        y_min = y_data.min()
        y_max = y_data.max()
    
        x = np.arange(x_min - 1, x_max + 1, (x_max - x_min + 2) / steps)
        y = np.arange(y_min - 1, y_max + 1, (y_max - y_min + 2) / steps)
    
        Y, X = np.meshgrid(y, x)
        Z = np.empty((steps, steps))
    
        for i in range(k):
            for j in range(steps):
                # construct vector with same x and all possible y to cover the plot space
                points = np.append(X[j], Y[j]).reshape(2, x.shape[0]).T
                Z[j] = multivariate_gaussian(points, mu[i], covar[i])
            plt.contour(X, Y, Z, 1)
    
        # plot the samples
        plt.plot(x_data, y_data, 'co', zorder=1)
        plt.xlabel('x')
        plt.ylabel('y')
        plt.title("Mixtures after {} steps".format(iteration + 1))
    
    
    def calculate_log_likelihood(x, mu, covar, pi):
        likelihood = np.empty((k, x.shape[0]))
        for i in range(k):
            likelihood[i] = pi[i] * multivariate_gaussian(x, mu[i], covar[i])
    
        return np.sum(np.log(np.sum(likelihood, axis=0)))
    
    
    if __name__ == '__main__':
        main()
    


    Author: Rest探路者
    Source: http://www.cnblogs.com/Java-Starter/
    The copyright of this article is shared by the author and cnblogs (博客园). You are welcome to repost it, but without the author's consent please keep this statement and provide a clearly visible link to the original article on the page.
    Github: https://github.com/cjy513203427
