
    Comparing Differently Trained Models

    At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors compared to the model we trained with SGD and momentum. This raises the question of which solution generalizes better and, if the two models focus on different aspects of the data, how those aspects differ between the methods.

    To keep the analysis simple but still somewhat realistic, we train a linear SVM classifier (W, bias) for a “werewolf” theme. In other words, all movies with that theme are labeled +1, and we sample random movies for the “rest”, which are labeled -1. For the features, we use the 1,500 most frequent keywords. All random seeds were fixed, which means both models start at the same “point”.

    In our first experiment, we only aim to minimize the training error. The SGD method (I) uses standard momentum and an L1 penalty of 0.0005 in combination with mini-batches; the learning rate and momentum were kept at fixed values. The L-BFGS method (II) minimizes the same loss function. Both methods reached an accuracy of 100% on the training data, and training was stopped as soon as the error was zero.

    (I) loss=1.64000 ||W||=3.56, bias=-0.60811 (SGD)
    (II) loss=0.04711 ||W||=3.75, bias=-0.58073 (L-BFGS)
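    The setup above could be sketched roughly as follows. This is a toy reconstruction, not the post's actual code: only the L1 strength (0.0005) and the hinge loss come from the text, while the data, learning rate, momentum value, and full-batch gradient (instead of mini-batches) are assumptions for illustration.

```python
# Rough sketch: the same L1-penalized hinge loss minimized with
# (I) momentum SGD and (II) L-BFGS. Toy data; hyper-parameters assumed.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 200, 50                                  # toy sizes (post: 1,500 features)
X = (rng.random((n, d)) < 0.1).astype(float)    # sparse binary "keyword" features
w_true = rng.normal(size=d)                     # hidden separator for the labels
y = np.where(X @ w_true > 0, 1.0, -1.0)         # +1 = theme, -1 = rest

L1 = 0.0005                                     # penalty strength from the post

def loss_grad(wb):
    """Mean hinge loss plus L1 penalty; returns (loss, gradient)."""
    w, b = wb[:-1], wb[-1]
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    loss = hinge.mean() + L1 * np.abs(w).sum()
    active = (hinge > 0).astype(float)          # samples inside the margin
    gw = -(X * (y * active)[:, None]).mean(axis=0) + L1 * np.sign(w)
    gb = -(y * active).mean()
    return loss, np.append(gw, gb)

# (I) gradient descent with standard momentum and a fixed learning rate
# (full-batch here for brevity; the post uses mini-batches).
wb = np.zeros(d + 1)
velocity = np.zeros_like(wb)
lr, momentum = 0.1, 0.9
for _ in range(300):
    _, g = loss_grad(wb)
    velocity = momentum * velocity - lr * g
    wb = wb + velocity
w_sgd, b_sgd = wb[:-1], wb[-1]

# (II) L-BFGS on the exact same objective.
res = minimize(loss_grad, np.zeros(d + 1), jac=True, method="L-BFGS-B")
w_lbfgs, b_lbfgs = res.x[:-1], res.x[-1]
```

    Note that the hinge loss and the L1 term are not smooth, so L-BFGS is applied here somewhat pragmatically; in practice it still converges to a good solution on this kind of objective.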

    As we can see, the L2 norms of the final weight vectors are similar, as are the biases, but of course we do not care about absolute norms but rather about the correlation of the two solutions. For that reason, we converted both weight vectors W to unit norm and determined the cosine similarity: correlation = W_sgd.T * W_lbfgs = 0.977.

    Since we do not have any empirical data for such correlations, we analyzed the magnitudes of the features in the weight vectors; more precisely, the top-5 most important features:

    (I) werewolf=0.6652, vampire=0.2394, creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
    (II) werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511, teenagers=0.1279
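    Extracting such a ranking is a matter of sorting the features by weight magnitude. A small sketch with a made-up vocabulary and weights (not the actual model parameters):

```python
import numpy as np

def top_k_features(w, vocabulary, k=5):
    # Rank features by the magnitude of their weight, largest first.
    order = np.argsort(-np.abs(w))[:k]
    return [(vocabulary[i], float(w[i])) for i in order]

# Toy example; the real model has 1,500 keyword features.
vocab = ["werewolf", "vampire", "creature", "teenagers"]
w = np.array([0.67, 0.24, 0.19, 0.14])
print(top_k_features(w, vocab, k=2))  # → [('werewolf', 0.67), ('vampire', 0.24)]
```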

    If we also consider the top-12 features of both models, which are pretty similar,

    (I) werewolf, vampire, creature, forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural, mansion, bloodsucker
    (II) werewolf, vampire, monster, creature, teenagers, curse, forbidden-love, supernatural, pregnancy, hunting, undead, beast

    we can see some patterns. First, many of the movies in the dataset seem to combine the theme with love stories that may involve teenagers, which makes sense because this is a very popular pattern these days. Second, vampires and werewolves are very likely to co-occur in the same movie.

    Those patterns were learned by both models regardless of the optimization method, with minor differences that can be seen in the magnitudes of the individual weights in W. However, as the correlation of the parameter vectors confirmed, both solutions are quite close together.

    Bottom line: we should be careful with interpretations since the data at hand was limited, but the results nevertheless confirm that, with proper initialization and hyper-parameters, good solutions can be achieved with both first- and second-order methods. Next, we will study how well the models generalize to unseen data.

     
  • Original post: https://www.cnblogs.com/yymn/p/4842594.html