
Configuring a TensorFlow environment on a remote server

1. To compare against another author's method, I needed the following environment: Python 3.6.4, Keras 2.1.6, TensorFlow 1.7.0.

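On my own machine, which runs anaconda3, I first used the existing environment with tensorflow-gpu 1.13.1. The first part of the code ran, but the code that saves the model did not work and raised an error: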

 callbacks=[ModelCheckpoint(filepath=filepath_INCV, monitor='val_acc', verbose=1, save_best_only=INCV_save_best),

So I had to set up a virtual environment with exactly the same versions.
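Before debugging anything else, it can help to confirm that the active environment really matches the pinned versions. The following is a minimal sketch; the helper names (`installed_version`, `find_mismatches`) and the `REQUIRED` table are illustrative and not part of the original project.

```python
# Sketch: compare installed package versions against the pins from this post.
# The helper names below are illustrative, not part of any library.
import sys

try:
    from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
except ImportError:  # Python 3.6/3.7: fall back to setuptools' pkg_resources
    from pkg_resources import get_distribution, DistributionNotFound

    PackageNotFoundError = DistributionNotFound

    def version(name):
        return get_distribution(name).version

REQUIRED = {"Keras": "2.1.6", "tensorflow-gpu": "1.7.0"}
REQUIRED_PYTHON = (3, 6, 4)


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(required, lookup=installed_version):
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in required.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    if sys.version_info[:3] != REQUIRED_PYTHON:
        print("Python is", sys.version.split()[0], "but 3.6.4 is wanted")
    for name, (wanted, found) in find_mismatches(REQUIRED).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

Running the script inside the new environment prints nothing when every pin is satisfied.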

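2. I created a new virtual environment named tf1.7.0. The first attempt failed, apparently because the Python version I chose did not match exactly; after deleting the environment and starting over, it worked. It also appears that installing TensorFlow online with conda install tensorflow-gpu=1.7.0 automatically pulls in the required CUDA 9.0 toolkit and the cuDNN package.

Running import tensorflow as tf then complains that packages such as numpy and pandas are missing. Installing them one by one with conda install is enough; conda picks compatible versions automatically.

import tensorflow also prints the following warnings: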

/home/guixj/anaconda3/envs/tf1.7.0/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:458: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/guixj/anaconda3/envs/tf1.7.0/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:459: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/guixj/anaconda3/envs/tf1.7.0/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:460: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/guixj/anaconda3/envs/tf1.7.0/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:461: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/guixj/anaconda3/envs/tf1.7.0/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:462: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/guixj/anaconda3/envs/tf1.7.0/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:465: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

This is not a serious problem, though; it can be fixed or left as it is.
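If the FutureWarning lines are only noise, one option is to silence that warning category while the import runs. This is a sketch under that assumption; `import_quietly` and `noisy_import` are hypothetical helpers, and `noisy_import` merely stands in for `import tensorflow` so the example runs without TensorFlow installed.

```python
import warnings


def import_quietly(importer):
    """Run importer() with FutureWarning suppressed and return its result."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        return importer()


def noisy_import():
    """Stand-in for `import tensorflow`: emits the same warning category."""
    warnings.warn("Passing (type, 1) is deprecated", FutureWarning)
    return "module"


if __name__ == "__main__":
    # With TensorFlow installed, the real call would look like:
    #   tf = import_quietly(lambda: __import__("tensorflow"))
    print(import_quietly(noisy_import))
```

The filter is restored when the `with` block exits, so warnings raised by the rest of the program are still shown.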

2019-12-24 10:12:41.915407: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2019-12-24 10:12:41.915435: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2019-12-24 10:12:41.915442: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2019-12-24 10:12:41.915447: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2019-12-24 10:12:41.915469: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

This is not a serious problem either; it can be fixed or left as it is.
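The cpu_feature_guard lines come from TensorFlow's C++ logging, which reads the TF_CPP_MIN_LOG_LEVEL environment variable. A small sketch of hiding them follows; `quiet_tensorflow_logs` is a hypothetical helper name, and the variable has to be set before tensorflow is imported.

```python
import os


def quiet_tensorflow_logs(level="2", environ=os.environ):
    """Set TF_CPP_MIN_LOG_LEVEL; call this before importing tensorflow.

    "0" shows everything, "1" hides INFO, "2" also hides WARNING,
    "3" also hides ERROR.
    """
    if level not in {"0", "1", "2", "3"}:
        raise ValueError("level must be one of '0', '1', '2', '3'")
    environ["TF_CPP_MIN_LOG_LEVEL"] = level
    return environ["TF_CPP_MIN_LOG_LEVEL"]


if __name__ == "__main__":
    quiet_tensorflow_logs("2")
    # import tensorflow as tf  # import only after the variable is set
```

Level "2" is enough here because the messages above are tagged W (warning).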

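After installing keras I hit a small bug caused by an incompatibility between keras and tensorflow. Following https://ask.csdn.net/questions/687001, I changed the axis=axis argument in the tf.nn.softmax(*, axis=axis) call to dim=axis, which fixed the error below: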

softmax() got an unexpected keyword argument 'axis'
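Instead of editing the installed keras source, the same idea can be expressed as a wrapper that tries the `axis` keyword and falls back to `dim`. This is only a sketch: `softmax_compat` is a hypothetical helper, and the two stand-in functions imitate the two signatures so the example runs without TensorFlow.

```python
def softmax_compat(softmax_fn, logits, axis=-1):
    """Call softmax_fn with `axis`, falling back to the older `dim` keyword."""
    try:
        return softmax_fn(logits, axis=axis)
    except TypeError as err:
        # Only fall back for the specific "unexpected keyword" failure.
        if "unexpected keyword argument 'axis'" not in str(err):
            raise
        return softmax_fn(logits, dim=axis)


# Stand-ins that mimic the two signatures, so the sketch runs without TensorFlow.
def old_style_softmax(logits, dim=-1):
    return ("dim", dim)


def new_style_softmax(logits, axis=-1):
    return ("axis", axis)


if __name__ == "__main__":
    print(softmax_compat(old_style_softmax, [1.0, 2.0], axis=1))
    print(softmax_compat(new_style_softmax, [1.0, 2.0], axis=1))
```

With TensorFlow installed, `tf.nn.softmax` would be passed as `softmax_fn`.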

3. Configuring the same environment on a remote Ubuntu 18.04 server

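This is actually a bit troublesome, because tensorflow-gpu 1.7.0 requires CUDA 9.0, and CUDA 9.0 originally only supports installation on Ubuntu 16.04 and Ubuntu 17.10. https://blog.csdn.net/hellocsz/article/details/88372819 points out that Python 3.6.4 corresponds to anaconda3-5.1.0, so I downloaded that version of Anaconda from https://mirrors.tuna.tsinghua.edu.cn.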

https://blog.csdn.net/qq_27825451/article/details/89082978 (which also explains in detail the relationship between CUDA, cuDNN, and the graphics driver) lists the following configuration:

Version: tensorflow_gpu-1.7.0 | Python: 2.7, 3.3-3.6 | Compiler: GCC 4.8 | Build tool: Bazel 0.9.0 | cuDNN: 7 | CUDA: 9

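From the NVIDIA website I downloaded CUDA 9.2 and one patch file for it (there was no build for Ubuntu 18.04, so I took the Ubuntu 17.10 one, although according to the blog below the Ubuntu 16.04 build might have been a better choice), together with the cuDNN files matching CUDA 9.2.

The gcc (g++) compiler and the kernel on Ubuntu 18.04 are both too new, which may cause problems. Following article A (https://www.jianshu.com/p/00c37b09f0f3) and the page it links to, https://www.jianshu.com/p/f66eed3a3a25, I switched the gcc (g++) version.

I then installed CUDA 9.2 as described in article A. On the first attempt I chose to install the display driver as well, and got the following error: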

Installing the NVIDIA display driver...
The driver installation is unable to locate the kernel source. Please make sure that the kernel source packages are installed and set up correctly.
If you know that the kernel source packages are installed and set up correctly, you may pass the location of the kernel source with the '--kernel-source-path' flag.

===========
= Summary =
===========

Driver:   Installation Failed
Toolkit:  Installation skipped
Samples:  Not Selected

Several other blog posts all attribute this to the kernel version being too new:

https://askubuntu.com/questions/829890/nvidia-driver-install-fails-unable-to-locate-the-kernel-source
https://blog.51cto.com/xiaoxiaozhou/2344649?source=dra
2cto.com/net/201904/804672.html
https://blog.csdn.net/net_wolf/article/details/100178800

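They say the kernel has to be downgraded, but that seemed like too much trouble, and article A does not mention it. So, hoping to get lucky, I tried again, this time without selecting the display driver.

Strangely, the toolkit installed successfully this time. Does a too-new kernel only affect the display driver and not the CUDA toolkit? I am not entirely sure yet. One plausible explanation is that the driver installer has to build a kernel module against the kernel sources, as the error message above indicates, whereas the toolkit is user-space software.

Next I installed cuDNN v7.4. Oddly, the downloaded file is named cudnn-9.2, presumably to match the CUDA version number. Setting it up as described in https://blog.csdn.net/fengliang4616/article/details/90142747 is enough.

Then I installed Anaconda. Pay attention to the Anaconda version so that the Python it installs by default is the required Python version; everything else can be left at the defaults.

Surprise! The remote server turned out to have Internet access, so I ran conda install tensorflow-gpu=1.7.0, which went on to reinstall Python and install cuDNN 7.6.4 and CUDA toolkit 9.0: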

Downloading and Extracting Packages
xz 5.2.4: ############################################################## | 100% 
pip 19.3.1: ############################################################ | 100% 
python 3.6.6: ########################################################## | 100% 
absl-py 0.8.1: ######################################################### | 100% 
libedit 3.1.20181209: ################################################## | 100% 
tensorflow-gpu 1.7.0: ################################################## | 100% 
cupti 9.0.176: ######################################################### | 100% 
openssl 1.0.2t: ######################################################## | 100% 
libprotobuf 3.6.0: ##################################################### | 100% 
bleach 1.5.0: ########################################################## | 100% 
cudnn 7.6.4: ########################################################### | 100% 
certifi 2019.11.28: #################################################### | 100% 
sqlite 3.30.1: ######################################################### | 100% 
readline 7.0: ########################################################## | 100% 
libgcc-ng 9.1.0: ####################################################### | 100% 
cudatoolkit 9.0: ####################################################### | 100% 
ca-certificates 2019.11.27: ######################################################################################################################################################################## | 100% 
werkzeug 0.16.0: ################################################################################################################################################################################### | 100% 
grpcio 1.12.1: ##################################################################################################################################################################################### | 100% 
protobuf 3.6.0: #################################################################################################################################################################################### | 100% 
zlib 1.2.11: ####################################################################################################################################################################################### | 100% 
_libgcc_mutex 0.1: ################################################################################################################################################################################# | 100% 
numpy 1.14.2: ###################################################################################################################################################################################### | 100% 
astor 0.8.0: ####################################################################################################################################################################################### | 100% 
gast 0.3.2: ######################################################################################################################################################################################## | 100% 
html5lib 0.9999999: ################################################################################################################################################################################ | 100% 
markdown 3.1.1: #################################################################################################################################################################################### | 100% 
setuptools 42.0.2: ################################################################################################################################################################################# | 100% 
blas 1.0: ########################################################################################################################################################################################## | 100% 
wheel 0.33.6: ###################################################################################################################################################################################### | 100% 
tensorflow-gpu-base 1.7.0: ######################################################################################################################################################################### | 100% 
tk 8.6.8: ########################################################################################################################################################################################## | 100% 
six 1.13.0: ######################################################################################################################################################################################## | 100% 
termcolor 1.1.0: ################################################################################################################################################################################### | 100% 
tensorboard 1.7.0: ################################################################################################################################################################################# | 100% 
ncurses 6.1: ####################################################################################################################################################################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Next, keras 2.1.6 was installed online:

root@ubuntu:~/anaconda3# conda install keras=2.1.6
Solving environment: done
## Package Plan ##
  environment location: /root/anaconda3
  added / updated specs: 
    - keras=2.1.6
The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    tensorflow-base-1.7.0      |   py36hdbcaa40_2        38.7 MB
    keras-2.1.6                |           py36_0         500 KB
    tensorflow-1.7.0           |                0           3 KB
    ------------------------------------------------------------
                                           Total:        39.2 MB
The following NEW packages will be INSTALLED:

    keras:           2.1.6-py36_0        
    tensorflow:      1.7.0-0             
    tensorflow-base: 1.7.0-py36hdbcaa40_2

Proceed ([y]/n)? y
Downloading and Extracting Packages
tensorflow-base 1.7.0: ############################################################################################################################################################################# | 100% 
keras 2.1.6: ####################################################################################################################################################################################### | 100% 
tensorflow 1.7.0: ################################################################################################################################################################################## | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

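After installation the code ran, but there was a problem: keras was calling tensorflow rather than tensorflow-gpu, so the program ran on the CPU without touching the GPU and was extremely slow.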
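A quick way to see which devices TensorFlow 1.x can use is `device_lib.list_local_devices()` from `tensorflow.python.client`. The sketch below wraps it; `gpu_names`, `local_devices`, and `FakeDevice` are hypothetical names, and `FakeDevice` only mimics the two attributes read here so the filter can be tried without TensorFlow.

```python
from collections import namedtuple

# Stand-in with the two attributes this sketch reads from TensorFlow's device records.
FakeDevice = namedtuple("FakeDevice", ["name", "device_type"])


def gpu_names(devices):
    """Return the names of all devices whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def local_devices():
    """Ask TensorFlow 1.x for its local devices (requires TensorFlow)."""
    from tensorflow.python.client import device_lib
    return device_lib.list_local_devices()


if __name__ == "__main__":
    try:
        print("GPUs visible to TensorFlow:", gpu_names(local_devices()))
    except ImportError:
        print("TensorFlow is not installed in this environment")
```

An empty list with TensorFlow installed would match the symptom described above, where everything runs on the CPU.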

I looked up a blog post, https://blog.csdn.net/m0_37151059/article/details/85238696, and ran conda uninstall on keras, tensorflow, and tensorflow-gpu.

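However, tensorflow could still be imported in python, so I guessed it had not been removed cleanly. conda list showed that tensorflow-base and other packages were still present, so, following https://blog.csdn.net/weixin_37142859/article/details/85845559, I uninstalled protobuf: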

root@ubuntu:~/guixj/noisy_label_understanding_utilizing-master# conda uninstall protobuf
Solving environment: done
## Package Plan ##
  environment location: /root/anaconda3
  removed specs: 
    - protobuf

The following packages will be REMOVED:
    protobuf:            3.6.0-py36hf484d3e_0
    tensorboard:         1.7.0-py36hf484d3e_1
    tensorflow-base:     1.7.0-py36hdbcaa40_2
    tensorflow-gpu-base: 1.7.0-py36hcdda91b_1

That removed these packages.

I then ran conda install keras=2.1.6 tensorflow-gpu=1.7.0 again, but conda still proposed installing tensorflow:

root@ubuntu:~/guixj/noisy_label_understanding_utilizing-master# conda install tensorflow-gpu=1.7.0 keras=2.1.6
Solving environment: done
## Package Plan ##
  environment location: /root/anaconda3
  added / updated specs: 
    - keras=2.1.6
    - tensorflow-gpu=1.7.0

The following NEW packages will be INSTALLED:
    keras:               2.1.6-py36_0        
    protobuf:            3.6.0-py36hf484d3e_0
    tensorboard:         1.7.0-py36hf484d3e_1
    tensorflow:          1.7.0-0             
    tensorflow-base:     1.7.0-py36hdbcaa40_2
    tensorflow-gpu:      1.7.0-0             
    tensorflow-gpu-base: 1.7.0-py36hcdda91b_1

So I had to try pip instead. pip was originally extremely slow; switching to the Tsinghua mirror made it much faster (https://www.cnblogs.com/ceeyo/p/11691153.html):

pip install tensorflow-gpu==1.7.0 keras==2.1.6 -i https://pypi.tuna.tsinghua.edu.cn/simple/  # pip requires ==, whereas conda accepts = or ==

Installing collected packages: protobuf, tensorboard, tensorflow-gpu, keras
Successfully installed keras-2.1.6 protobuf-3.11.2 tensorboard-1.7.0 tensorflow-gpu-1.7.0

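As expected, installing with pip did not pull in the CPU build of tensorflow this time, whereas installing with conda does.

Sure enough, the GPU was used this time, but the program seemed to take over both GPUs outright and consumed a large amount of GPU memory, and then failed with: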

2019-12-25 01:46:43.584556: E tensorflow/stream_executor/cuda/cuda_dnn.cc:396] Loaded runtime CuDNN library: 7604 (compatibility version 7600) but source was compiled with 7005 (compatibility version 7000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2019-12-25 01:46:43.585338: F tensorflow/core/kernels/conv_ops.cc:712] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 
Aborted (core dumped)

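The log points to a cuDNN version mismatch: the cuDNN library loaded at runtime is 7.6.4, while this TensorFlow binary was compiled against 7.0.5. One article describes a manual fix, which is rather tedious: https://blog.csdn.net/xhjj520/article/details/78955478.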
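The integers in the error message (7604 and 7005) are packed cuDNN version numbers. For cuDNN 7.x the encoding is major*1000 + minor*100 + patch, so they can be decoded as sketched below. The function names are hypothetical, and the "compatibility version" helper only reproduces what the log prints (the patch level dropped); it is not taken from TensorFlow's source.

```python
def decode_cudnn_version(code):
    """Split a cuDNN 7.x version integer (major*1000 + minor*100 + patch)."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def compatibility_version(code):
    """Drop the patch level, as in the log: 7604 is reported as 7600."""
    return code - code % 100


def versions_compatible(loaded, compiled):
    """Treat two versions as compatible when they agree up to the minor version."""
    return compatibility_version(loaded) == compatibility_version(compiled)


if __name__ == "__main__":
    for label, code in (("loaded", 7604), ("compiled", 7005)):
        print(label, ".".join(str(part) for part in decode_cudnn_version(code)))
    print("compatible:", versions_compatible(7604, 7005))
```

For the values in the log this reports a loaded version of 7.6.4 against a compiled version of 7.0.5, which are not compatible under this rule.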

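I then tried to let conda install the right packages automatically. conda can indeed install cuDNN (https://blog.csdn.net/weixin_40588315/article/details/85881338), but the versions did not seem to line up: for CUDA 9.0 the cuDNN version should be at least 7.1, yet after installing cuDNN 7.1 the error persisted and kept asking for cuDNN 7.0, which I thought corresponded to CUDA 8.0. I tried a series of haphazard conda install commands and still got nowhere; I could not pair cuDNN 7.0.5 with CUDA 9.0. Searching the conda packages turned up only one cuDNN 7.0.5 build, for the CUDA 8 series, even though cuDNN 7.0.5 exists for both CUDA 8 and CUDA 9.

I decided to follow https://blog.csdn.net/mangobar/article/details/93624545 and reinstall CUDA 9.0 with cuDNN 7.0.5.

First, uninstall CUDA 9.2. Because installing cuDNN only copies files into the CUDA directory, deleting CUDA completely removes cuDNN along with it. To remove only cuDNN and keep CUDA, see https://blog.csdn.net/zdx004/article/details/88014160

Then reinstall CUDA 9.0 and the matching cuDNN version.

Next I uninstalled the old Anaconda by simply deleting the whole anaconda-3 folder (https://my.oschina.net/wangsifangyuan/blog/1575464) and reinstalled Anaconda 3.

Once that was done, I installed the packages with pip install tensorflow-gpu==1.7.0 keras==2.1.6 -i https://pypi.tuna.tsinghua.edu.cn/simple/. I did not dare use conda install this time: first, it automatically installs different versions of CUDA and cuDNN, which causes errors; second, it automatically installs the CPU build of tensorflow, which also causes errors.

Test: start python and run import tensorflow as tf, which produced: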

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

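My guess was that the CUDA paths had not been set. Following https://www.jianshu.com/p/00c37b09f0f3, I added the CUDA paths to .bashrc and ran source ~/.bashrc, after which the import worked. Note that accidentally typing exit() undoes the effect of source, so source ~/.bashrc has to be run again.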

export PATH=/usr/local/cuda-9.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH
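To check whether the library named in the ImportError can be found through LD_LIBRARY_PATH, the directories on that path can be scanned directly. This is a rough sketch with a hypothetical helper, `find_library`; it only looks at the given search path and does not consult the system linker cache.

```python
import os


def find_library(filename, search_path):
    """Return the first directory on an os.pathsep-separated path that holds filename."""
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print("libcublas.so.9.0 found in:", find_library("libcublas.so.9.0", path))
```

A result of None in a fresh shell suggests that .bashrc has not been sourced in that session.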

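With the GPU finally in use, the code ran blazingly fast!

However, keras by default grabs all the memory on every GPU. The first time, it took the memory of both cards and then failed. With the settings below it took all the memory of a single card instead. Still, I feel CIFAR-10 should not need that much GPU memory (32 GB); in theory 10 GB ought to be enough.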

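import os
# args.cuda is the GPU index parsed from the command line; set this before TensorFlow creates a session
os.environ['CUDA_VISIBLE_DEVICES'] = str(args.cuda)

import tensorflow as tf  # change
import keras.backend.tensorflow_backend as KTF  # change
# device_count caps how many devices of a type are used; it is a count, not a device index
KTF.set_session(tf.Session(config=tf.ConfigProto(device_count={'GPU': 1})))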
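The snippet above reads `args.cuda` without showing where it comes from. One way to supply it, assuming a `--cuda` command-line flag (the flag name and the `select_gpu` helper are illustrative, not taken from the original project), is:

```python
import argparse
import os


def select_gpu(argv=None, environ=os.environ):
    """Parse --cuda and expose only that GPU index through CUDA_VISIBLE_DEVICES."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cuda", type=int, default=0,
                        help="index of the GPU to expose to this process")
    args = parser.parse_args(argv)
    # Must happen before TensorFlow initialises CUDA, i.e. before the first session.
    environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda)
    return args


# Typical use at the top of a training script:
#   args = select_gpu()        # e.g. python train.py --cuda 1
#   import tensorflow as tf    # TensorFlow now sees a single GPU, numbered 0
```

Inside the process the selected card is always seen as device 0, whatever its physical index.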

Following https://blog.csdn.net/sinat_26917383/article/details/75633754, I set:

# Do not grab all GPU memory; allocate it on demand
import keras.backend.tensorflow_backend as KTF
import tensorflow as tf
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda) 
config = tf.ConfigProto()
config.gpu_options.allow_growth=True  
sess = tf.Session(config=config)
KTF.set_session(sess)

GPU memory usage dropped from 32 GB to 2.8 GB.
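An alternative to on-demand growth is a fixed cap, using the `per_process_gpu_memory_fraction` field of the TF 1.x GPU options. The sketch below turns the "10 GB out of 32 GB" estimate from earlier into such a fraction; `memory_fraction` and `make_config` are hypothetical helpers, and `make_config` needs TensorFlow 1.x to be installed.

```python
def memory_fraction(wanted_gb, total_gb):
    """Return the share of one GPU's memory to reserve, capped at 1.0."""
    if wanted_gb <= 0 or total_gb <= 0:
        raise ValueError("memory sizes must be positive")
    return min(1.0, wanted_gb / total_gb)


def make_config(fraction):
    """Build a TF 1.x ConfigProto limited to `fraction` of each visible GPU."""
    import tensorflow as tf  # imported lazily so memory_fraction works without it
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    return config


if __name__ == "__main__":
    print(memory_fraction(10, 32))
```

The resulting config would be passed to `tf.Session(config=...)` and then to `KTF.set_session`, in the same way as the allow_growth version.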

2020.7.6: Following the DIEN model's code, configure the session so that it does not grab all GPU memory:

import tensorflow as tf

gpu_options = tf.GPUOptions(allow_growth=True)  # borrowed from DIEN
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    # all training code goes here
    pass

The tensorflow-gpu build uses the GPU by default, even when the word GPU never appears in the code (see the SLi_Rec code).

# To be continued


Original article: https://www.cnblogs.com/Gelthin2017/p/12094276.html