zoukankan      html  css  js  c++  java

本文记录笔者在Tensorflow使用上的一些错误的集锦,方便后来人迅速查阅解决问题。

我是留白。

我是留白。

CreateSession still waiting for response from worker: /job:worker/replica:0/task:0

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
2018-12-05 22:18:24.565303: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
2018-12-05 22:18:24.565372: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
2018-12-05 22:18:24.569212: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc:
2018-12-05 22:18:26.170901: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
2018-12-05 22:18:26.170969: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
2018-12-05 22:18:26.174856: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3330
2018-12-05 22:18:27.177003: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3376}
2018-12-05 22:18:27.177071: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:3330, 1 -> localhost:3331}
2018-12-05 22:18:27.180980: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3331
2018-12-05 22:18:34.625459: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-12-05 22:18:34.625513: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2018-12-05 22:18:36.231936: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-12-05 22:18:36.231971: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2018-12-05 22:18:37.235899: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-12-05 22:18:37.235952: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0

首先保证job_name,task_index,ps_hosts,worker_hosts这四个参数都是正确的,考虑以下这种情况是不正确的:
在一个IP为192.168.1.100的机器上启动ps或worker进程:

1
2
3
4
--job_name=worker
--task_index=1
--ps_hosts=192.168.1.100:2222,192.168.1.101:2222
--worker_hosts=192.168.1.100:2223,192.168.1.101:2223

因为该进程启动位置是192.168.1.100,但是运行参数中指定的task_index为1,对应的IP地址是ps_hosts或worker_hosts的第二项(第一项的task_index为0),也就是192.168.1.101,和进程本身所在机器的IP不一致。

另外一种情况也会导致该问题的发生,从TensorFlow-1.4开始,分布式会自动使用环境变量中的代理去连接,如果运行的节点之间不需要代理互连,那么将代理的环境变量移除即可,在脚本的开始位置添加代码:
注意这段代码必须写在import tensorflow as tf或者import moxing.tensorflow as mox之前

1
2
3
import os
os.enrivon.pop('http_proxy')
os.enrivon.pop('https_proxy')

— 摘自(https://bbs.huaweicloud.com/blogs/463145f7a1d111e89fc57ca23e93a89f)

ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9’ not found

1
2
3
4
5
6
7
8
9
10
大专栏  Tensorflow 错误集锦class="line">11
12
13
14
/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "trainer.py", line 14, in <module>
import sklearn.datasets
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/sklearn/__init__.py", line 134, in <module>
from .base import clone
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 11, in <module>
from scipy import sparse
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/__init__.py", line 229, in <module>
from .csr import *
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/csr.py", line 15, in <module>
from ._sparsetools import csr_tocsc, csr_tobsr, csr_count_blocks,
ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/scipy/sparse/_sparsetools.cpython-36m-x86_64-linux-gnu.so)

系统的库文件较老,不含CXXABI_1.3.9,可将<Anaconda_PATH>/lib加入LD_LIBRARY_PATH中,像这样:

1
export LD_LIBRARY_PATH=/home/../anaconda3/lib:$

如此,系统会先找到anaconda里面的lib,从而满足要求。

参考:Stackoverflow. 问题2

分布式Tensorflow, ps端运行出现tensorflow.python.framework.errors_impl.UnavailableError: OS Error

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
2018-12-07 15:40:05.167922: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:3333}
2018-12-07 15:40:05.167970: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> 192.168.100.36:3333, 1 -> 192.168.100.37:3333}
2018-12-07 15:40:05.171857: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:3333
Parameter server: waiting for cluster connection...
2018-12-07 15:40:05.213496: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:0 returned error: Unavailable: OS Error
2018-12-07 15:40:05.213645: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:1 returned error: Unavailable: OS Error
Traceback (most recent call last):
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _run_fn
self._extend_graph()
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1352, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "trainer.py", line 364, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "trainer.py", line 70, in main
num_classes=num_classes)
File "trainer.py", line 138, in parameter_server
sess.run(tf.report_uninitialized_variables())
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/experiment/huqiu/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

如上所示,运行多机分布式 tensorflow 的 parameter server 进程时,出现这个错误。
这里说道:

This has been troubling me for a while. I found out that the problem
is that GRPC uses the native “epoll” polling engine for communication.
Changing this to a portable polling engine solved this issue for me.
The way to do is to set the environment variable,
“GRPC_POLL_STRATEGY=poll” before running the tensorflow programs. This
solved this issue for me. For reference, see,
https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.

按照其所属,在环境变量中新增一条:

1
export GRPC_POLL_STRATEGY=poll

成功解决问题。

参考文献

查看全文
  • 相关阅读:
    Luogu P3919【模板】可持久化数组(可持久化线段树/平衡树)
    线段树||BZOJ5194: [Usaco2018 Feb]Snow Boots||Luogu P4269 [USACO18FEB]Snow Boots G
    线段树||BZOJ1593: [Usaco2008 Feb]Hotel 旅馆||Luogu P2894 [USACO08FEB]酒店Hotel
    CF 610E. Alphabet Permutations
    BZOJ 1227: [SDOI2009]虔诚的墓主人
    BZOJ1009: [HNOI2008]GT考试
    BZOJ3674: 可持久化并查集加强版
    BZOJ3261: 最大异或和
    BZOJ2741: 【FOTILE模拟赛】L
    BZOJ3166: [Heoi2013]Alo
  • 原文地址:https://www.cnblogs.com/lijianming180/p/12360826.html
  • Copyright © 2011-2022 走看看