  • Fixing multi-GPU training failures in Mask_RCNN

    Tested with:

    Keras: 2.2.4
    Python: 3.6.9
    TensorFlow: 1.12.0

    ==================

    Problem:

    With the code from https://github.com/matterport/Mask_RCNN, setting GPU_COUNT > 1 raises this error when the model is built:

    RuntimeError: It looks like you are subclassing `Model` and you forgot to call `super(YourClass, self).__init__()`. Always start with this line.
    Traceback (most recent call last):
      File "D:\Anaconda33\lib\site-packages\keras\engine\network.py", line 313, in __setattr__
        is_graph_network = self._is_graph_network
      File "parallel_model.py", line 46, in __getattribute__
        return super(ParallelModel, self).__getattribute__(attrname)
    AttributeError: 'ParallelModel' object has no attribute '_is_graph_network'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "parallel_model.py", line 159, in <module>
        model = ParallelModel(model, GPU_COUNT)
      File "parallel_model.py", line 35, in __init__
        self.inner_model = keras_model
      File "D:\Anaconda33\lib\site-packages\keras\engine\network.py", line 316, in __setattr__
        'It looks like you are subclassing `Model` and you '
    RuntimeError: It looks like you are subclassing `Model` and you forgot to call `super(YourClass, self).__init__()`. Always start with this line.

    Solution 1:

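    Change the constructor in mrcnn/parallel_model.py as follows. The only new line is the first super().__init__() call, placed before any attribute is assigned: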

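    import keras.models as KM

    class ParallelModel(KM.Model):
        def __init__(self, keras_model, gpu_count):
            """Class constructor.
            keras_model: The Keras model to parallelize
            gpu_count: Number of GPUs. Must be > 1
            """
            # Added line: initialise the Keras base class first so that the
            # attribute assignments below pass the check in __setattr__.
            super(ParallelModel, self).__init__()
            self.inner_model = keras_model
            self.gpu_count = gpu_count
            # Build one tower per GPU and merge their outputs.
            merged_outputs = self.make_parallel()
            # Initialise again, this time with the real inputs and merged outputs.
            super(ParallelModel, self).__init__(inputs=self.inner_model.inputs,
                                                outputs=merged_outputs)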
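    Why the extra call helps can be shown without Keras. The sketch below is only an illustration: Base is a hypothetical stand-in for the Keras Model class, not its real implementation, and Broken, Fixed and try_build are made-up names. It assumes, as the traceback suggests, that the base class rejects attribute assignment until its own __init__ has run.

```python
class Base:
    """Hypothetical stand-in for keras.models.Model (not the real class)."""

    def __init__(self):
        # Bypass the guard below to create the internal flag.
        object.__setattr__(self, "_initialized", True)

    def __setattr__(self, name, value):
        # Refuse attribute assignment until Base.__init__ has run,
        # similar in spirit to the check that raises in the traceback.
        if not self.__dict__.get("_initialized", False):
            raise RuntimeError("forgot to call super().__init__()")
        object.__setattr__(self, name, value)


class Broken(Base):
    def __init__(self, inner):
        self.inner = inner      # assigned before Base.__init__, so it raises
        super().__init__()


class Fixed(Base):
    def __init__(self, inner):
        super().__init__()      # initialise the base class first
        self.inner = inner      # now the assignment is accepted


def try_build(cls):
    """Return 'ok' or the error message produced by constructing cls."""
    try:
        cls("model")
        return "ok"
    except RuntimeError as err:
        return str(err)
```

    Here try_build(Broken) returns the error message while try_build(Fixed) returns 'ok', which mirrors the effect of moving super().__init__() to the top of ParallelModel.__init__.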

    If the argument-less super().__init__() call then fails with an error asking for two arguments, inputs and outputs, upgrade Keras to 2.2.4.
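    To check whether an installed Keras is older than the version tested here, a small comparison helper can be used. This is a minimal sketch: version_tuple and needs_upgrade are hypothetical helper names, and the 2.2.4 threshold is taken from this post rather than from the Keras changelog.

```python
def version_tuple(version):
    """Turn '2.2.4' into (2, 2, 4); a suffix such as 'rc1' is ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_upgrade(installed, required="2.2.4"):
    """True if the installed version is older than the required one."""
    return version_tuple(installed) < version_tuple(required)
```

    In a training script this could be called as needs_upgrade(keras.__version__) after importing keras.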

    If you then get the following colocation error:

    No node-device colocations were active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation.
    Device assignments active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation:
    with tf.device(/gpu:1): <M:\new\mrcnn\parallel_model.py:70>

    No node-device colocations were active during op 'anchors/Variable' creation.
    No device assignments were active during op 'anchors/Variable' creation.

    Traceback (most recent call last):
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
        return fn(*args)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1317, in _run_fn
        self._extend_graph()
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1352, in _extend_graph
        tf_session.ExtendSession(self._session)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes {{colocation_node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0}} and {{colocation_node anchors/Variable}}: Cannot merge devices with incompatible ids: '/device:GPU:0' and '/device:GPU:1'
             [[{{node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0}} = Identity[T=DT_FLOAT, _class=["loc:@anchors/Variable"], _device="/device:GPU:1"](tower_1/mask_rcnn/anchors/Variable/cond/Merge)]]
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "train_mul.py", line 448, in <module>
        "mrcnn_bbox", "mrcnn_mask"])
      File "M:\new\mrcnn\model.py", line 2132, in load_weights
        saving.load_weights_from_hdf5_group_by_name(f, layers)
      File "D:\Anaconda33\lib\site-packages\keras\engine\saving.py", line 1022, in load_weights_from_hdf5_group_by_name
        K.batch_set_value(weight_value_tuples)
      File "D:\Anaconda33\lib\site-packages\keras\backend\tensorflow_backend.py", line 2440, in batch_set_value
        get_session().run(assign_ops, feed_dict=feed_dict)
      File "D:\Anaconda33\lib\site-packages\keras\backend\tensorflow_backend.py", line 197, in get_session
        [tf.is_variable_initialized(v) for v in candidate_vars])
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
        run_metadata_ptr)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
        feed_dict_tensor, options, run_metadata)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
        run_metadata)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0 (defined at M:\new\mrcnn\model.py:1936) having device Device assignments active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation:
      with tf.device(/gpu:1): <M:\new\mrcnn\parallel_model.py:70>  and node anchors/Variable (defined at M:\new\mrcnn\model.py:1936) having device No device assignments were active during op 'anchors/Variable' creation. : Cannot merge devices with incompatible ids: '/device:GPU:0' and '/device:GPU:1'
             [[node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0 (defined at M:\new\mrcnn\model.py:1936)  = Identity[T=DT_FLOAT, _class=["loc:@anchors/Variable"], _device="/device:GPU:1"](tower_1/mask_rcnn/anchors/Variable/cond/Merge)]]
    
    No node-device colocations were active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation.
    Device assignments active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation:
      with tf.device(/gpu:1): <M:\new\mrcnn\parallel_model.py:70>
    
    No node-device colocations were active during op 'anchors/Variable' creation.
    No device assignments were active during op 'anchors/Variable' creation.
    
    Caused by op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0', defined at:
      File "train_mul.py", line 417, in <module>
        model_dir=MODEL_DIR)
      File "M:\new\mrcnn\model.py", line 1839, in __init__
        self.keras_model = self.build(mode=mode, config=config)
      File "M:\new\mrcnn\model.py", line 2064, in build
        model = ParallelModel(model, config.GPU_COUNT)
      File "M:\new\mrcnn\parallel_model.py", line 36, in __init__
        merged_outputs = self.make_parallel()
      File "M:\new\mrcnn\parallel_model.py", line 80, in make_parallel
        outputs = self.inner_model(inputs)
      File "D:\Anaconda33\lib\site-packages\keras\engine\base_layer.py", line 457, in __call__
        output = self.call(inputs, **kwargs)
      File "D:\Anaconda33\lib\site-packages\keras\engine\network.py", line 570, in call
        output_tensors, _, _ = self.run_internal_graph(inputs, masks)
      File "D:\Anaconda33\lib\site-packages\keras\engine\network.py", line 724, in run_internal_graph
        output_tensors = to_list(layer.call(computed_tensor, **kwargs))
      File "D:\Anaconda33\lib\site-packages\keras\layers\core.py", line 682, in call
        return self.function(inputs, **arguments)
      File "M:\new\mrcnn\model.py", line 1936, in <lambda>
        anchors = KL.Lambda(lambda x: tf.Variable(anchors), name="anchors")(input_image)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 183, in __call__
        return cls._variable_v1_call(*args, **kwargs)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 146, in _variable_v1_call
        aggregation=aggregation)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 125, in <lambda>
        previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 2444, in default_variable_creator
        expected_shape=expected_shape, import_scope=import_scope)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 187, in __call__
        return super(VariableMetaclass, cls).__call__(*args, **kwargs)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 1329, in __init__
        constraint=constraint)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 1480, in _init_from_args
        self._initial_value),
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 2177, in _try_guard_against_uninitialized_dependencies
        return self._safe_initial_value_from_tensor(initial_value, op_cache={})
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 2195, in _safe_initial_value_from_tensor
        new_op = self._safe_initial_value_from_op(op, op_cache)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\ops\variables.py", line 2241, in _safe_initial_value_from_op
        name=new_op_name, attrs=op.node_def.attr)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
        return func(*args, **kwargs)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
        op_def=op_def)
      File "D:\Anaconda33\lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
        self._traceback = tf_stack.extract_stack()
    
    InvalidArgumentError (see above for traceback): Cannot colocate nodes node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0 (defined at M:\new\mrcnn\model.py:1936) having device Device assignments active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation:
      with tf.device(/gpu:1): <M:\new\mrcnn\parallel_model.py:70>  and node anchors/Variable (defined at M:\new\mrcnn\model.py:1936) having device No device assignments were active during op 'anchors/Variable' creation. : Cannot merge devices with incompatible ids: '/device:GPU:0' and '/device:GPU:1'
             [[node tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0 (defined at M:\new\mrcnn\model.py:1936)  = Identity[T=DT_FLOAT, _class=["loc:@anchors/Variable"], _device="/device:GPU:1"](tower_1/mask_rcnn/anchors/Variable/cond/Merge)]]
    
    No node-device colocations were active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation.
    Device assignments active during op 'tower_1/mask_rcnn/anchors/Variable/anchors/Variable/read_tower_1/mask_rcnn/anchors/Variable_0' creation:
      with tf.device(/gpu:1): <M:\new\mrcnn\parallel_model.py:70>
    
    No node-device colocations were active during op 'anchors/Variable' creation.
    No device assignments were active during op 'anchors/Variable' creation.

    Enable soft device placement for the Keras session by adding these lines:

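    import tensorflow as tf
    import keras.backend.tensorflow_backend as KTF

    # Allow TensorFlow to fall back to another device when an op cannot be placed as requested.
    config = tf.ConfigProto()
    config.allow_soft_placement = True
    session = tf.Session(config=config)
    # Make Keras use this session.
    KTF.set_session(session)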

    Solution 2 (not recommended):

    Downgrade Keras to 2.1.3:

    conda install keras=2.1.3

    (This reportedly works for some people, but it did not work for me.)

    Reference:

    https://github.com/matterport/Mask_RCNN/issues/921

    https://github.com/tensorflow/tensorflow/issues/2285

  • Original article: https://www.cnblogs.com/jins-note/p/11671929.html