tensorflow-ranking bugs
1 Assigning to a global variable inside a metric function
Error:
TypeError: An op outside of the function building code is being passed
a "Graph" tensor. It is possible to have Graph tensors
leak out of the function building context by including a
tf.init_scope in your function building code.
For example, the following function will fail:
  @tf.function
  def has_init_scope():
    my_constant = tf.constant(1.)
    with tf.init_scope():
      added = my_constant * 2
The graph tensor has name: add:0
Code that triggers the error:
top_one_time = 0
def top_one_accuracy(y_true, y_pred):
    max_idx_gt = tf.argsort(y_true)[:, -1]
    max_idx_pred = tf.argsort(y_pred)[:, -1]
    judge = tf.equal(max_idx_gt, max_idx_pred)
    num_true = tf.reduce_sum(tf.cast(judge, tf.int32))
    global top_one_time
    top_one_time += num_true
    return top_one_time
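Scenario:
Assigning to a global variable inside a metric function.
Troubleshooting steps:
- Isolated the failure to this statement by changing one thing at a time.
- First guess was a bug in the TensorFlow framework. Searching online suggested a possible cause: a variable was initialized outside of init_scope and then used inside init_scope.
- One suggested fix is to disable TF eager mode by adding the following statement:
tf.compat.v1.disable_eager_execution()
After trying it, a new error appeared.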
Error:
tensorflow.python.framework.errors_impl.FailedPreconditionError: 3 root error(s) found.
(0) Failed precondition: Error while reading resource variable metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total/N10tensorflow3VarE does not exist.
[[{{node metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/value/ReadVariableOp}}]]
[[gt/Squeeze/_283]]
(1) Failed precondition: Error while reading resource variable metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total/N10tensorflow3VarE does not exist.
[[{{node metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/value/ReadVariableOp}}]]
[[loss/gt_loss/pairwise_logistic_loss/weighted_loss/num_present/broadcast_weights/assert_broadcastable/is_valid_shape/else/_291/has_valid_nonscalar_shape/then/_1005/has_invalid_dims/ExpandDims_1/_371]]
(2) Failed precondition: Error while reading resource variable metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total/N10tensorflow3VarE does not exist.
[[{{node metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/value/ReadVariableOp}}]]
0 successful operations.
0 derived errors ignored.
A search turned up the following suggested fix:
from tensorflow.python.keras.backend import set_session
from tensorflow.python.keras.models import load_model
tf_config = some_custom_config
sess = tf.Session(config=tf_config)
graph = tf.get_default_graph()
# IMPORTANT: models have to be loaded AFTER SETTING THE SESSION for keras!
# Otherwise, their weights will be unavailable in the threads after the session there has been set
set_session(sess)
model = load_model(...)
# and then in each request (i.e. in each thread):
global sess
global graph
with graph.as_default():
    set_session(sess)
    model.predict(...)
This had no effect.
- So I went back to the original version to look for another angle and, guessing at the cause, changed the code to:
def top_one_accuracy(y_true, y_pred):
    max_idx_gt = tf.argsort(y_true)[:, -1]
    max_idx_pred = tf.argsort(y_pred)[:, -1]
    judge = tf.equal(max_idx_gt, max_idx_pred)
    num_true = tf.reduce_sum(tf.cast(judge, tf.int32))
    return num_true
The error went away.
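- Further experiment: changing the code so that it still updates the global but returns only the per-batch count:
top_one_time = 0
def top_one_accuracy(y_true, y_pred):
    max_idx_gt = tf.argsort(y_true)[:, -1]
    max_idx_pred = tf.argsort(y_pred)[:, -1]
    judge = tf.equal(max_idx_gt, max_idx_pred)
    num_true = tf.reduce_sum(tf.cast(judge, tf.int32))
    global top_one_time
    top_one_time += num_true
    return num_true
It still raises the error, so the assignment to the global itself is the trigger, not the return value.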
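The running total that the global top_one_time was meant to hold can instead live in variables owned by a metric object. The following is a minimal sketch, assuming a TF 2.x Keras setup. The class name TopOneAccuracy and its weight names are illustrative and not from the original code, and it reports a ratio of hits to lists rather than a raw count. A plausible reading of the failures above (not confirmed here) is that top_one_time += num_true rebinds the Python global to a tensor that belongs to the traced graph, which later code then touches from outside that graph.

```python
import tensorflow as tf


class TopOneAccuracy(tf.keras.metrics.Metric):
    """Fraction of lists whose top-scored item is also the top-labeled item.

    Illustrative sketch: state is kept in metric weights, not in a Python global.
    """

    def __init__(self, name="top_one_accuracy", **kwargs):
        super().__init__(name=name, **kwargs)
        # Variables are created once here, not on every call of the metric.
        self.correct = self.add_weight(name="correct", initializer="zeros")
        self.total = self.add_weight(name="total", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Same index logic as the original function: last index of an ascending sort.
        max_idx_gt = tf.argsort(y_true)[:, -1]
        max_idx_pred = tf.argsort(y_pred)[:, -1]
        hit = tf.cast(tf.equal(max_idx_gt, max_idx_pred), self.dtype)
        self.correct.assign_add(tf.reduce_sum(hit))
        self.total.assign_add(tf.cast(tf.size(hit), self.dtype))

    def result(self):
        # divide_no_nan returns 0 before any batch has been seen.
        return tf.math.divide_no_nan(self.correct, self.total)
```

Under these assumptions it would be passed as an instance, for example model.compile(..., metrics=[TopOneAccuracy()]), and Keras resets its weights at the start of each epoch.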
2 Using functions from tensorflow-ranking.metrics directly as metric functions
Error:
ValueError: tf.function-decorated function tried to create variables on non-first call.
Code that triggers the error:
model.compile(metrics=[tfr.metrics.normalized_discounted_cumulative_gain, tfr.metrics.mean_reciprocal_rank])
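Scenario:
Using functions from tensorflow-ranking.metrics directly as metrics.
Troubleshooting steps:
- Isolated the failure to this statement by changing one thing at a time.
- First guess was a bug in the TensorFlow framework. Searching online suggested that the @tf.function decorator was being used incorrectly, but I was not using it at all.
- So I started reading the tf-ranking source.
Source: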
def normalized_discounted_cumulative_gain(
    labels,
    predictions,
    weights=None,
    topn=None,
    name=None,
    gain_fn=_DEFAULT_GAIN_FN,
    rank_discount_fn=_DEFAULT_RANK_DISCOUNT_FN):
  """Computes normalized discounted cumulative gain (NDCG).
  Args:
    labels: A `Tensor` of the same shape as `predictions`.
    predictions: A `Tensor` with shape [batch_size, list_size]. Each value is
      the ranking score of the corresponding example.
    weights: A `Tensor` of the same shape of predictions or [batch_size, 1]. The
      former case is per-example and the latter case is per-list.
    topn: A cutoff for how many examples to consider for this metric.
    name: A string used as the name for this metric.
    gain_fn: (function) Transforms labels. Note that this implementation of
      NDCG assumes that this function is *increasing* as a function of its
      input.
    rank_discount_fn: (function) The rank discount function. Note that this
      implementation of NDCG assumes that this function is *decreasing* as a
      function of its input.
  Returns:
    A metric for the weighted normalized discounted cumulative gain of the
    batch.
  """
  metric = metrics_impl.NDCGMetric(name, topn, gain_fn, rank_discount_fn)
  with tf.compat.v1.name_scope(metric.name,
                               'normalized_discounted_cumulative_gain',
                               (labels, predictions, weights)):
    per_list_ndcg, per_list_weights = metric.compute(labels, predictions,
                                                     weights)
  return tf.compat.v1.metrics.mean(per_list_ndcg, per_list_weights)
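For checking metric outputs by hand, the quantity described in the docstring above can be computed for a single list in plain Python. This is a minimal reference sketch, not the library's implementation: the names dcg and ndcg are illustrative, weights are ignored, and the gain 2^label - 1 and discount 1 / log2(1 + rank) are assumed to correspond to _DEFAULT_GAIN_FN and _DEFAULT_RANK_DISCOUNT_FN.

```python
import math


def dcg(labels_in_ranked_order, topn=None):
    """Discounted cumulative gain for labels already sorted by predicted score."""
    ranked = labels_in_ranked_order[:topn]  # topn=None keeps the whole list
    # Assumed gain: 2^label - 1. Assumed discount: 1 / log2(1 + rank), rank from 1.
    return sum((2.0 ** label - 1.0) / math.log2(1.0 + rank)
               for rank, label in enumerate(ranked, start=1))


def ndcg(labels, scores, topn=None):
    """NDCG of one list: DCG of the predicted order over DCG of the ideal order."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    by_score = [label for _, label in pairs]
    ideal = sorted(labels, reverse=True)
    ideal_dcg = dcg(ideal, topn)
    # A list with no relevant item has no defined NDCG; return 0.0 here.
    return dcg(by_score, topn) / ideal_dcg if ideal_dcg > 0 else 0.0


print(ndcg([0, 1], [0.2, 0.8]))  # relevant item ranked first: 1.0
print(ndcg([1, 0], [0.3, 0.7]))  # relevant item ranked second: about 0.63
```

Averaging these per-list values over a small batch gives a number to compare against whatever a Keras metric reports for the same data.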
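Reading the source of normalized_discounted_cumulative_gain shows that every call creates a new metrics_impl.NDCGMetric object. This may cause variable-creating code to run on calls after the first (initializing) one, which would trigger the error (suspected cause).
- So I wrote my own function to replace it: create the metrics_impl.NDCGMetric object once up front, then call its compute method on every invocation.
ndcg_topn = tfr.metrics.metrics_impl.NDCGMetric('ndcg_topn', app.transform_param_config.n)
def metric_ndcg_topn(y_true, y_pred):
    return ndcg_topn.compute(y_true, y_pred, None)
Calling code:
model.compile(metrics=metric_ndcg_topn)
The error went away.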
- Further experiment: the following code still raises the error, which suggests the problem lies in tf.compat.v1.metrics.mean.
ndcg_topn = tfr.metrics.metrics_impl.NDCGMetric('ndcg_topn', app.transform_param_config.n)
def metric_ndcg_topn(y_true, y_pred):
    per_list_ndcg, per_list_weights = ndcg_topn.compute(y_true, y_pred, None)
    return tf.compat.v1.metrics.mean(per_list_ndcg, per_list_weights)
- Further experiment: the following code raises no error, but the output is incorrect.
ndcg_topn = tfr.metrics.metrics_impl.NDCGMetric('ndcg_topn', app.transform_param_config.n)
mean = tf.keras.metrics.Mean()
def metric_ndcg_topn(y_true, y_pred):
    per_list_ndcg, per_list_weights = ndcg_topn.compute(y_true, y_pred, None)
    return mean(per_list_ndcg, per_list_weights)
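One way to average per-list values without creating state on every call is to subclass tf.keras.metrics.Mean, so that its variables are created once in the constructor and Keras drives update_state and result itself. This is a minimal sketch under assumptions: PerListMean and top_one_hit are hypothetical names, and top_one_hit is only a stand-in for a function such as ndcg_topn.compute that returns (per_list_values, per_list_weights). It has not been run against tensorflow-ranking here.

```python
import tensorflow as tf


class PerListMean(tf.keras.metrics.Mean):
    """Weighted running mean of a per-list quantity (hypothetical helper)."""

    def __init__(self, per_list_fn, name="per_list_mean", **kwargs):
        # Mean creates its accumulator variables in the constructor, once.
        super().__init__(name=name, **kwargs)
        self._per_list_fn = per_list_fn

    def update_state(self, y_true, y_pred, sample_weight=None):
        # per_list_fn is assumed to return (values, weights), one entry per list.
        values, weights = self._per_list_fn(y_true, y_pred)
        return super().update_state(values, sample_weight=weights)


def top_one_hit(y_true, y_pred):
    """Stand-in per-list function: 1.0 when the top-scored item is the top label."""
    hit = tf.equal(tf.argmax(y_true, axis=-1), tf.argmax(y_pred, axis=-1))
    values = tf.cast(hit, tf.float32)
    return values, tf.ones_like(values)  # equal weight for every list
```

Under these assumptions the metric would be passed as an instance, for example model.compile(..., metrics=[PerListMean(top_one_hit)]), rather than as a plain function that calls a Mean object internally.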