Registering a New TVM Operator in Relay


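This document walks through the steps needed to register a new TVM operator in Relay. As a running example we follow the PR that adds a cumulative product operation. That PR itself builds on another PR that adds a cumulative sum operation.

Registering a new operator requires a few steps: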
1. Add an attribute node declaring fixed arguments which are known at compile time
2. Write a type relation for your operation to integrate into Relay’s type system.
3. Use the RELAY_REGISTER_OP macro in C++ to register the operator’s arity, type, and other hints for the compiler
4. Write how the operator is computed
5. Register the compute, schedule with the relay operator
6. Define a C++ function to produce a call node for the operator and register a Python API hook for the function
7. Wrap the above Python API hook in a neater interface
8. Write tests for the new Relay operator
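Detailed Steps
1. Add an attribute node declaring fixed arguments which are known at compile time
Attributes are fixed arguments that are supposed to be known at compile time. The stride and dilation of a convolution operator are good examples of fields that belong in that operator's attribute node.

Attributes should be defined in a file within the folder include/tvm/relay/attrs/.

Ultimately, we want to create an operator whose interface can be seen clearly in the final Python interface: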

```
def cumprod(data, axis=None, dtype=None, exclusive=None):
    """Numpy style cumprod op. Return the cumulative inclusive product of the elements along
    a given axis.

    Parameters
    ----------
    data : relay.Expr
        The input data to the operator.

    axis : int, optional
        Axis along which the cumulative product is computed. The default (None) is to compute
        the cumprod over the flattened array.

    dtype : string, optional
        Type of the returned array and of the accumulator in which the elements are multiplied.
        If dtype is not specified, it defaults to the dtype of data.

    exclusive : bool, optional
        If true will return exclusive product in which the first element is not
        included. In other terms, if true, the j-th output element would be
        the product of the first (j-1) elements. Otherwise, it would be the product of
        the first j elements. The product of zero elements will be 1.

    Returns
    -------
    result : relay.Expr
        The result has the same size as data, and the same shape as data if axis is not None.
        If axis is None, the result is a 1-d array.
    """
```

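A similar interface exists for cumsum().

Therefore, when defining our attributes in include/tvm/relay/attrs/transform.h, we choose the axis, the accumulation dtype, and the exclusivity of the operation as the fields of the struct.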
```
/*! \brief Attributes used in cumsum and cumprod operator */
struct ScanopAttrs : public tvm::AttrsNode<ScanopAttrs> {
  Integer axis;
  DataType dtype;
  Bool exclusive = Bool(false);
  TVM_DECLARE_ATTRS(ScanopAttrs, "relay.attrs.ScanopAttrs") {
    TVM_ATTR_FIELD(axis).describe("The axis to operate over").set_default(NullValue<Integer>());
    TVM_ATTR_FIELD(dtype).describe("Output data type").set_default(NullValue<DataType>());
    TVM_ATTR_FIELD(exclusive)
        .describe("The first element is not included")
        .set_default(Bool(false));
  }
};
```
2. Writing a Type Relation

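To allow for flexibility in registering operators, and for greater expressivity and granularity when expressing types in Relay, operators are typed using relations between input and output types. These relations are represented as functions that take in a list of input types and output types (any of which may be incomplete) and return a list of input and output types that satisfies the relation. This includes shape information that can be determined statically at compile time. Essentially, a relation for an operator can enforce all the necessary typing rules (namely by inspecting the input types) in addition to computing the output type.

The type relation for the cumulative sum and product operators can be found in src/relay/op/tensor/transform.cc: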
```
TVM_REGISTER_NODE_TYPE(ScanopAttrs);
bool ScanopRel(const Array<Type>& types, int num_inputs, const Attrs& attrs, const TypeReporter& reporter) {
    // types: [data, output]
    ICHECK_EQ(types.size(), 2) << "Expects two types, one for the input and another for the output";
    const auto* data = types[0].as<TensorTypeNode>();
    if (data == nullptr) {
        ICHECK(types[0].as<IncompleteTypeNode>())
        << "Scanop: expect input type to be TensorType but get " << types[0];
        return false;
    }

    const auto* param = attrs.as<ScanopAttrs>();

    auto dtype = param->dtype;
    if (dtype.is_void()) {
        dtype = data->dtype;
    }

    if (param->axis.defined()) {
        reporter->Assign(types[1], TensorType(data->shape, dtype));
    } else {
        auto prod = data->shape[0];
        for (size_t i = 1; i < data->shape.size(); ++i) {
            prod = prod * data->shape[i];
        }
        reporter->Assign(types[1], TensorType({prod}, dtype));
    }

    return true;
}
```
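To make the rule encoded by ScanopRel easier to see, here is a minimal plain-Python sketch of the same shape and dtype logic for fully static shapes. The function name `scanop_output_type` and its signature are assumptions made for illustration; they are not part of TVM, and the sketch ignores incomplete types and symbolic dimensions.

```python
from functools import reduce
from operator import mul
from typing import Optional, Sequence, Tuple


def scanop_output_type(
    shape: Sequence[int],
    data_dtype: str,
    axis: Optional[int] = None,
    dtype: Optional[str] = None,
) -> Tuple[Tuple[int, ...], str]:
    """Mirror the shape and dtype rules of ScanopRel for static shapes.

    Hypothetical helper for illustration only; not a TVM API.
    """
    # An unspecified dtype falls back to the dtype of the input data.
    out_dtype = data_dtype if dtype is None else dtype
    if axis is not None:
        # With an explicit axis the output keeps the input shape.
        return tuple(shape), out_dtype
    # Without an axis the input is flattened, so the output is 1-d
    # with as many elements as the input has in total.
    return (reduce(mul, shape, 1),), out_dtype


print(scanop_output_type((2, 3, 4), "float32", axis=1))
print(scanop_output_type((2, 3, 4), "float32", dtype="float64"))
```

The first call keeps the shape (2, 3, 4), while the second flattens it to a single dimension of 24 elements and switches the dtype to the requested one.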
3. Relating the Arity and Attributes to an Operation

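We then register the names of our new operators and annotate them with the calling interface. The RELAY_REGISTER_OP macro in C++ allows a developer to specify the following information about an operator in Relay: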
- Arity (number of arguments)
- Names and descriptions for positional arguments
- Support level (1 indicates an internal intrinsic; higher numbers indicate less integral or externally supported operators)
- A type relation for the operator
- Other annotations useful when optimizing the operation.
Once again we add this to src/relay/op/tensor/transform.cc:

```
RELAY_REGISTER_OP("cumsum")
    .describe(
        R"doc(Return the cumulative sum of the elements along a given axis.)doc" TVM_ADD_FILELINE)
    .set_num_inputs(1)
    .add_argument("data", "Tensor", "The input tensor.")
    .set_support_level(3)
    .add_type_rel("Cumsum", ScanopRel)
    .set_attr<TOpPattern>("TOpPattern", kOpaque);

RELAY_REGISTER_OP("cumprod")
    .describe(
        R"doc(Return the cumulative product of the elements along a given axis.)doc" TVM_ADD_FILELINE)
    .set_num_inputs(1)
    .add_argument("data", "Tensor", "The input tensor.")
    .set_support_level(3)
    .add_type_rel("Cumprod", ScanopRel)
    .set_attr<TOpPattern>("TOpPattern", kOpaque);
```

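In this case, TOpPattern is a hint to the compiler about the pattern of computation the operator performs, which can be useful for fusing operators. kOpaque tells TVM not to bother trying to fuse this operator.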

4. Defining the Compute of the Operation

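While we have now defined the interface for our operators, we still need to define how to perform the actual calculations for cumulative sum and product.

Writing this code is outside the scope of this tutorial. For now, we assume we have a well-tested implementation of the operator's compute. For more details on how to write one, we recommend the tutorials on tensor expressions and on TVM's operator inventory (topi), as well as the example cumulative sum and product implementations in python/tvm/topi/scan.py and the GPU versions in python/tvm/topi/cuda/scan.py. For the cumulative sum and product operations, the implementations are written directly in TIR, which is the representation that tensor expressions and topi lower into.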
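Although the real compute is written in TIR, it can help to have a small reference for the semantics it must reproduce. The following is a plain-Python sketch over a flat sequence, following the inclusive and exclusive behaviour described in the cumprod docstring; the helper names `scan_1d`, `cumsum_ref`, and `cumprod_ref` are hypothetical and do not exist in topi.

```python
from operator import add, mul
from typing import Callable, List, Sequence


def scan_1d(
    values: Sequence[float],
    combine: Callable[[float, float], float],
    identity: float,
    exclusive: bool = False,
) -> List[float]:
    """Reference cumulative scan over a flat sequence (illustration only)."""
    out = []
    acc = identity
    for v in values:
        if exclusive:
            # Exclusive: emit the running value before folding in v.
            out.append(acc)
            acc = combine(acc, v)
        else:
            # Inclusive: fold in v first, then emit.
            acc = combine(acc, v)
            out.append(acc)
    return out


def cumsum_ref(values, exclusive=False):
    return scan_1d(values, add, 0, exclusive)


def cumprod_ref(values, exclusive=False):
    return scan_1d(values, mul, 1, exclusive)


print(cumsum_ref([1, 2, 3, 4]))                 # [1, 3, 6, 10]
print(cumprod_ref([2, 3, 4]))                   # [2, 6, 24]
print(cumprod_ref([2, 3, 4], exclusive=True))   # [1, 2, 6]
```

In the exclusive case the first output is the identity element (1 for a product), which matches the docstring's statement that the product of zero elements is 1.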

5. Hooking up Compute and Strategy with Relay

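After you have implemented the compute function, it needs to be glued to the Relay operator. Within TVM this means defining not only the computation but also the schedule for the operator. A strategy is a method that picks which computation and which schedule to use. For example, for 2D convolutions we might recognize that we are doing a depthwise convolution and dispatch to a more efficient computation and schedule as a result. In our case, however, there is no such need beyond dispatching between the CPU and GPU implementations. In python/tvm/relay/op/strategy/generic.py and python/tvm/relay/op/strategy/cuda.py we add the following strategies: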

```
def wrap_compute_scanop(topi_compute):
    """Wrap scanop style topi compute"""

    def _compute_scanop(attrs, inputs, _):
        return [topi_compute(inputs[0], attrs.axis, attrs.dtype, attrs.exclusive)]

    return _compute_scanop


@override_native_generic_func("cumsum_strategy")
def cumsum_strategy(attrs, inputs, out_type, target):
    """cumsum generic strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cumsum),
        wrap_topi_schedule(topi.generic.schedule_extern),
        name="cumsum.generic",
    )
    return strategy


@override_native_generic_func("cumprod_strategy")
def cumprod_strategy(attrs, inputs, out_type, target):
    """cumprod generic strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cumprod),
        wrap_topi_schedule(topi.generic.schedule_extern),
        name="cumprod.generic",
    )
    return strategy


@cumsum_strategy.register(["cuda", "gpu"])
def cumsum_strategy_cuda(attrs, inputs, out_type, target):
    """cumsum cuda strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cuda.cumsum),
        wrap_topi_schedule(topi.cuda.schedule_scan),
        name="cumsum.cuda",
    )
    return strategy


@cumprod_strategy.register(["cuda", "gpu"])
def cumprod_strategy_cuda(attrs, inputs, out_type, target):
    """cumprod cuda strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cuda.cumprod),
        wrap_topi_schedule(topi.cuda.schedule_scan),
        name="cumprod.cuda",
    )
    return strategy
```
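The dispatch pattern used by these strategies (a generic default plus overrides registered under target keys) can be illustrated with a toy dispatcher. This is only a sketch of the idea: the `GenericStrategy` class below is hypothetical, returns plain strings instead of OpStrategy objects, and is not how TVM implements `override_native_generic_func`.

```python
from typing import Callable, Dict, Iterable, Sequence


class GenericStrategy:
    """Toy target-keyed dispatcher (hypothetical, not TVM's class)."""

    def __init__(self, default: Callable[[], str]) -> None:
        self._default = default
        self._overrides: Dict[str, Callable[[], str]] = {}

    def register(self, keys: Iterable[str]):
        """Decorator that registers an override for each target key."""

        def decorator(func: Callable[[], str]) -> Callable[[], str]:
            for key in keys:
                self._overrides[key] = func
            return func

        return decorator

    def __call__(self, target_keys: Sequence[str]) -> str:
        # Use the first target key that has an override, else the default.
        for key in target_keys:
            if key in self._overrides:
                return self._overrides[key]()
        return self._default()


def _cumsum_generic() -> str:
    return "cumsum.generic"


cumsum_strategy = GenericStrategy(_cumsum_generic)


@cumsum_strategy.register(["cuda", "gpu"])
def _cumsum_cuda() -> str:
    return "cumsum.cuda"


print(cumsum_strategy(["llvm", "cpu"]))  # falls back to the generic strategy
print(cumsum_strategy(["cuda", "gpu"]))  # picks the registered override
```

The design point is that the operator only refers to one strategy function, and targets opt in to specialised implementations by registering under their own keys.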

In each strategy, add_implementation() names the compute we wrote and the schedule to use with it. Finally, we link the strategy and compute to the Relay operator in python/tvm/relay/op/_transform.py:

```
# cumsum
@_reg.register_compute("cumsum")
def compute_cumsum(attrs, inputs, output_type):
    """Compute definition of cumsum"""
    return [topi.cumsum(inputs[0], attrs.axis, attrs.dtype, attrs.exclusive)]


_reg.register_strategy("cumsum", strategy.cumsum_strategy)
_reg.register_shape_func("cumsum", False, elemwise_shape_func)


# cumprod
@_reg.register_compute("cumprod")
def compute_cumprod(attrs, inputs, output_type):
    """Compute definition of cumprod"""
    return [topi.cumprod(inputs[0], attrs.axis, attrs.dtype, attrs.exclusive)]


_reg.register_strategy("cumprod", strategy.cumprod_strategy)
_reg.register_shape_func("cumprod", False, elemwise_shape_func)
```

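The shape functions determine the output shape given a dynamically shaped tensor. In this case we tell TVM that the output shape will be the same as the input shape.

6. Creating a Relay Call Node and Exposing a Python Hook

We now have a working operator and just need to call it properly via a Relay call node. This step only requires writing a function that takes the arguments to the operator (as Relay expressions) and returns a call node to the operator (that is, the node that should be placed into the Relay AST where the call to the operator is intended).

At present, call attributes and type arguments (the last two fields) are not supported, so it suffices to use Op::Get to fetch the operator's information from the operator registry and pass the arguments to the call node, as shown below. In src/relay/op/tensor/transform.cc: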

```
Expr MakeCumsum(Expr data, Integer axis, DataType dtype, Bool exclusive) {
    auto attrs = make_object<ScanopAttrs>();
    attrs->dtype = dtype;
    attrs->axis = axis;
    attrs->exclusive = exclusive;
    static const Op& op = Op::Get("cumsum");
    return Call(op, {data}, Attrs(attrs), {});
}

TVM_REGISTER_GLOBAL("relay.op._make.cumsum").set_body_typed(MakeCumsum);

Expr MakeCumprod(Expr data, Integer axis, DataType dtype, Bool exclusive) {
    auto attrs = make_object<ScanopAttrs>();
    attrs->dtype = dtype;
    attrs->axis = axis;
    attrs->exclusive = exclusive;
    static const Op& op = Op::Get("cumprod");
    return Call(op, {data}, Attrs(attrs), {});
}

// Register under the cumprod name so it does not collide with cumsum.
TVM_REGISTER_GLOBAL("relay.op._make.cumprod").set_body_typed(MakeCumprod);
```

Here TVM_REGISTER_GLOBAL exposes the MakeCumsum and MakeCumprod functions in Python as relay.op._make.cumsum(...) and relay.op._make.cumprod(...), respectively.

7. Including a Cleaner Python API Hook

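The general convention in Relay is that functions exported through TVM_REGISTER_GLOBAL should be wrapped in a separate Python function rather than called directly from Python. For our operators, we expose this cleaner interface in python/tvm/relay/op/transform.py: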

```
def cumsum(data, axis=None, dtype=None, exclusive=None):
    return _make.cumsum(data, axis, dtype, exclusive)


def cumprod(data, axis=None, dtype=None, exclusive=None):
    return _make.cumprod(data, axis, dtype, exclusive)
```

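Note that these Python wrappers can also be a good opportunity to provide an easier interface to the operator. For example, the concat operator is registered as taking only one argument, namely a tuple of the tensors to be concatenated, but the Python wrapper takes the tensors as separate arguments and combines them into a tuple before producing the call node: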

```
def concat(*args):
    """Concatenate the input tensors along the zero axis.

    Parameters
    ----------
    args: list of Tensor

    Returns
    -------
    tensor: The concatenated tensor.
    """
    tup = Tuple(list(args))
    return _make.concat(tup)
```

8. Writing Unit Tests!

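Some example unit tests for the cumulative sum and product operators can be found in tests/python/relay/test_op_level3.py.

Other Topics

Gradient Operators

Gradient operators are important for writing differentiable programs in Relay. While Relay's autodiff algorithm can differentiate first-class language constructs, operators are opaque. Because Relay cannot look into an operator's implementation, an explicit differentiation rule must be provided.

Both Python and C++ can be used to write gradient operators, but the examples here focus on Python, as it is more commonly used.

Adding a Gradient in Python

A collection of Python gradient operators can be found in python/tvm/relay/op/_tensor_grad.py. We will walk through two representative examples: sigmoid and multiply.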

```
@register_gradient("sigmoid")
def sigmoid_grad(orig, grad):
    """Returns [grad * sigmoid(x) * (1 - sigmoid(x))]."""
    return [grad * orig * (ones_like(orig) - orig)]
```

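The inputs here are the original operator orig and a gradient grad to accumulate into. What we return is a list, where the element at the i-th index is the derivative of the operator with respect to the operator's i-th input. In general, the gradient returns a list with as many elements as there are inputs to the base operator.

Before analyzing this definition further, recall the derivative of the sigmoid function:

$$\frac{\partial \sigma}{\partial x} = \sigma(x)\,(1 - \sigma(x))$$

The definition above looks similar to the mathematical definition, but there is one important addition, described below.

The term orig * (ones_like(orig) - orig) directly matches the derivative, because orig here is the output of the sigmoid function. However, we are not only interested in how to compute the gradient of this one function. We want to compose this gradient with other gradients, so that we can accumulate the gradient across an entire program.

This is where the grad term comes in. In the expression grad * orig * (ones_like(orig) - orig), multiplying by grad specifies how to compose the derivative with the gradient accumulated so far.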
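The role of grad can be checked numerically with scalars. The sketch below is plain Python, not Relay: `sigmoid_grad` is a scalar analogue of the registered gradient, and the downstream function `loss` is a hypothetical example chosen so that the upstream gradient is a known constant. It compares the composed analytic gradient with a central finite difference.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(orig: float, grad: float) -> float:
    """Scalar analogue of the registered gradient.

    orig is the forward output sigmoid(x); grad is the upstream gradient.
    """
    return grad * orig * (1.0 - orig)


def loss(x: float) -> float:
    # A hypothetical downstream computation: L(x) = 3 * sigmoid(x).
    return 3.0 * sigmoid(x)


x = 0.7
upstream = 3.0  # dL/d(sigmoid(x)) for the loss above
analytic = sigmoid_grad(sigmoid(x), upstream)

# Central finite difference of the whole composed function.
eps = 1e-6
numeric = (loss(x + eps) - loss(x - eps)) / (2 * eps)

print(abs(analytic - numeric) < 1e-6)
```

Without the multiplication by grad, the analytic value would be the derivative of sigmoid alone and would differ from the derivative of the composed function by the factor 3.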

Now consider multiply, a slightly more interesting example:

```
@register_gradient("multiply")
def multiply_grad(orig, grad):
    """Returns [grad * y, grad * x]"""
    x, y = orig.args
    return [collapse_sum_like(grad * y, x),
            collapse_sum_like(grad * x, y)]
```

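In this example, the returned list has two elements, because multiply is a binary operator. Recall that if $f(x, y) = xy$, the partial derivatives are

$$\frac{\partial f}{\partial x} = y \quad \text{and} \quad \frac{\partial f}{\partial y} = x$$

There is one step required for multiply that is not required for sigmoid, because multiply has broadcasting semantics. Since the shape of grad might not match the shape of the inputs, we use collapse_sum_like to take the contents of the grad * <var> terms and make their shape match the shape of the input we are differentiating with respect to.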
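To see why the collapse is needed, consider a vector x of shape (3,) broadcast against a matrix y of shape (2, 3). The following plain-Python sketch handles only this special case with nested lists; `collapse_rows` is a hypothetical stand-in for what collapse_sum_like does here, not the Relay operator itself.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def collapse_rows(m: Matrix) -> List[float]:
    """Sum a (rows, cols) matrix over its row axis, giving shape (cols,).

    This is the special case of collapse_sum_like needed when a
    (cols,) vector was broadcast against a (rows, cols) matrix.
    """
    return [sum(col) for col in zip(*m)]


def multiply_grad(
    x: List[float], y: Matrix, grad: Matrix
) -> Tuple[List[float], Matrix]:
    """Gradients of out = x * y, with x broadcast across the rows of y."""
    # grad * y has shape (rows, cols); collapse it to the shape of x.
    grad_x = collapse_rows(
        [[g * yv for g, yv in zip(g_row, y_row)] for g_row, y_row in zip(grad, y)]
    )
    # grad * x already has the shape of y, so no collapsing is needed.
    grad_y = [[g * xv for g, xv in zip(g_row, x)] for g_row in grad]
    return grad_x, grad_y


x = [1.0, 2.0, 3.0]
y = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
grad = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

grad_x, grad_y = multiply_grad(x, y, grad)
print(grad_x)  # [11.0, 13.0, 15.0]
print(grad_y)  # [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
```

Each element of x contributes to every row of the output, so its gradient is the sum of the per-row contributions, which is exactly the reduction over the broadcast axis.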

Adding a Gradient in C++

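Adding a gradient in C++ is similar to adding one in Python, but the interface for registering it is slightly different.

First, make sure src/relay/transforms/pattern_utils.h is included. It provides helper functions for creating nodes in the Relay AST. Then define the gradient in a similar fashion to the Python example: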

```
tvm::Array<Expr> MultiplyGrad(const Expr& orig_call, const Expr& output_grad) {
    const Call& call = orig_call.Downcast<Call>();
    return { CollapseSumLike(Multiply(output_grad, call.args[1]), call.args[0]),
             CollapseSumLike(Multiply(output_grad, call.args[0]), call.args[1]) };
}
```

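Notice that in C++ we cannot use the operator overloading available in Python, and we need to downcast, so the implementation is more verbose. Even so, it is easy to verify that this definition mirrors the earlier Python example.

Now, instead of using a Python decorator, we need to tack a set_attr call for "FPrimalGradient" onto the end of the base operator's registration in order to register the gradient.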
```
RELAY_REGISTER_OP("multiply")
    // ...
    // Set other attributes
    // ...
    .set_attr<FPrimalGradient>("FPrimalGradient", MultiplyGrad);
```
Reference:

https://tvm.apache.org/docs/dev/relay_add_op.html

Original article: https://www.cnblogs.com/wujianming-110117/p/15256482.html
