zoukankan      html  css  js  c++  java
  • 如何在 CPU 上优化 GEMM

    如何在 CPU 上优化 GEMM

    (TL;DR) TVM 提供抽象接口,允许用户分别描述算法和算法的实施组织(所谓的调度)。通常,在高性能调度中编写算法,会破坏算法的可读性和模块化。尝试各种看似有前途的调度也很耗时。在 TVM 的帮助下,可以有效地尝试这些调度,提高性能。

    将演示如何使用 TVM 优化矩阵乘法,通过简单地添加 18 行额外代码,实现比基线快 200 倍。

    在 CPU 上执行的密集计算应用程序,有两个重要的优化:

    1. 提高内存访问的缓存命中率。复杂的数值计算和热点内存访问,都可以通过高缓存命中率加速。需要将原始内存访问模式,转换为适合缓存策略的模式。
    2. SIMD(单指令多数据),或者称向量处理单元。每次都会处理一小批数据,不是单个网格。需要统一模式,转换循环体中的数据访问模式, LLVM 后端可以降低为 SIMD。

    实际上,所有方法都是repo 中提到的一个子技巧 。一些已被 TVM 抽象自动应用,由于 TVM 的限制,一些不能简单地应用。

    下面所有实验结果,都是在配备 Intel i7-4770HQ CPU 的,15' MacBook 上实现的。对于所有 x86 CPU,缓存行大小应为 64 字节。

    准备和基线

    将演示如何使用 TVM 优化矩阵乘法。在实际演示前,先定义这些变量。然后编写一个基线实现,这是在 TVM 中编写矩阵乘法的最简单方法。

    import tvm
    import tvm.testing
    from tvm import te
    import numpy
    import timeit
     
    # The size of the matrix
    # (M, K) x (K, N)
    # You are free to try out different shapes, sometimes TVM optimization outperforms numpy with MKL.
    M = 1024
    K = 1024
    N = 1024
     
    # The default tensor type in tvm
    dtype = "float32"
     
    # using Intel AVX2(Advanced Vector Extensions) ISA for SIMD
    # To get the best performance, please change the following line
    # to llvm -mcpu=core-avx2, or specific type of CPU you use
    target = "llvm"
    dev = tvm.device(target, 0)
     
    # Random generated tensor for testing
    a = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), dev)
    b = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), dev)
     
    np_repeat = 100
    np_runing_time = timeit.timeit(
        setup="import numpy
    "
        "M = " + str(M) + "
    "
        "K = " + str(K) + "
    "
        "N = " + str(N) + "
    "
        'dtype = "float32"
    '
        "a = numpy.random.rand(M, K).astype(dtype)
    "
        "b = numpy.random.rand(K, N).astype(dtype)
    ",
        stmt="answer = numpy.dot(a, b)",
        number=np_repeat,
    )
    print("Numpy running time: %f" % (np_runing_time / np_repeat))
     
    answer = numpy.dot(a.numpy(), b.numpy())
     
    # Algorithm
    k = te.reduce_axis((0, K), "k")
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")
     
    # Default schedule
    s = te.create_schedule(C.op)
    func = tvm.build(s, [A, B, C], target=target, name="mmult")
    assert func
     
    c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
    func(a, b, c)
    tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
     
    evaluator = func.time_evaluator(func.entry_name, dev, number=1)
    print("Baseline: %f" % evaluator(a, b, c).mean)

    输出:

    Numpy running time: 0.009345
    Baseline: 3.291115

    在 TVM 中,检查较低级别的 IR,调试或优化调度。这是使用基线调度生成的 IR。

    print(tvm.lower(s, [A, B, C], simple_mode=True))

    输出:

    primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
                 A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
      buffer_map = {A_1: A, B_1: B, C_1: C} {
      for (m: int32, 0, 1024) {
        for (n: int32, 0, 1024) {
          C_2[((m*1024) + n)] = 0f32
          for (k: int32, 0, 1024) {
            C_2[((m*1024) + n)] = ((float32*)C_2[((m*1024) + n)] + ((float32*)A_2[((m*1024) + k)]*(float32*)B_2[((k*1024) + n)]))
          }
        }
      }
    }

    阻塞

    提高缓存命中率的一个重要技巧是阻塞——数据块将逐块计算。块内部的内存访问是一个具有高内存局部性的小邻域。选择了 32 作为分块因子。该块将填充 32 * 32 * sizeof(float) 即 4KB 的缓存,总大小为 32KB(L1 数据缓存)。

    bn = 32
    kfactor = 4
    s = te.create_schedule(C.op)
     
    # Blocking by loop tiling
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=kfactor)
     
    # Hoist reduction domain outside the blocking loop
    s[C].reorder(mo, no, ko, ki, mi, ni)
     
    func = tvm.build(s, [A, B, C], target=target, name="mmult")
    assert func
     
    c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
    func(a, b, c)
    tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
     
    # By simply tiling the loop 32x32, and hoisting ko, ki outside the blocking loops,
    # we can see big speedup compared with the baseline.
    evaluator = func.time_evaluator(func.entry_name, dev, number=10)
    print("Opt1: %f" % evaluator(a, b, c).mean)

    输出:

    Opt1: 0.310688

    这是阻塞后,生成的IR。

    print(tvm.lower(s, [A, B, C], simple_mode=True))

    输出:

    primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
                 A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
      buffer_map = {A_1: A, B_1: B, C_1: C} {
      for (m.outer: int32, 0, 32) {
        for (n.outer: int32, 0, 32) {
          for (m.inner.init: int32, 0, 32) {
            for (n.inner.init: int32, 0, 32) {
              C_2[((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)) + n.inner.init)] = 0f32
            }
          }
          for (k.outer: int32, 0, 256) {
            for (k.inner: int32, 0, 4) {
              for (m.inner: int32, 0, 32) {
                for (n.inner: int32, 0, 32) {
                  C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] = ((float32*)C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] + ((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)]*(float32*)B_2[((((k.outer*4096) + (k.inner*1024)) + (n.outer*32)) + n.inner)]))
                }
              }
            }
          }
        }
      }
    }

    向量化

    另一个重要的技巧是向量化。当内存访问模式一致时,编译器可以检测到这种模式,将连续内存传递给向量处理器。在 TVM 中,可以使用vectorize接口,提示编译器这种模式,这样就可以大大加速。

    选择向量化内循环行数据,这是缓存友好的。

    s = te.create_schedule(C.op)
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=kfactor)
     
    s[C].reorder(mo, no, ko, ki, mi, ni)
     
    # Vectorization
    s[C].vectorize(ni)
     
    func = tvm.build(s, [A, B, C], target=target, name="mmult")
    assert func
     
    c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
    func(a, b, c)
    tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
     
    evaluator = func.time_evaluator(func.entry_name, dev, number=10)
    print("Opt2: %f" % evaluator(a, b, c).mean)

    输出:

    Opt2: 0.341067

    向量化后,生成的 IR。

    print(tvm.lower(s, [A, B, C], simple_mode=True))

    输出:

    primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
                 A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
      buffer_map = {A_1: A, B_1: B, C_1: C} {
      for (m.outer: int32, 0, 32) {
        for (n.outer: int32, 0, 32) {
          for (m.inner.init: int32, 0, 32) {
            C_2[ramp((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)), 1, 32)] = broadcast(0f32, 32)
          }
          for (k.outer: int32, 0, 256) {
            for (k.inner: int32, 0, 4) {
              for (m.inner: int32, 0, 32) {
                C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)B_2[ramp((((k.outer*4096) + (k.inner*1024)) + (n.outer*32)), 1, 32)]))
              }
            }
          }
        }
      }
    }

    循环排列

    查看上面的 IR,可以看到 B 和 C 的内循环行数据,都进行了向量化。接下来,查看 A 的访问模式。在当前调度中,A 是逐列访问的,这对缓存不友好. 如果改变 ki 和内轴 mi 的嵌套循环顺序,A 矩阵的访问模式,对缓存更友好。

    s = te.create_schedule(C.op)
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=kfactor)
     
    # re-ordering
    s[C].reorder(mo, no, ko, mi, ki, ni)
    s[C].vectorize(ni)
     
    func = tvm.build(s, [A, B, C], target=target, name="mmult")
    assert func
     
    c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
    func(a, b, c)
    tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
     
    evaluator = func.time_evaluator(func.entry_name, dev, number=10)
    print("Opt3: %f" % evaluator(a, b, c).mean)

    输出:

    Opt3: 0.111449

    循环排列后,生成的 IR。

    print(tvm.lower(s, [A, B, C], simple_mode=True))

    输出:

    primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
                 A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
      buffer_map = {A_1: A, B_1: B, C_1: C} {
      for (m.outer: int32, 0, 32) {
        for (n.outer: int32, 0, 32) {
          for (m.inner.init: int32, 0, 32) {
            C_2[ramp((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)), 1, 32)] = broadcast(0f32, 32)
          }
          for (k.outer: int32, 0, 256) {
            for (m.inner: int32, 0, 32) {
              for (k.inner: int32, 0, 4) {
                C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)B_2[ramp((((k.outer*4096) + (k.inner*1024)) + (n.outer*32)), 1, 32)]))
              }
            }
          }
        }
      }
    }

    阵列封装

    另一个重要的技巧是数组打包。诀窍是对多维数组的存储,进行重新排序,展平存储在一维内存中后,按顺序访问。

    可以使用数组打包,解决 B 的访问模式。观察扁平化后 B 的数组访问模式,在 K 维度上迭代时,这不是连续的。可以用维度 [K][N] 重新排序 B,使其具有维度 [N/bn][K][bn],bn 是阻塞因子,也是内循环中 B 的向量大小。这种重新排序,将 N 分成两个维度 — bigN (N/bn) 和 littleN (bn) —新维度 [N/bn][K][bn] 匹配 B,从外循环到内循环的索引(no, ko, ki, ni) ,在展平后,导致 B 的顺序访问模式。

    # We have to re-write the algorithm slightly.
    packedB = te.compute(
        (N / bn, K, bn), lambda bigN, k, littleN: B[k, bigN * bn + littleN], name="packedB"
    )
    C = te.compute(
        (M, N),
        lambda m, n: te.sum(A[m, k] * packedB[n // bn, k, tvm.tir.indexmod(n, bn)], axis=k),
        name="C",
    )
     
    s = te.create_schedule(C.op)
     
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    (kaxis,) = s[C].op.reduce_axis
    ko, ki = s[C].split(kaxis, factor=kfactor)
     
    s[C].reorder(mo, no, ko, mi, ki, ni)
    s[C].vectorize(ni)
     
    bigN, _, littleN = s[packedB].op.axis
    s[packedB].vectorize(littleN)
    s[packedB].parallel(bigN)
     
    func = tvm.build(s, [A, B, C], target=target, name="mmult")
    assert func
     
    c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
    func(a, b, c)
    tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
     
    evaluator = func.time_evaluator(func.entry_name, dev, number=10)
    print("Opt4: %f" % evaluator(a, b, c).mean)

    输出:

    Opt4: 0.217310

    阵列打包后,生成的IR。

    print(tvm.lower(s, [A, B, C], simple_mode=True))

    输出:

    primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
                 A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
      buffer_map = {A_1: A, B_1: B, C_1: C} {
      allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global {
        for (bigN: int32, 0, 32) "parallel" {
          for (k: int32, 0, 1024) {
            packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
          }
        }
        for (m.outer: int32, 0, 32) {
          for (n.outer: int32, 0, 32) {
            for (m.inner.init: int32, 0, 32) {
              C_2[ramp((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)), 1, 32)] = broadcast(0f32, 32)
            }
            for (k.outer: int32, 0, 256) {
              for (m.inner: int32, 0, 32) {
                for (k.inner: int32, 0, 4) {
                  C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] = ((float32x32*)C_2[ramp((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.inner*1024)) + (k.outer*4)) + k.inner)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + (k.inner*32)), 1, 32)]))
                }
              }
            }
          }
        }
      }
    }

    块的写缓存

    阻塞后,程序将结果逐块写入C,访问模式不是顺序的。可以使用一个顺序缓存数组,保存块结果,在所有块结果准备好时,写入 C。

    s = te.create_schedule(C.op)
     
    # Allocate write cache
    CC = s.cache_write(C, "global")
     
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
     
    # Write cache is computed at no
    s[CC].compute_at(s[C], no)
     
    # New inner axes
    mc, nc = s[CC].op.axis
     
    (kaxis,) = s[CC].op.reduce_axis
    ko, ki = s[CC].split(kaxis, factor=kfactor)
    s[CC].reorder(ko, mc, ki, nc)
    s[CC].vectorize(nc)
     
    # TODO: Add separate optimization step to discuss loop unrolloing
    # unrolling is a loop optimization strategy which can reduce branch
    # prediction failures and increases the chance of concurrent execution
    # unroll kfactor loops
    s[CC].unroll(ki)
     
    bigN, _, littleN = s[packedB].op.axis
    s[packedB].vectorize(littleN)
    s[packedB].parallel(bigN)
     
    func = tvm.build(s, [A, B, C], target=target, name="mmult")
    assert func
     
    c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
    func(a, b, c)
    tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
     
    evaluator = func.time_evaluator(func.entry_name, dev, number=10)
    print("Opt5: %f" % evaluator(a, b, c).mean)

    输出:

    Opt5: 0.215912

    阻塞后,生成的IR。

    print(tvm.lower(s, [A, B, C], simple_mode=True))

    输出:

    primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
                 A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
      buffer_map = {A_1: A, B_1: B, C_1: C} {
      allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global;
      allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global {
        for (bigN: int32, 0, 32) "parallel" {
          for (k: int32, 0, 1024) {
            packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
          }
        }
        for (m.outer: int32, 0, 32) {
          for (n.outer: int32, 0, 32) {
            for (m.c.init: int32, 0, 32) {
              C.global[ramp((m.c.init*32), 1, 32)] = broadcast(0f32, 32)
            }
            for (k.outer: int32, 0, 256) {
              for (m.c: int32, 0, 32) {
                C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[(((m.outer*32768) + (m.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((n.outer*32768) + (k.outer*128)), 1, 32)]))
                C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 32), 1, 32)]))
                C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 64), 1, 32)]))
                C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 96), 1, 32)]))
              }
            }
            for (m.inner: int32, 0, 32) {
              for (n.inner: int32, 0, 32) {
                C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] = (float32*)C.global[((m.inner*32) + n.inner)]
              }
            }
          }
        }
      }
    }

    并行化

    可以利用多核处理器,进行线程级并行化。

    s = te.create_schedule(C.op)
     
    CC = s.cache_write(C, "global")
     
    mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
     
    s[CC].compute_at(s[C], no)
     
    mc, nc = s[CC].op.axis
     
    (kaxis,) = s[CC].op.reduce_axis
    ko, ki = s[CC].split(kaxis, factor=kfactor)
    s[CC].reorder(ko, mc, ki, nc)
    s[CC].vectorize(nc)
    s[CC].unroll(ki)
     
    # parallel
    s[C].parallel(mo)
     
    bigN, _, littleN = s[packedB].op.axis
    s[packedB].vectorize(littleN)
    s[packedB].parallel(bigN)
     
    func = tvm.build(s, [A, B, C], target=target, name="mmult")
    assert func
     
    c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev)
    func(a, b, c)
    tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5)
     
    evaluator = func.time_evaluator(func.entry_name, dev, number=50)
    opt6_time = evaluator(a, b, c).mean
    print("Opt6: %f" % opt6_time)

    输出:

    Opt6: 0.066558

    并行化后,生成的IR。

    print(tvm.lower(s, [A, B, C], simple_mode=True))

    输出:

    primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
      attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
      buffers = {C: Buffer(C_2: Pointer(float32), float32, [1024, 1024], []),
                 A: Buffer(A_2: Pointer(float32), float32, [1024, 1024], []),
                 B: Buffer(B_2: Pointer(float32), float32, [1024, 1024], [])}
      buffer_map = {A_1: A, B_1: B, C_1: C} {
      allocate(packedB: Pointer(global float32x32), float32x32, [32768]), storage_scope = global {
        for (bigN: int32, 0, 32) "parallel" {
          for (k: int32, 0, 1024) {
            packedB[ramp(((bigN*32768) + (k*32)), 1, 32)] = (float32x32*)B_2[ramp(((k*1024) + (bigN*32)), 1, 32)]
          }
        }
        for (m.outer: int32, 0, 32) "parallel" {
          allocate(C.global: Pointer(global float32), float32, [1024]), storage_scope = global;
          for (n.outer: int32, 0, 32) {
            for (m.c.init: int32, 0, 32) {
              C.global[ramp((m.c.init*32), 1, 32)] = broadcast(0f32, 32)
            }
            for (k.outer: int32, 0, 256) {
              for (m.c: int32, 0, 32) {
                C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[(((m.outer*32768) + (m.c*1024)) + (k.outer*4))], 32)*(float32x32*)packedB[ramp(((n.outer*32768) + (k.outer*128)), 1, 32)]))
                C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 1)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 32), 1, 32)]))
                C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 2)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 64), 1, 32)]))
                C.global[ramp((m.c*32), 1, 32)] = ((float32x32*)C.global[ramp((m.c*32), 1, 32)] + (broadcast((float32*)A_2[((((m.outer*32768) + (m.c*1024)) + (k.outer*4)) + 3)], 32)*(float32x32*)packedB[ramp((((n.outer*32768) + (k.outer*128)) + 96), 1, 32)]))
              }
            }
            for (m.inner: int32, 0, 32) {
              for (n.inner: int32, 0, 32) {
                C_2[((((m.outer*32768) + (m.inner*1024)) + (n.outer*32)) + n.inner)] = (float32*)C.global[((m.inner*32) + n.inner)]
              }
            }
          }
        }
      }
    }

    总结

    仅用 18 行代码,应用上述简单优化后,生成的代码,可以使用 MKL实现numpy性能的60% 。输出反映了非排他性 Docker 容器上的运行时间,是不可靠的。强烈建议自己运行,观察 TVM 实现的性能提升。

    参考链接:

    https://tvm.apache.org/docs/tutorials/optimize/opt_gemm.html

    人工智能芯片与自动驾驶
  • 相关阅读:
    大华解码器二次开发/C#调用C++DLL
    C# 获取网站页面的句柄
    C# 字节数组 字符数组 字符串 Byte[] Char[] String
    C# 结构体数组 C++ DLL
    Django
    Djano
    Django
    数据分析 02 -Pandas
    数据分析-01 Numpy
    02-正则和xpath
  • 原文地址:https://www.cnblogs.com/wujianming-110117/p/15367859.html
Copyright © 2011-2022 走看看