zoukankan      html  css  js  c++  java
  • Why for-range behave differently depending on the size of the element

    原文地址

    https://labs.yulrizka.com/en/why-for-range-behave-differently-depending-on-the-size-of-the-element/

    package main
    
    import "testing"
    
    const size = 1000000
    
    type SomeStruct struct {
        ID0 int64
        ID1 int64
        ID2 int64
        ID3 int64
        ID4 int64
        ID5 int64
        ID6 int64
        ID7 int64
        ID8 int64
    }
    
    func BenchmarkForVar(b *testing.B) {
        slice := make([]SomeStruct, size)
        b.ReportAllocs()
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            for _, s := range slice { // index and value
                _ = s
            }
        }
    }
    func BenchmarkForCounter(b *testing.B) {
        slice := make([]SomeStruct, size)
        b.ReportAllocs()
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            for i := range slice { // only use the index
                s := slice[i]
                _ = s
            }
        }
    }

    基准测试结果

    $ go test -bench .
    goos: linux
    goarch: amd64
    BenchmarkForVar-4               4363        269711 ns/op           0 B/op           0 allocs/op
    BenchmarkForCounter-4           4195        285952 ns/op           0 B/op           0 allocs/op
    PASS
    ok      _/test1    2.685s

    并没有太大差异,但是当我们稍微改一下SomeStruct结构

    type SomeStruct struct {
        ID0 int64
        ID1 int64
        ID2 int64
        ID3 int64
        ID4 int64
        ID5 int64
        ID6 int64
        ID7 int64
        ID8 int64
        ID9 int64  
    }

    再进行基准测试

    $ go test -bench .
    goos: linux
    goarch: amd64
    BenchmarkForVar-4                282       4264872 ns/op           0 B/op           0 allocs/op
    BenchmarkForCounter-4           4363        269761 ns/op           0 B/op           0 allocs/op
    PASS
    ok      _/test1    3.255s

    为什么?问题大概是出在range上,看下汇编。

    为了容易看汇编,我们搞一个main.go

    package main
    
    func main() {
        const size = 1000000
    
        slice := make([]SomeStruct, size)
        for _, s := range slice {
            _ = s
        }
    }
    go tool compile -S main.go type.go | grep -v FUNCDATA | grep -v PCDATA

    第一个版本的SomeStruct

    "".main STEXT size=93 args=0x0 locals=0x28
        ...
        0x0024 00036 (main_var.go:6)    MOVQ    AX, (SP)
        0x0028 00040 (main_var.go:6)    MOVQ    $1000000, 8(SP)
        0x0031 00049 (main_var.go:6)    MOVQ    $1000000, 16(SP)
        0x003a 00058 (main_var.go:6)    CALL    runtime.makeslice(SB)
        0x003f 00063 (main_var.go:6)    XORL    AX, AX       # set AX = 0
        0x0041 00065 (main_var.go:7)    INCQ    AX           # AX++
        0x0044 00068 (main_var.go:7)    CMPQ    AX, $1000000 # AX < 1000000
        0x004a 00074 (main_var.go:7)    JLT    65               # LOOP
        ...

    第二个版本

    0x0000 00000 (main_var.go:3)    TEXT    "".main(SB), ABIInternal, $120-0
        ...
        0x0044 00068 (main_var.go:6)    XORL    CX, CX # CX = 0
        0x0046 00070 (main_var.go:7)    JMP    76
        0x0048 00072 (main_var.go:7)    ADDQ    $80, AX
        0x004c 00076 (main_var.go:7)    PCDATA    $0, $2 # setup temporary variable autotmp_7
        0x004c 00076 (main_var.go:7)    LEAQ    ""..autotmp_7+32(SP), DI
        0x0051 00081 (main_var.go:7)    PCDATA    $0, $3
        0x0051 00081 (main_var.go:7)    MOVQ    AX, SI
        0x0054 00084 (main_var.go:7)    PCDATA    $0, $1
        0x0054 00084 (main_var.go:7)    DUFFCOPY    $826 # copy content of the struct
        0x0067 00103 (main_var.go:7)    INCQ    CX
        0x006a 00106 (main_var.go:7)    CMPQ    CX, $1000000
        0x0071 00113 (main_var.go:7)    JLT    72
        ...

    明显看到多了一个DUFFCOPY,其实压根都不用思考就知道是这边的问题了。

    当然也可以看下SSA

    GOSSAFUNC=main go tool compile -S main_var.go type_small.go

    对比两个ssa

    你说咋优化呢,附上俩

        for i := 0; i < b.N; i++ {
            for i := range slice { // only use the index
                s := slice[i]
                _ = s
            }
        }
            for i := 0; i < len(slice); i++ {
                s := slice[i]
                _ = s
            }

    一般情况我们不会单独测,都是性能测试,比如pprof trace这类

    When profiling my app, and run top, I see

    Showing top 10 nodes out of 31 (cum >= 0.12s)
          flat  flat%   sum%        cum   cum%
        13.93s 63.00% 63.00%     13.93s 63.00%  runtime.duffcopy







    end
    一个没有高级趣味的人。 email:hushui502@gmail.com
  • 相关阅读:
    How to Compile Java DBus
    BZOJ 2783 JLOI 2012 树 乘+二分法
    Robotium原则的实施源代码分析
    基本的负载均衡算法
    人大、上财、复旦、上交四校2013年应届金融硕士就业去向
    2014届上财金融硕士就业情况
    三跨),总分420+
    复旦金融专硕和上财金融专硕
    一个三跨考生三战上海财经大学金融硕士的考研经验
    董某某
  • 原文地址:https://www.cnblogs.com/CherryTab/p/12777727.html
Copyright © 2011-2022 走看看