  • Erlang ets -- something about cache (continued)

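    Last time I covered the basic approach to implementing a simple cache: http://www.cnblogs.com/--00/p/erlang_ets_something_about_cache.html . At the end of that post I mentioned determining how much memory a single record takes. This time I continue with the memory usage of Erlang terms.

    The Erlang efficiency_guide states fairly clearly how much memory each data type consumes in the Erlang system. A few entries, quoted here: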

    Small integer 1 word
    On 32-bit architectures: -134217729 < i < 134217728 (28 bits)
    On 64-bit architectures: -576460752303423489 < i < 576460752303423488 (60 bits)
    List 1 word + 1 word per element + the size of each element
    Atom 1 word. Note: an atom refers into an atom table which also consumes memory. The atom text is stored once for each unique atom in this table. The atom table is not garbage-collected.
    String (is the same as a list of integers) 1 word + 2 words per character

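    From the documentation: a small integer takes 1 word, an atom takes 1 word, and the size of a list depends mainly on the number of elements and the size of each element. Note that the unit is the machine word (4 bytes on 32-bit, 8 bytes on 64-bit), not the byte.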


    Memory used by ["123", "234"]: 1 + (1 + (1 + 2 * 3)) + (1 + (1 + 2 * 3)) = 17, that is, 17 words.
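
    The calculation above can be written out as a small sketch. The module and function names below are hypothetical, and the code only restates the efficiency guide formulas quoted earlier (it does not measure anything in the VM). The conversion to bytes assumes erlang:system_info(wordsize) for the word size.

```erlang
-module(list_words_sketch).
-export([string_words/1, list_of_strings_words/1, words_to_bytes/1]).

%% String: 1 word + 2 words per character (efficiency guide formula).
string_words(Str) when is_list(Str) ->
    1 + 2 * length(Str).

%% List: 1 word + 1 word per element + the size of each element.
%% Here every element is assumed to be a string.
list_of_strings_words(Strs) when is_list(Strs) ->
    1 + lists:sum([1 + string_words(S) || S <- Strs]).

%% Words are not bytes: multiply by the word size of the running VM
%% (4 on 32-bit, 8 on 64-bit).
words_to_bytes(Words) ->
    Words * erlang:system_info(wordsize).
```

    For example, list_words_sketch:list_of_strings_words(["123", "234"]) gives 17, matching the hand calculation, and words_to_bytes(17) turns that into bytes for the current VM.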


      Note that an atom takes only 1 word in the Erlang system, which helps a lot with messages.



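    Now that we know the rules for how much memory each kind of term takes, how do we compute it quickly? Is there a ready-made API function, so we don't have to work it out by hand every time?

    Let's first look at the various size functions the Erlang system provides: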

    • Applicable to all terms: erlang:external_size/1, erts_debug:size/1, erts_debug:flat_size/1

    • Applicable to binaries: erlang:size/1, erlang:byte_size/1, erlang:bit_size/1

    • Applicable to tuples: erlang:size/1, erlang:tuple_size/1
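
    A minimal sketch of calling the functions listed above, mainly to show that they do not share a unit: the binary and tuple functions count bytes, bits or elements, erlang:external_size/1 reports bytes of the external term format, and the two erts_debug functions report heap words. The module name is hypothetical; only the values marked in comments are ones I am sure of.

```erlang
-module(size_funs_sketch).
-export([demo/0]).

%% Returns a list of {Label, Value} pairs for one binary and one tuple.
demo() ->
    Bin = <<"1234567890">>,
    Tuple = {a, b, c},
    [{byte_size, byte_size(Bin)},          % 10 bytes
     {bit_size, bit_size(Bin)},            % 80 bits
     {size_of_binary, size(Bin)},          % 10, same as byte_size here
     {tuple_size, tuple_size(Tuple)},      % 3 elements
     {size_of_tuple, size(Tuple)},         % 3, same as tuple_size
     %% Bytes needed by the external term format, not heap words.
     {external_size, erlang:external_size(Tuple)},
     %% Undocumented; both report heap words.
     {flat_size, erts_debug:flat_size(Tuple)},
     {size, erts_debug:size(Tuple)}].
```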

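    Of these, the two erts_debug functions matter most:

    erts_debug:size/1 and erts_debug:flat_size/1 are both undocumented functions that compute how much memory an Erlang term needs. The memory used by each kind of term is listed here: http://www.erlang.org/doc/efficiency_guide/advanced.html#id68912 . The difference between the two shows up in data structures with shared subterms: erts_debug:size/1 counts shared data only once, while erts_debug:flat_size/1 counts it every time it appears.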


    %% size(Term)
    %%  Returns the size of Term in actual heap words. Shared subterms are
    %%  counted once.  Example: If A = [a,b], B =[A,A] then size(B) returns 8,
    %%  while flat_size(B) returns 12.

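    The documentation has another example: http://www.erlang.org/doc/efficiency_guide/processes.html

    In short, erts_debug:size/1 is the space an Erlang term occupies in memory, while erts_debug:flat_size/1 is the amount of data that has to be copied when the term moves between processes on the same node (ETS operations included).

    OK, let's start with a simple test: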
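
    The A = [a,b], B = [A,A] example from the source comment can be tried with a sketch like the one below. The module name is hypothetical. B is built at run time from an argument, on the assumption that this keeps both elements pointing at the same A (a list written out as a literal might not preserve the sharing).

```erlang
-module(shared_size_sketch).
-export([sizes/1]).

%% Builds B = [A, A] so that A is shared, then returns
%% {SizeWords, FlatSizeWords} for B.
sizes(A) ->
    B = [A, A],
    {erts_debug:size(B), erts_debug:flat_size(B)}.
```

    With A = [a, b], shared_size_sketch:sizes([a, b]) should return {8, 12}, the numbers given in the source comment: A is counted once by size/1 and twice by flat_size/1.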

    $ cat test_for_ets_record_flat_size.erl
    -module(test_for_ets_record_flat_size).

    -compile(export_all).

    start() ->
        A = ets:new(a, [named_table, public]),
        %% One large tuple: 100 empty tuples, 10 pids, 1000 {binary, tuple} pairs.
        D = {[{} || _ <- lists:seq(1, 100)], [self() || _ <- lists:seq(1, 10)], [{<<"1234567890">>, {}} || _ <- lists:seq(1, 1000)]},
        io:format(" ** data words size ~p~n", [erts_debug:flat_size(D)]),
        %% ets:info(Tab, memory) reports words, so the difference between the
        %% two readings is comparable with flat_size.
        io:format(" ** before insert ~p~n", [ets:info(A, memory)]),
        ets:insert(A, D),
        io:format(" ** after insert ~p~n", [ets:info(A, memory)]).


    $ erl
    Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

    Eshell V6.3  (abort with ^G)
    1> test_for_ets_record_flat_size:start().
     ** data words size 10324
     ** before insert 305
     ** after insert 10633
    ok

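    Judging from the result, the table grew by 10633 - 305 = 10328 words, so flat_size is off by 4 words (erts_debug:size/1 would be off by 8095 words). Why?

    erts_debug:size/1 counts shared data only once, while erts_debug:flat_size/1 counts it every time. But why is erts_debug:flat_size still off by 4 words?

    With that question in mind I searched Google for "erlang erts_debug flat_size" and found this on the Erlang mailing list: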

    After looking at this more I have realized the documentation of the memory information is correct as would be expected.  Sorry for the noise about this.  Some comment that talks about erts_debug:flat_size/1 (and erts_debug:size/1) providing the process heap size only, with an additional 1 word excluded for the register or stack storage of the top-level term would help make things clearer.  This may be straight-forward for some since it makes logical sense, but I didn't know about these internal details and I wanted to be sure I was looking at the size correctly.

    It says that erts_debug:flat_size provides ONLY the size on the process heap.

    Going back to the source code:

    Returns the size of Term in actual heap words.


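    At this point one thing is clear: erts_debug:flat_size/1 only computes the memory an Erlang term occupies on the process heap; it does not account for all of the memory the term uses. Following the Erlang mailing list thread above, I then found this open source project on GitHub: erlang_term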


    byte_size_term(Term, WordSize) ->
        DataSize = if
            is_binary(Term) ->
                BinarySize = erlang:byte_size(Term),
                if
                    BinarySize > 64 ->
                        BinarySize;
                    true ->
                        % in the heap size
                        0
                end;
            true ->
                0
        end,
        % stack/register size + heap size + data size
        (1 + erts_debug:flat_size(Term)) * WordSize + DataSize.

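    As the code above shows, the memory used by an Erlang term should be the sum of three parts: the process heap usage (computed with erts_debug:flat_size/1), the 1 word of stack/register storage for the top-level term, and the shared memory usage (the data of binaries larger than 64 bytes, which is stored outside the process heap).

    Good, on to more test code: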
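
    The same sum can be sketched without passing the word size in by hand. This is not the erlang_term implementation, just a minimal variant under two assumptions: the word size comes from erlang:system_info(wordsize), and, like the excerpt above, only a top-level binary larger than 64 bytes is treated as off-heap data. The module and function names are hypothetical.

```erlang
-module(term_bytes_sketch).
-export([term_bytes/1]).

%% Approximate size of Term in bytes:
%% (1 stack/register word + heap words) * word size + off-heap binary data.
term_bytes(Term) ->
    WordSize = erlang:system_info(wordsize),
    OffHeap = case is_binary(Term) andalso byte_size(Term) > 64 of
                  true -> byte_size(Term);   % binary data kept off the heap
                  false -> 0                 % already counted in the heap words
              end,
    (1 + erts_debug:flat_size(Term)) * WordSize + OffHeap.
```

    To compare such a byte count with ets:info(Tab, memory), which reports words, divide by erlang:system_info(wordsize) rather than by a fixed 8, so the comparison also holds on a 32-bit VM.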

    $ cat test_for_ets_record.erl
    -module(test_for_ets_record).

    -compile(export_all).

    start() ->
        A = ets:new(a, [named_table, public]),
        D = {[{} || _ <- lists:seq(1, 100)], [self() || _ <- lists:seq(1, 10)], [{<<"1234567890">>, {}} || _ <- lists:seq(1, 1000)]},
        %% erlang_term:byte_size/1 returns bytes; dividing by 8 converts to
        %% words on the 64-bit VM used here.
        io:format(" ** data words size ~p~n", [erlang_term:byte_size(D)/8]),
        io:format(" ** before insert ~p~n", [ets:info(A, memory)]),
        ets:insert(A, D),
        io:format(" ** after insert ~p~n", [ets:info(A, memory)]).


    $ erl -pa ./ebin -pa ./
    Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

    Eshell V6.3  (abort with ^G)
    1> test_for_ets_record:start().
     ** data words size 10325.0
     ** before insert 305
     ** after insert 10633
    ok

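    Why? Why is it still off by 3 words (10328 words added to the table versus 10325 computed)? All I could do was open an issue and ask the author.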

    So, the erlang_term module can help you manage caching, but the real situation in the Erlang VM with the many memory pools is more complex.

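    No way around it: to find out where those 3 words go inside the ets table, I need a detailed understanding of how ets manages memory. I'll have to shelve this for now (to be continued).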
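
    For anyone who wants to keep poking at the gap, the measurement used in the tests can be wrapped in a small helper. This is only a sketch with hypothetical names: it inserts one tuple into a fresh, unnamed ets table and returns the flat_size of the tuple next to the growth reported by ets:info(Tab, memory), both in words. It does not explain the difference.

```erlang
-module(ets_gap_sketch).
-export([gap/1]).

%% Returns {FlatSizeWords, EtsDeltaWords} for a tuple with at least one
%% element (ets uses the first element as the key by default).
gap(Term) when is_tuple(Term), tuple_size(Term) >= 1 ->
    Tab = ets:new(gap_probe, [public]),
    Before = ets:info(Tab, memory),
    true = ets:insert(Tab, Term),
    After = ets:info(Tab, memory),
    true = ets:delete(Tab),
    {erts_debug:flat_size(Term), After - Before}.
```

    Running gap/1 over terms of different shapes (small tuples, nested lists, binaries above and below 64 bytes) would show whether the extra words are a fixed per-object overhead or depend on the term.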


    1. erts_debug:flat_size/1 only counts the size of an Erlang term on the process heap.

    2. erlang_term is amazing.

