  • [Repost] Google cluster data analysis: clusterdata-2011-2

    Original article:

    https://www.twblogs.net/a/5c2dc304bd9eee35b21c418b/zh-cn

     ------------------------------------------------------------------------------------------------

    This post walks through the clusterdata-2011-2 dataset.

    Source: https://github.com/google/cluster-data

    Dataset documentation: https://drive.google.com/file/d/0B5g07T_gRDg9Z0lsSTEtTWtpOW8/view

    Dataset description: The clusterdata-2011-2 trace represents 29 days' worth of cell information from May 2011, on a cluster of about 12.5k machines.

    After importing the CSV files into MySQL, the tables look like this (the table structures are at the end):

    job events table:

    rows: 1672923    286.86 MB (300,795,688)

    index: jobid (btree)    35.56 MB (37,285,888)

    machine events table:

    rows: 37780    2.99 MB (3,138,540)

    machine attributes table:

    rows: 10748566    1.09 GB (1,175,642,124)

    task constraints table:

    rows: 28485619    2.95 GB (3,163,127,240)

    task usage table:

    rows: 1232799308    182.55 GB (196,015,089,972)

    index: 69.61 GB (74,743,799,808)

    machineid (btree), jobid (btree)

    task events table: (the import ran into some problems, which are still being fixed)

    rows: 144648292    12.76 GB (13,700,652,148)

    index: 6.90 GB (7,414,187,008)

    machineid, jobid, username

    Explanation, part 1: fields

    Explanation, part 2: tables

    Part 1. Fields

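    A job consists of multiple tasks. Each task represents a Linux program, which may consist of several processes.

    timestamp: in microseconds, counted from 600 s before the start of the trace (so an event 20 s into the trace has a timestamp of 620 s).

                       A timestamp of 0 marks an event that happened before the trace began, since a job may have been submitted before logging started.

                       A timestamp of 2^63 - 1 marks an event that happened after the trace ended.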
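    The timestamp rules above can be sketched as a small conversion helper. This is only an illustration: the function and constant names are my own, not part of the dataset tooling.

```python
from typing import Optional

# Assumed names, for illustration only.
TRACE_OFFSET_US = 600 * 1_000_000  # timestamps start counting 600 s before the trace begins
BEFORE_TRACE = 0                   # sentinel: event happened before the trace began
AFTER_TRACE = 2**63 - 1            # sentinel: event happened after the trace ended


def seconds_into_trace(timestamp_us: int) -> Optional[float]:
    """Convert a trace timestamp (microseconds) to seconds since the trace began.

    Returns None for the two sentinel values, which carry no real time.
    """
    if timestamp_us in (BEFORE_TRACE, AFTER_TRACE):
        return None
    return (timestamp_us - TRACE_OFFSET_US) / 1_000_000


print(seconds_into_trace(620_000_000))  # 20.0
```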

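    Job and machine IDs are not reused, so they can be treated as unique identifiers. (A machine ID may show up again when a machine is removed from the cluster and later added back; a job ID may show up again when a job is stopped, reconfigured and restarted.)

    User and job names are hashed, for confidentiality; the hashed values can still be tested for equality.

    machine event type: 0 = add, 1 = remove, 2 = update

    job and task event type: 0 = submit, 1 = schedule, 2 = evict, 3 = fail, 4 = kill, 5 = finish, 6 = lost, 7 = update_pending, 8 = update_running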
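    When querying, it is easy to mix these codes up, so one option is to give them names. A minimal sketch using Python's IntEnum; the class and function names are my own choice, only the numeric codes come from the list above.

```python
from enum import IntEnum


class MachineEvent(IntEnum):
    """Machine event type codes."""
    ADD = 0
    REMOVE = 1
    UPDATE = 2


class TaskEvent(IntEnum):
    """Job and task event type codes (both tables share them)."""
    SUBMIT = 0
    SCHEDULE = 1
    EVICT = 2
    FAIL = 3
    KILL = 4
    FINISH = 5
    LOST = 6
    UPDATE_PENDING = 7
    UPDATE_RUNNING = 8


def decode_event(code: int) -> str:
    """Return the readable name for a job/task event code."""
    return TaskEvent(code).name


print(decode_event(5))  # FINISH
```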

    priority: 0 is the lowest.

    infrastructure (11): this is the highest (most entitled to get resources) priority in the trace and accounts for most of the recorded disk I/O, so we speculate it includes some storage services;
    monitoring (10)
    normal production (9): this is the lowest (and most occupied) of the priorities labeled ‘production’. The trace providers indicate that jobs at this priority and higher which are latency-sensitive should not be “evicted due to over-allocation of machine resources”.
    other (2-8): we speculate that these priorities are dominated by batch jobs;
    gratis (free) (0-1): the trace providers indicate that resources used by tasks at these priorities are generally not charged.
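    The bands above translate directly into a lookup, which is handy when grouping tasks by priority. A sketch only; the function name is hypothetical, and the band labels are the ones quoted above.

```python
def priority_band(priority: int) -> str:
    """Map a numeric priority (0-11) to the band names listed above."""
    if priority in (0, 1):
        return "gratis"
    if 2 <= priority <= 8:
        return "other"
    if priority == 9:
        return "normal production"
    if priority == 10:
        return "monitoring"
    if priority == 11:
        return "infrastructure"
    raise ValueError(f"priority out of range: {priority}")


print(priority_band(4))  # other
```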
     

    missing info: NULL for a normal record; 0-2 when the record was synthesized to stand in for missing data.

    0. SNAPSHOT_BUT_NO_TRANSITION: we did not find a record representing the given event, but a later snapshot of the job or task state indicated that the transition must have occurred. The timestamp of the synthesized event is the timestamp of the snapshot.

    1. NO_SNAPSHOT_OR_TRANSITION: we did not find a record representing the given termination event, but the job or task disappeared from later snapshots of cluster states, so it must have been terminated. The timestamp of the synthesized event is a pessimistic upper bound on its actual termination time assuming it could have legitimately been missing from one snapshot.

    2. EXISTS_BUT_NO_CREATION: we did not find a record representing the creation of the given task or job. In this case, we may be missing metadata (job name, resource requests, etc.) about the job or task and we may have placed SCHEDULE or SUBMIT events later than they actually are.

     
     
     

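    scheduling class: roughly indicates how latency-sensitive a job is. It is a single number: 3 means a more latency-sensitive job, and 0 means a non-production task (for example, non-business-critical analyses).

    comparison operator:

    At first I did not understand how the comparison works.

    Less than (2), greater than (3): the machine attribute is interpreted as an integer (or 0 if the attribute is absent) and compared with the supplied attribute value. These comparisons are strictly less than and strictly greater than. Equal (0), not equal (1): the machine attribute is interpreted as a string (or the empty string if it is absent) and compared with the supplied attribute value. (Translated from the documentation.)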
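    To make the comparison rule concrete, here is a minimal sketch of how one constraint could be checked against one machine attribute. The function name and argument layout are my own assumptions, and it assumes the values in a less-than or greater-than constraint parse as integers.

```python
from typing import Optional

EQUAL, NOT_EQUAL, LESS_THAN, GREATER_THAN = 0, 1, 2, 3


def constraint_satisfied(op: int, machine_value: Optional[str],
                         constraint_value: str) -> bool:
    """Check one constraint; machine_value is None when the attribute is absent."""
    if op in (LESS_THAN, GREATER_THAN):
        # Integer comparison; an absent attribute counts as 0.
        have = int(machine_value) if machine_value is not None else 0
        want = int(constraint_value)
        return have < want if op == LESS_THAN else have > want
    if op in (EQUAL, NOT_EQUAL):
        # String comparison; an absent attribute counts as the empty string.
        have = machine_value if machine_value is not None else ""
        equal = have == constraint_value
        return equal if op == EQUAL else not equal
    raise ValueError(f"unknown comparison operator: {op}")


print(constraint_satisfied(LESS_THAN, "3", "5"))  # True
print(constraint_satisfied(LESS_THAN, "5", "5"))  # False (strict comparison)
```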

    Part 2. Tables

    Tables:

    1.Machine events
    Each machine is described by one or more records in the machine event table. The majority of records describe machines that existed at the start of the trace.
    1. timestamp
    2. machine ID
    3. event type
    4. platform ID
    5. capacity: CPU
    6. capacity: memory
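    The six columns above can be read straight from a CSV row. A sketch under assumptions: the sample rows are made up for illustration, the type and function names are hypothetical, and it assumes capacity fields may be empty.

```python
import csv
import io
from typing import List, NamedTuple, Optional


class MachineEventRow(NamedTuple):
    timestamp: int
    machine_id: int
    event_type: int
    platform_id: str
    cpu: Optional[float]     # capacity: CPU (None when the field is empty)
    memory: Optional[float]  # capacity: memory (None when the field is empty)


def parse_machine_events(text: str) -> List[MachineEventRow]:
    """Parse header-less CSV text in the column order listed above."""
    rows = []
    for ts, mid, etype, platform, cpu, mem in csv.reader(io.StringIO(text)):
        rows.append(MachineEventRow(
            int(ts), int(mid), int(etype), platform,
            float(cpu) if cpu else None,
            float(mem) if mem else None,
        ))
    return rows


# Made-up sample data, not taken from the real trace.
sample = "0,5,0,platformA,0.5,0.25\n1000,6,1,platformA,,\n"
for row in parse_machine_events(sample):
    print(row)
```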

    2. Job events and task events

    The two event tables describe jobs/tasks and their lifecycles. The constraints table describes task placement constraints that restrict the machines onto which tasks can schedule.

    The simplest case is shown by the top path in the state-transition diagram of the dataset documentation: a job is SUBMITted and gets put into a pending queue; soon afterwards, it is SCHEDULEd onto a machine and starts running; some time later it FINISHes successfully.

    That is: submit first (0), then get scheduled (1), then finish (5).
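    A quick way to find tasks that followed this simplest path is to sort a task's events by time and compare the event types. A sketch only; the function name and the (timestamp, event type) tuple layout are my own assumptions.

```python
from typing import Iterable, Tuple

SUBMIT, SCHEDULE, FINISH = 0, 1, 5


def follows_simple_path(events: Iterable[Tuple[int, int]]) -> bool:
    """events: (timestamp, event_type) pairs belonging to a single task.

    True only if the task was submitted, scheduled once, and then finished.
    """
    ordered = [event_type for _, event_type in sorted(events)]
    return ordered == [SUBMIT, SCHEDULE, FINISH]


# Timestamps here are made up for illustration.
print(follows_simple_path([(700, 1), (650, 0), (900, 5)]))  # True
print(follows_simple_path([(650, 0), (700, 1), (900, 4)]))  # False (killed, not finished)
```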

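    3. Task usage

    This blog post explains it in detail: https://blog.csdn.net/yangss123/article/details/78298749

    The intermediate tables I generated are:

    The machine IDs contained in each platform, all medium-priority tasks (priority 2-8), and all tasks that were successfully scheduled (event type 1), each with the corresponding indexes. (With these intermediate tables, query time dropped from several hours to under 1 min.)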
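    The intermediate-table idea can be sketched as follows. This is not the original MySQL setup: it uses Python's built-in sqlite3 with an in-memory database so the example runs anywhere, and every table name, column name and row is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified source tables with made-up rows.
conn.execute("CREATE TABLE machine_events (machine_id INTEGER, platform_id TEXT)")
conn.executemany("INSERT INTO machine_events VALUES (?, ?)",
                 [(10, "A"), (10, "A"), (11, "B")])
conn.execute("CREATE TABLE task_events (ts INTEGER, job_id INTEGER, "
             "machine_id INTEGER, event_type INTEGER, priority INTEGER)")
conn.executemany("INSERT INTO task_events VALUES (?, ?, ?, ?, ?)", [
    (600_000_000, 1, None, 0, 4),  # submit, priority 4
    (601_000_000, 1, 10, 1, 4),    # schedule, priority 4
    (602_000_000, 2, 11, 1, 9),    # schedule, priority 9
    (603_000_000, 3, None, 0, 0),  # submit, priority 0
])

# Intermediate table 1: machine IDs per platform.
conn.execute("CREATE TABLE platform_machines AS "
             "SELECT DISTINCT platform_id, machine_id FROM machine_events")
conn.execute("CREATE INDEX idx_pm_platform ON platform_machines (platform_id)")

# Intermediate table 2: medium-priority tasks (priority 2-8).
conn.execute("CREATE TABLE mid_priority_tasks AS "
             "SELECT * FROM task_events WHERE priority BETWEEN 2 AND 8")
conn.execute("CREATE INDEX idx_mid_job ON mid_priority_tasks (job_id)")

# Intermediate table 3: scheduled tasks (event type 1).
conn.execute("CREATE TABLE scheduled_tasks AS "
             "SELECT * FROM task_events WHERE event_type = 1")
conn.execute("CREATE INDEX idx_sched_machine ON scheduled_tasks (machine_id)")

platform_count = conn.execute("SELECT COUNT(*) FROM platform_machines").fetchone()[0]
mid_count = conn.execute("SELECT COUNT(*) FROM mid_priority_tasks").fetchone()[0]
sched_count = conn.execute("SELECT COUNT(*) FROM scheduled_tasks").fetchone()[0]
print(platform_count, mid_count, sched_count)
```

    Later queries then read the small pre-filtered tables through their indexes instead of scanning the full event tables.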

    ----------------------------------------------------------------------------------

  • Original URL: https://www.cnblogs.com/devilmaycry812839668/p/10898161.html