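Before introducing the Ozone insight command, let's first look at what "Insight" actually refers to in Ozone. It covers three kinds of information:
- (Real-time) logs of key services
- Metrics of key services
- Configuration of key services
I described the implementation principles in an earlier article, "How to Improve the Observability of Distributed Systems: Introducing the Insight Tool". Interested readers can find the implementation details there, so I won't repeat them here.
First, the help output of the command: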
[hdfs@lyq bin]$ ./ozone insight -help
Unknown option: -elp (while processing option: '-help')
Usage: ozone insight [-hV] [--verbose] [-conf=<configurationPath>]
[-D=<String=String>]... [COMMAND]
Show debug information about a selected Ozone component
--verbose More verbose output. Show the stack trace of the errors.
-D, --set=<String=String>
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
list Show available insight points.
log, logs Show log4j events related to the insight point
metrics, metric Show available metrics.
config Show configuration for a specific subcomponents
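The help output lists four subcommands, two of them with aliases. As a minimal sketch, assuming you want to drive the tool from a script, the helper below assembles the argument list for a call. The function name `build_insight_command` and the `ozone_bin` default are hypothetical and not part of Ozone; only the subcommand names come from the help text above.

```python
import shlex

# Subcommands and aliases exactly as listed in the help output above.
SUBCOMMANDS = {"list", "log", "logs", "metrics", "metric", "config"}


def build_insight_command(subcommand, insight_point=None, ozone_bin="./ozone"):
    """Return the argv list for an `ozone insight` call (hypothetical helper)."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown insight subcommand: {subcommand}")
    if subcommand == "list":
        # `list` is shown without an insight point argument.
        return [ozone_bin, "insight", "list"]
    if not insight_point:
        raise ValueError(f"'{subcommand}' needs an insight point")
    return [ozone_bin, "insight", subcommand, insight_point]


# Print the command line instead of running it, so the sketch has no side effects.
print(shlex.join(build_insight_command("log", "scm.node-manager")))
```

The returned list can be passed to `subprocess.run` on a machine where the `ozone` script is available.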
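Before using the individual commands, we need to know which insight points are currently available. An insight point is a key service point, such as a key service thread or a key protocol operation.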
[hdfs@lyq bin]$ ./ozone insight list
Available insight points:
scm.node-manager SCM Datanode management related information.
scm.replica-manager SCM closed container replication manager
scm.event-queue Information about the internal async event delivery
scm.protocol.block-location SCM Block location protocol endpoint
om.key-manager OM Key Manager
om.protocol.client Ozone Manager RPC endpoint
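To work with this list programmatically, the output can be split into a name and a description per line. The sketch below assumes the format shown above (the first whitespace-separated token is the insight point name and the rest is its description); `parse_insight_points` is a hypothetical helper, not an Ozone API.

```python
SAMPLE = """\
Available insight points:
scm.node-manager SCM Datanode management related information.
scm.replica-manager SCM closed container replication manager
scm.event-queue Information about the internal async event delivery
scm.protocol.block-location SCM Block location protocol endpoint
om.key-manager OM Key Manager
om.protocol.client Ozone Manager RPC endpoint
"""


def parse_insight_points(text):
    """Map insight point name -> description (assumes the layout shown above)."""
    points = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Available insight points"):
            continue  # skip blank lines and the header
        parts = line.split(None, 1)
        points[parts[0]] = parts[1] if len(parts) > 1 else ""
    return points


points = parse_insight_points(SAMPLE)

# Group the points by service prefix (the part before the first dot).
by_service = {}
for name in points:
    by_service.setdefault(name.split(".")[0], []).append(name)

for service, names in by_service.items():
    print(service, names)
```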
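As the list shows, insight points are already quite fine-grained.
Let's now try the three subcommands in turn, starting with log. The log command captures, in real time, the output of the logger classes associated with the target insight point. Here is the log output for the scm.node-manager point: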
[hdfs@lyq apache]$ ozone/bin/ozone insight log scm.node-manager
[SCM] 2019-12-13 21:04:46,966 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=lyq-xxx.com]
[SCM] 2019-12-13 21:05:14,998 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=lyq-xxx.com]
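Each log line carries a component tag, a timestamp, and a bracketed group with the level and logger class. The sketch below parses the two lines above with a regular expression inferred only from that sample. The third field inside the brackets is not explained in the source, so it is stored under the neutral name `tag`; `parse_log_line` is a hypothetical helper.

```python
import re
from datetime import datetime

# Pattern inferred from the sample lines above; other Ozone versions may differ.
LOG_LINE = re.compile(
    r"^\[(?P<component>[^\]]+)\] "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<level>\w+)\|(?P<logger>[^|]+)\|(?P<tag>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    return record


lines = [
    "[SCM] 2019-12-13 21:04:46,966 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=lyq-xxx.com]",
    "[SCM] 2019-12-13 21:05:14,998 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=lyq-xxx.com]",
]
records = [parse_log_line(line) for line in lines]

# Time elapsed between the two node reports in the sample.
interval = records[1]["timestamp"] - records[0]["timestamp"]
print(records[0]["level"], records[0]["logger"], interval)
```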
Next comes metrics. These are essentially the same metrics we normally get from the JMX page, except that here they are regrouped by insight point.
[hdfs@lyq apache]$ ozone/bin/ozone insight metric om.protocol.client
Metrics for `om.protocol.client` (Ozone Manager RPC endpoint)
RPC connections
Open connections: 0
Dropped connections: 0
Received bytes: 2037
Sent bytes: 1760
RPC queue
RPC average queue time: 0.5
RPC call queue length: 0
RPC performance
RPC processing time average: 8.0
Number of slow calls: 0
Message type counters
Number of CreateVolume: 1
Number of SetVolumeProperty: 0
Number of CheckVolumeAccess: 0
Number of InfoVolume: 2
Number of DeleteVolume: 0
Number of ListVolume: 0
Number of CreateBucket: 0
Number of InfoBucket: 0
Number of SetBucketProperty: 0
Number of DeleteBucket: 0
Number of ListBuckets: 0
Number of CreateKey: 0
Number of LookupKey: 0
Number of RenameKey: 0
Number of DeleteKey: 0
Number of ListKeys: 0
Number of CommitKey: 0
Number of AllocateBlock: 0
Number of CreateS3Bucket: 0
Number of DeleteS3Bucket: 0
Number of InfoS3Bucket: 0
Number of ListS3Buckets: 0
Number of InitiateMultiPartUpload: 0
Number of CommitMultiPartUpload: 0
Number of CompleteMultiPartUpload: 0
Number of AbortMultiPartUpload: 0
Number of GetS3Secret: 0
Number of ListMultiPartUploadParts: 0
Number of ServiceList: 4
Number of DBUpdates: 0
Number of GetDelegationToken: 0
Number of RenewDelegationToken: 0
Number of CancelDelegationToken: 0
Number of GetFileStatus: 0
Number of CreateDirectory: 0
Number of CreateFile: 0
Number of LookupFile: 0
Number of ListStatus: 0
Number of AddAcl: 0
Number of RemoveAcl: 0
Number of SetAcl: 0
Number of GetAcl: 1
Number of PurgeKeys: 0
Number of ListMultipartUploads: 0
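Most counters in this output are zero, so it can help to filter for the ones that actually moved. The sketch below parses an excerpt of the output above, assuming that a line containing `: ` is a metric and any other line after the title is a section header. `parse_metrics` is a hypothetical helper, and all values are treated as numbers.

```python
SAMPLE = """\
Metrics for `om.protocol.client` (Ozone Manager RPC endpoint)
RPC connections
Open connections: 0
Dropped connections: 0
Received bytes: 2037
Sent bytes: 1760
RPC queue
RPC average queue time: 0.5
RPC call queue length: 0
Message type counters
Number of CreateVolume: 1
Number of SetVolumeProperty: 0
Number of InfoVolume: 2
Number of ServiceList: 4
Number of GetAcl: 1
"""


def parse_metrics(text):
    """Return {section: {metric name: value}} for output shaped like the sample."""
    sections = {}
    current = "(no section)"
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Metrics for"):
            continue  # skip blank lines and the title
        name, sep, value = line.rpartition(": ")
        if sep:
            sections.setdefault(current, {})[name] = float(value)
        else:
            current = line  # a line without ": " starts a new section
            sections.setdefault(current, {})
    return sections


metrics = parse_metrics(SAMPLE)
nonzero = {
    name: value
    for name, value in metrics["Message type counters"].items()
    if value > 0
}
print(nonzero)
```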
Last is the config command, shown here for the scm.replica-manager point:
[hdfs@lyq bin]$ ./ozone insight config scm.replica-manager
Configuration for `scm.replica-manager` (SCM closed container replication manager)
default: 300s
current: 300s
When a heartbeat from the data node arrives on SCM, It is queued for processing with the time stamp of when the heartbeat arrived. There is a heartbeat processing thread inside SCM that runs at a specified interval. This value controls how frequently this thread is run.
There are some assumptions build into SCM such as this value should allow the heartbeat processing thread to run at least three times more frequently than heartbeats and at least five times more than stale node detection time. If you specify a wrong value, SCM will gracefully refuse to run. For more info look at the node manager tests in SCM.
In short, you don't need to change this.
default: 10m
current: 10m
Timeout for the container replication/deletion commands sent to datanodes. After this timeout the command will be retried.
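Because every setting is printed with both a default and a current value, overridden settings are easy to spot. The sketch below works on an abbreviated copy of the output above and pairs each `default:` line with the `current:` line that follows it. The configuration key names do not appear in the transcript as shown, so entries are identified by position only; `parse_config_entries` is a hypothetical helper.

```python
SAMPLE = """\
Configuration for `scm.replica-manager` (SCM closed container replication manager)
default: 300s
current: 300s
In short, you don't need to change this.
default: 10m
current: 10m
Timeout for the container replication/deletion commands sent to datanodes. After this timeout the command will be retried.
"""


def parse_config_entries(text):
    """Return a list of {default, current, description} dicts, in output order."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Configuration for"):
            continue  # skip blank lines and the title
        if line.startswith("default:"):
            entries.append({
                "default": line.split(":", 1)[1].strip(),
                "current": None,
                "description": [],
            })
        elif line.startswith("current:") and entries:
            entries[-1]["current"] = line.split(":", 1)[1].strip()
        elif entries:
            entries[-1]["description"].append(line)
    return entries


entries = parse_config_entries(SAMPLE)
overridden = [e for e in entries if e["current"] != e["default"]]
print(len(entries), "settings,", len(overridden), "overridden")
```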
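The config output above shows the configuration related to the insight point. It is quite user-friendly: for each setting it gives not only the current value but also the default value and a description.
After trying the tool, I have to say that Ozone's insight tooling is very usable. Its core idea is to define insight points on key services and then expose their information externally.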
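The idea of defining insight points on key services and exposing their information can be illustrated with a small registry. This is a conceptual sketch only and not Ozone's actual implementation: the class, the registry, the metric name, and the configuration key are all hypothetical placeholders. Only the logger class name is taken from the log sample earlier in this article.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class InsightPoint:
    """Hypothetical description of one insight point."""
    name: str
    description: str
    loggers: List[str]                         # logger classes to stream
    read_metrics: Callable[[], Dict[str, float]]  # returns current metric values
    config_keys: List[str]                     # related configuration keys


REGISTRY: Dict[str, InsightPoint] = {}


def register(point):
    REGISTRY[point.name] = point


def show(kind, name):
    """Dispatch a log / metric / config request to the named insight point."""
    point = REGISTRY[name]
    if kind == "log":
        return point.loggers
    if kind == "metric":
        return point.read_metrics()
    if kind == "config":
        return point.config_keys
    raise ValueError(f"unknown kind: {kind}")


register(InsightPoint(
    name="scm.node-manager",
    description="SCM Datanode management related information.",
    loggers=["org.apache.hadoop.hdds.scm.node.SCMNodeManager"],
    read_metrics=lambda: {"example_counter": 1.0},  # placeholder metric
    config_keys=["example.config.key"],             # placeholder key
))

print(show("log", "scm.node-manager"))
```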
[2] https://issues.apache.org/jira/browse/HDDS-1935, Improve the visibility with Ozone Insight tool