What?
https://www.tutorialspoint.com/apache_nifi/index.htm
An open source data extraction (ingestion) platform.
Apache NiFi is an open source data ingestion platform. It was originally developed by the NSA and is now maintained and further developed under the Apache Software Foundation. It is written in Java and uses an embedded Jetty web server. It is licensed under the Apache License, Version 2.0. The linked tutorial explains the basics of Apache NiFi and its features.
http://nifi.apache.org/docs.html
Put simply, NiFi was built to automate the flow of data between systems. While the term 'dataflow' is used in a variety of contexts, we use it here to mean the automated and managed flow of information between systems. This problem space has been around ever since enterprises had more than one system, where some of the systems created data and some of the systems consumed data. The problems and solution patterns that emerged have been discussed and articulated extensively. A comprehensive and readily consumed form is found in the Enterprise Integration Patterns [eip].
The core concepts of NiFi
http://nifi.apache.org/docs.html
NiFi’s fundamental design concepts closely relate to the main ideas of Flow Based Programming [fbp]. Here are some of the main NiFi concepts and how they map to FBP:
Each entry below gives the NiFi term, the corresponding FBP term, and a description.
- FlowFile (FBP term: Information Packet)
A FlowFile represents each object moving through the system and for each one, NiFi keeps track of a map of key/value pair attribute strings and its associated content of zero or more bytes.
- FlowFile Processor (FBP term: Black Box)
Processors actually perform the work. In [eip] terms a processor is doing some combination of data routing, transformation, or mediation between systems. Processors have access to attributes of a given FlowFile and its content stream. Processors can operate on zero or more FlowFiles in a given unit of work and either commit that work or rollback.
- Connection (FBP term: Bounded Buffer)
Connections provide the actual linkage between processors. These act as queues and allow various processes to interact at differing rates. These queues can be prioritized dynamically and can have upper bounds on load, which enable back pressure.
- Flow Controller (FBP term: Scheduler)
The Flow Controller maintains the knowledge of how processes connect and manages the threads and allocations thereof which all processes use. The Flow Controller acts as the broker facilitating the exchange of FlowFiles between processors.
- Process Group (FBP term: subnet)
A Process Group is a specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner, process groups allow creation of entirely new components simply by composition of other components.
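To make the terms above concrete, here is a minimal toy model in Python. It is only a sketch of the ideas (FlowFile, Connection with an upper bound, a processor doing one unit of work) and not NiFi's actual Java API. All class and function names (`FlowFile`, `Connection`, `uppercase_processor`, `run_demo`) are hypothetical and invented for these notes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy stand-in for a FlowFile: string attributes plus content bytes."""
    attributes: dict = field(default_factory=dict)
    content: bytes = b""


class Connection:
    """Bounded queue between processors; a full queue signals back pressure."""

    def __init__(self, max_queued: int):
        self.max_queued = max_queued
        self._queue = deque()

    def is_full(self) -> bool:
        return len(self._queue) >= self.max_queued

    def offer(self, flowfile: FlowFile) -> bool:
        # Refuse the FlowFile once the upper bound is reached (back pressure).
        if self.is_full():
            return False
        self._queue.append(flowfile)
        return True

    def poll(self):
        # Return the oldest queued FlowFile, or None when the queue is empty.
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)


def uppercase_processor(source: Connection, sink: Connection) -> bool:
    """One unit of work: take a FlowFile, transform it, pass it downstream.

    Returns False without consuming input when there is nothing to do or
    when the downstream connection is full.
    """
    if sink.is_full():
        return False
    flowfile = source.poll()
    if flowfile is None:
        return False
    attributes = dict(flowfile.attributes, transformed="true")
    sink.offer(FlowFile(attributes, flowfile.content.upper()))
    return True


def run_demo():
    incoming = Connection(max_queued=10)
    outgoing = Connection(max_queued=2)  # small bound to trigger back pressure
    for i in range(3):
        incoming.offer(FlowFile({"filename": f"file-{i}.txt"}, b"hello"))
    # A trivial stand-in for the flow controller: run the processor until it stops.
    processed = 0
    while uppercase_processor(incoming, outgoing):
        processed += 1
    return processed, len(incoming), len(outgoing)


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the third FlowFile stays in the incoming queue because the outgoing connection has reached its bound of two. That is the back pressure behaviour described for Connection, reduced to its simplest form.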
Architecture - supports clustering
NiFi Architecture
NiFi executes within a JVM on a host operating system. The primary components of NiFi on the JVM are as follows:
- Web Server
The purpose of the web server is to host NiFi’s HTTP-based command and control API.
- Flow Controller
The flow controller is the brains of the operation. It provides threads for extensions to run on, and manages the schedule of when extensions receive resources to execute.
- Extensions
There are various types of NiFi extensions which are described in other documents. The key point here is that extensions operate and execute within the JVM.
- FlowFile Repository
The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is presently active in the flow. The implementation of the repository is pluggable. The default approach is a persistent Write-Ahead Log located on a specified disk partition.
- Content Repository
The Content Repository is where the actual content bytes of a given FlowFile live. The implementation of the repository is pluggable. The default approach is a fairly simple mechanism, which stores blocks of data in the file system. More than one file system storage location can be specified so as to get different physical partitions engaged to reduce contention on any single volume.
- Provenance Repository
The Provenance Repository is where all provenance event data is stored. The repository construct is pluggable with the default implementation being to use one or more physical disk volumes. Within each location event data is indexed and searchable.
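The write-ahead log idea mentioned for the FlowFile Repository can be sketched in a few lines of Python. This is an illustration of the general technique only (append every state change to a durable log, replay the log after a restart) and says nothing about how NiFi implements it internally. The class `WriteAheadLog`, the JSON record layout, and the file name `flowfile.wal` are assumptions made up for this example.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Toy write-ahead log: one JSON record per line, replayed on recovery."""

    def __init__(self, path):
        self.path = path

    def append(self, flowfile_id, state):
        # Persist the change before acting on it; state=None marks removal.
        record = json.dumps({"id": flowfile_id, "state": state})
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the record to disk

    def recover(self):
        # Replay the log from the start; later records override earlier ones.
        states = {}
        if not os.path.exists(self.path):
            return states
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                if record["state"] is None:
                    states.pop(record["id"], None)
                else:
                    states[record["id"]] = record["state"]
        return states


def demo():
    with tempfile.TemporaryDirectory() as directory:
        wal = WriteAheadLog(os.path.join(directory, "flowfile.wal"))
        wal.append("ff-1", {"queue": "conn-a"})
        wal.append("ff-2", {"queue": "conn-a"})
        wal.append("ff-1", {"queue": "conn-b"})  # ff-1 moved to another queue
        wal.append("ff-2", None)                 # ff-2 left the flow
        # A fresh object simulates a restart that only has the log on disk.
        return WriteAheadLog(wal.path).recover()


if __name__ == "__main__":
    print(demo())
```

After the simulated restart only `ff-1` remains, with its latest recorded state. A real implementation would also need checkpointing so the log does not grow without bound, which this sketch leaves out.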
NiFi is also able to operate within a cluster.
Starting with the NiFi 1.0 release, a Zero-Master Clustering paradigm is employed. Each node in a N
Flow configuration examples
https://github.com/xmlking/nifi-examples
csv-to-json
This flow shows how to convert a CSV entry to a JSON document using ExtractText and ReplaceText.
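The two steps of this flow (extract fields with a regular expression, then rewrite the content from those fields) can be imitated in plain Python. This is a sketch of the idea, not the processors' configuration. The three-column layout, the JSON keys `id`, `name`, `city`, and the function names are assumptions chosen for illustration. The attribute naming `csv.1`, `csv.2`, ... follows the capture-group style that ExtractText uses, as far as I understand it.

```python
import json
import re

# Hypothetical three-column CSV layout, chosen only for illustration.
CSV_PATTERN = re.compile(r"^([^,]*),([^,]*),([^,]*)$")


def extract_text(line: str, prefix: str = "csv") -> dict:
    """Mimic the extract step: each capture group becomes an attribute."""
    match = CSV_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"line does not match the expected CSV layout: {line!r}")
    return {f"{prefix}.{i}": value
            for i, value in enumerate(match.groups(), start=1)}


def replace_text(attributes: dict) -> str:
    """Mimic the replace step: build the new content from the attributes."""
    document = {
        "id": attributes["csv.1"],
        "name": attributes["csv.2"],
        "city": attributes["csv.3"],
    }
    # json.dumps escapes quotes and backslashes in the values.
    return json.dumps(document)


if __name__ == "__main__":
    attrs = extract_text("1,Alice,Berlin")
    print(attrs)
    print(replace_text(attrs))
```

Note that a simple regex like this does not handle quoted fields containing commas. For real CSV data a proper CSV parser is the safer choice.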
decompression
This flow demonstrates taking an archive that is created with several levels of compression and then continuously decompressing it using a loop until the archived file is extracted out.
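The loop in this flow can be imitated outside NiFi. The sketch below only handles gzip layers and detects them by the gzip magic bytes; the real flow may deal with other archive formats and detect them differently. The function name `decompress_fully` and the `max_rounds` guard are assumptions added for this example.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decompress_fully(data: bytes, max_rounds: int = 20):
    """Strip gzip layers in a loop until the payload is no longer gzip.

    Returns the innermost payload and the number of layers removed.
    """
    rounds = 0
    while data[:2] == GZIP_MAGIC:
        if rounds >= max_rounds:
            # Guard against endless loops on pathological input.
            raise ValueError("too many compression layers")
        data = gzip.decompress(data)
        rounds += 1
    return data, rounds


if __name__ == "__main__":
    payload = b"hello nifi"
    wrapped = gzip.compress(gzip.compress(gzip.compress(payload)))
    print(decompress_fully(wrapped))
```

The loop condition plays the role of the routing decision in the flow: as long as the content still looks compressed, it is sent back through the decompression step.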
Getting Started with Apache NiFi
https://nifi.apache.org/docs/nifi-docs/html/getting-started.html#downloading-and-installing-nifi
Apache NiFi User Guide
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
Tutorial
Read files and upload them to MongoDB
https://dzone.com/articles/gentle-introduction-to-apache-nifi-for-dataflow-an
Processors
https://nifichina.github.io/general/GettingStarted.html#%E6%9C%89%E5%93%AA%E4%BA%9B%E7%B1%BB%E5%88%AB%E7%9A%84%E5%A4%84%E7%90%86%E5%99%A8
Simple hands-on practice
https://www.cnblogs.com/h--d/p/10079418.html