A pipeline is the top-level entity in the HERE platform that performs your data processing tasks. When you develop your own special-purpose pipeline, multiple versions of it can be stored and managed by the HERE platform (see Pipeline Version below). Developing a pipeline is an iterative process, and it is typical to upload new versions of the pipeline code and test each one with different input and output catalogs or settings. After testing, you can put the pipeline into production with live data.
Many elements go into making a pipeline. The following list defines the most important components.
Pipeline Template – This is the immutable definition of an executable pipeline version and its run-time properties on the platform. It is the starting point for creating an executable pipeline version (see below). The pipeline template defines the actual run-time implementation of the pipeline and the logical identifiers for the input and output catalogs it will use. One pipeline template can be used to create multiple pipeline versions. Each template is assigned a unique Template ID (UUID) by the platform when it is created. Pipelines and pipeline templates are independent entities, and each gets its own HRN that uniquely identifies it on the platform. The pipeline template contains the JAR file as well as all the configuration information necessary to access, process, and store data.
Pipeline Version – This is an immutable entity that is the executable form of a pipeline within the HERE platform. Each pipeline version is created from a pipeline template that contains a JAR file. Each pipeline version is assigned its own Pipeline Version ID (UUID) by the platform when it is created. Multiple pipeline versions can be defined based on a single pipeline JAR file. However, two instances of the same pipeline version (that is, with the same Pipeline Version ID) cannot run at the same time.
Package – This is an immutable entity that represents a JAR file uploaded to the HERE platform. It contains compiled pipeline code and its libraries. Because the libraries are embedded directly in the JAR file, it is actually a fat JAR, which cannot exceed 500 MB in size. Alongside the binary artifact, the package stores metadata such as the filename, which is limited to 200 characters. The best practice is to choose a semantically meaningful filename that includes some form of version identifier. The JAR file is referenced by the pipeline template and, when uploaded to the platform, is assigned a unique Package ID (UUID).
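The two stated limits (a 200-character filename and a 500 MB fat JAR) can be verified before upload. The following sketch is purely illustrative and is not a platform API; the class name, method name, and example filename are hypothetical.

```java
import java.io.File;

// Hypothetical pre-upload sanity check for a pipeline fat JAR, based on
// the limits stated above (200-character filename, 500 MB size).
// This is NOT a platform API; it only illustrates the constraints.
public class PackageCheck {
    static final long MAX_SIZE_BYTES = 500L * 1024 * 1024; // 500 MB limit
    static final int MAX_NAME_LENGTH = 200;                // filename limit

    public static boolean isUploadable(File jar) {
        return jar.getName().endsWith(".jar")
                && jar.getName().length() <= MAX_NAME_LENGTH
                && jar.length() <= MAX_SIZE_BYTES;
    }

    public static void main(String[] args) {
        // A semantically meaningful, versioned filename as recommended above.
        File jar = new File("my-pipeline-assembly-2.1.0.jar");
        System.out.println("uploadable: " + isUploadable(jar));
    }
}
```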
Job – A Job is an immutable entity that represents the one-time configuration parameters and input catalogs that are submitted to the cluster by a running pipeline version for that job only. If it specifies the exact versions of the input catalogs to be processed, that information overrides the corresponding information in the pipeline template. You can obtain the list of running or historical Jobs. Jobs have a state that changes over time as the Job runs and terminates.
Stream Job – For stream processing, a Job is not intended to terminate by itself. These jobs typically process a continuing stream of input data.
Batch Job – For batch processing, a Job terminates by either succeeding or failing to process the available input data. These jobs typically run on one or more finite collections of data.
Input and Output Catalogs – The input catalog is the data source for the pipeline, and the output catalog is the data destination from the pipeline. A pipeline can have more than one input catalog, but can only have one output catalog.
For a Stream pipeline, you may need to specify catalog versions, depending on the type of catalog layer used (that is, versioned, volatile, or streaming).
For a Batch pipeline, you can choose to run the pipeline immediately or schedule it to run when the input catalog data is updated. To run the pipeline immediately, you must specify the catalog versions. To schedule the pipeline, you do not need to specify the catalog versions. Instead, the pipeline scheduler checks the input and output catalogs every five minutes to capture changes and to verify that the versions of all the catalogs to be processed are consistent. For example, assume a pipeline with two input catalogs that both depend on the same upstream catalog. One input catalog already includes the changes from upstream version 5, while the other has not yet processed version 5. The pipeline cannot run until both input catalogs reflect a consistent upstream version.
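The consistency rule in the example above can be sketched as follows. This is an illustrative model, not platform code: each input catalog is represented as a map from an upstream catalog name to the upstream version it has incorporated, and a scheduled run is allowed only when every shared upstream catalog is at the same version across all inputs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the version-consistency rule described above
// (not platform code). Each input catalog reports, per upstream catalog,
// the upstream version it has incorporated; a scheduled batch run is
// allowed only when every shared upstream catalog has the same version
// across all inputs.
public class ConsistencyCheck {
    public static boolean consistent(List<Map<String, Integer>> inputs) {
        Map<String, Integer> seen = new HashMap<>();
        for (Map<String, Integer> input : inputs) {
            for (Map.Entry<String, Integer> e : input.entrySet()) {
                Integer prev = seen.putIfAbsent(e.getKey(), e.getValue());
                if (prev != null && !prev.equals(e.getValue())) {
                    return false; // same upstream catalog at different versions
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // One input has processed upstream version 5, the other only 4:
        // the scheduled run must wait, exactly as in the example above.
        System.out.println(consistent(List.of(
                Map.of("upstream", 5), Map.of("upstream", 4)))); // prints false
    }
}
```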
Batch jobs can therefore run in one of two modes: immediately (on demand, with explicit catalog versions) or on a schedule, triggered when the input catalog data is updated.
There are other equally important components in the form of commands and configurations that enable you to create your customized pipeline.
Operation – These are special operational commands that can be submitted to a pipeline version. The pipeline version can accept or reject them; a rejection typically occurs because the command is invalid or another Operation is already pending. Operations have a state that can be checked to see whether an Operation is still in progress or has completed with a result.
Runtime Configuration – A set of parameters that can be specified at run time to configure the pipeline's run-time environment. The pipeline template specifies the default configuration parameters, some of which can be overridden in a specific job's configuration. In a pipeline version, the runtime configuration specifies the actual configuration passed to Jobs submitted for processing by that pipeline version. Custom configuration parameters are placed on the pipeline's classpath as application.properties, which you can reference from within the pipeline code. For more information, see the Configuration File Reference.
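Because application.properties is on the classpath, the standard JDK mechanisms are enough to read it from pipeline code. The snippet below is a minimal sketch; the property key "my.custom.setting" is an example, not a platform-defined name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Minimal sketch: reading the runtime configuration that the platform
// places on the pipeline's classpath as application.properties. The
// property key "my.custom.setting" is only an example.
public class RuntimeConfig {
    public static Properties load() {
        Properties props = new Properties();
        try (InputStream in = RuntimeConfig.class.getClassLoader()
                .getResourceAsStream("application.properties")) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            // Treat an unreadable file as an empty configuration.
        }
        return props;
    }

    public static void main(String[] args) {
        // Falls back to "default" when the key (or the file) is absent.
        System.out.println(load().getProperty("my.custom.setting", "default"));
    }
}
```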
Scheduler Configuration (or SchedulerConfig) – This is a property used in each pipeline version. The scheduler controls when a Job is created and submitted to the Flink or Spark cluster for processing. The scheduler may start a new Job when the previous one completes, whether successfully or not. It polls or waits for changes from upstream catalogs, and can also be driven by timers or other external triggers. The configuration includes properties such as when to start a Job, whether to restart terminated Jobs, and polling intervals for upstream catalogs. For more information, see the Configuration File Reference.
Cluster Configuration (or ClusterConfig) – This is a property used with both the pipeline template and the pipeline version. In a pipeline template, it represents the suggested minimum size of the cluster needed to run a pipeline version based on that template. In a pipeline version, it represents the actual size and configuration of the cluster dedicated to the execution of that specific pipeline version. The configuration specifies properties such as the size of the cluster in processing units, each with a fixed number of CPUs and amount of memory. The cluster configuration includes the following parameters. For more details, see the Quotas & Limits article and the Configuration File Reference.
- Number of resource units per Supervisor (Flink JobManager or Spark Driver)
- Number of resource units per Worker (Flink TaskManager or Spark Executor)
- Number of Workers (number of Flink TaskManagers or Spark Executors)
pipeline-config.conf – This file lists the parameters describing the input catalogs, output catalog, and billing tag. The pipeline uses this information to determine whether the catalogs have changed and whether a scheduled batch pipeline should run to process the changes. For more information, see the Configuration File Reference.
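As a rough sketch of what such a file contains, the fragment below shows placeholder HRNs, a placeholder input-catalog identifier, and an example billing tag; the exact keys and structure are defined in the Configuration File Reference, so treat this only as an illustration.

```
# Illustrative pipeline-config.conf sketch. All HRNs, the input-catalog
# identifier, and the billing tag are placeholders; consult the
# Configuration File Reference for the authoritative keys.
pipeline.config {
  output-catalog { hrn = "hrn:here:data:::example-output" }
  input-catalogs {
    example-input { hrn = "hrn:here:data:::example-input" }
  }
  billing-tag = "example-billing-tag"
}
```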
pipeline-job.conf – This configuration file lists the parameters describing the input catalog versions and the processing type. For Batch pipelines run in on-demand mode, users can either accept the default values or provide specific values. For more information, see the Configuration File Reference.
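A sketch of such a file is shown below. The version numbers, the input-catalog identifier, and the processing-type value are examples only; the exact keys and the set of valid processing types are defined in the Configuration File Reference.

```
# Illustrative pipeline-job.conf sketch. Version numbers, the catalog
# identifier, and the processing type are example values; consult the
# Configuration File Reference for the authoritative keys.
pipeline.job.catalog-versions {
  output-catalog { base-version = 41 }
  input-catalogs {
    example-input {
      processing-type = "reprocess"
      version = 4
    }
  }
}
```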