Hacking Nomad Job Dependencies

One of the oft-requested features for HashiCorp Nomad is job dependencies. I took some time this week to see what can be accomplished for simple workflows using Nomad-native features.

Most solutions for Nomad job dependencies involve external integrations with general-purpose frameworks like Apache Airflow. These frameworks provide domain-specific mechanisms for describing workflows and managing their execution. In practice, that means running a separate program which manages the lifecycle of the workflow’s components by talking to Nomad’s API. The benefit is a powerful and portable syntax for specifying jobs. The cost, however, is the effort to develop and maintain these integration projects, as well as the runtime complexity of the resulting orchestrator-over-orchestrator pattern.

Nomad 1.0 introduced a large number of features. One of these, written by my friend and colleague Jasmine Dahilig, is support for poststop tasks in the task group lifecycle. I wanted to see what it would take to let users specify simple workflow graphs, with each node given as a task group fragment plus a list of edges to the next fragments, and to use poststop tasks to run the next stage in the workflow.

My first inclination was to implement this using Levant. However, Nomad 1.0 had just introduced support for HCL2 parsing in job specifications, so I looked at that as an alternative. Unfortunately, our current parsing doesn’t seem to support including HCL from files (this support is currently being designed). I therefore returned to my original plan: to develop a generic Levant template that reads a workflow spec and renders a Nomad job which implements the workflow using poststop hooks.

For simple graphs, this turned out to be relatively straightforward. The Levant template takes a list of fragments and maps each to a task group. In addition to the content of the task group, each stage receives an extra poststop task, which is responsible for starting the next stages (as specified in the dependency graph). A stage is executed by scaling its group count (initially set to zero in the template) up to 1. The scaling is done by a simple executable which, for simplicity, I packaged as a Docker image.
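To make the scaling step concrete, here is a minimal Python sketch of what such a poststop helper could do. It is not the executable from the repo. It assumes the helper uses Nomad’s job scale endpoint (`POST /v1/job/:job_id/scale`), that task group names match the stage names in the dependency graph, and that the agent listens on the default local address; the function names are hypothetical.

```python
import json
import urllib.request


def build_scale_requests(graph, finished, job_id, addr="http://127.0.0.1:4646"):
    """Return one (url, payload) pair per stage that follows `finished`.

    Assumption: each stage name in `graph` is also the name of a task group.
    """
    requests = []
    for group in graph.get(finished, []):
        payload = {
            "Count": 1,  # the groups start at zero, so 1 runs the stage
            "Target": {"Group": group},
            "Message": f"started by poststop task of stage '{finished}'",
        }
        requests.append((f"{addr}/v1/job/{job_id}/scale", payload))
    return requests


def start_next_stages(graph, finished, job_id, addr="http://127.0.0.1:4646"):
    """Send the scale requests to a Nomad agent (needs a reachable agent)."""
    for url, payload in build_scale_requests(graph, finished, job_id, addr):
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()
```

Keeping the request construction separate from the network call means the edge-following logic can be checked without a running cluster.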

The upshot is that you can specify your workflow in a compact, minimal representation: a list of fragments and a dependency graph between them (expressed here in JSON):

{
  "job_id": "dags",
  "job_name": "you like dags? 🐕",
  "fragments": {
    "a": "frag-a.hcl",
    "b": "frag-b.hcl",
    "c": "frag-c.hcl",
    "d": "frag-d.hcl"
  }, 
  "graph": {
    "": ["a"],
    "a": ["b"],
    "b": ["c","d"],
    "c": [],
    "d": []
  }     
}

This specification is passed to Levant as an input file and used to render the full Nomad job:

> levant deploy -var-file=your-job-goes-here.json builder.tmpl

Here is an animation from the Nomad UI, running the simple graph specified above:

The current implementation is a simple proof of concept; it’s missing a number of features and safeties that you would normally want. For example, it doesn’t check that all groups in the same stage are complete. Nor does it make any attempt to enforce that the dependency graph is a DAG; as a result, it’s possible to run jobs with cycles.
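The missing DAG check would be easy to add before rendering the template. The following is a sketch in Python, not part of the published POC; it assumes the `graph` object has the same shape as in the JSON spec above (stage name mapped to a list of next stages), and the function name `find_cycle` is made up for illustration. It uses a depth-first search that reports a cycle when it reaches a stage still on the current path.

```python
def find_cycle(graph):
    """Return a list of stages forming a cycle, or None if the graph is a DAG."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    color = {n: WHITE for n in nodes}

    def visit(node, path):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                # Back edge: nxt is already on the current path.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt, path)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    # Sorted order keeps the reported cycle deterministic.
    for node in sorted(nodes):
        if color[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None
```

A wrapper around `levant deploy` could call this on the spec and refuse to render when it returns anything other than `None`.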

I’ve published a repo containing the template, the poststop task, and a few examples. I’m not planning to do anything else with this. In fact, I’ve already spent more time writing this post than I did implementing the POC. I just wanted to showcase the Nomad task lifecycle system and Levant, and how easy it is to use these tools to simplify specifying and running workloads on Nomad.

Demo animation for simple DAG.

Chris Baker