
State Transition Details


Job Creation

[!NOTE] A new job id will be written in 3 places

When applying a change to the job queue, the following sequence should be used

Locking Batches and Jobs

Locks on Jobs and Batches should be implemented with a Zookeeper ephemeral lock. If a Zookeeper client process terminates, its ephemeral locks are released.

Acquiring a Batch

/batches/BID/lock: #ephemeral node to be held by a consumer daemon

Acquiring a Job

/jobs/JID/lock: #ephemeral node to be held by a consumer daemon
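
The ephemeral lock behavior described above can be sketched without a live Zookeeper ensemble. The following is a minimal in-memory model, not a real client: the `InMemoryZk` class, its method names, and the session ids are all hypothetical and exist only to illustrate that a second consumer cannot take a held lock and that the lock disappears when the holder's session ends.

```python
class InMemoryZk:
    """Tiny stand-in for a Zookeeper ensemble; models ephemeral nodes only."""

    def __init__(self):
        self._owners = {}  # path -> session id that created the node

    def create_ephemeral(self, path, session_id):
        """Create the node if absent; return True when the caller now holds it."""
        if path in self._owners:
            return False
        self._owners[path] = session_id
        return True

    def delete(self, path, session_id):
        """Explicit release by the holder (the '# DELETE .../lock' step)."""
        if self._owners.get(path) == session_id:
            del self._owners[path]

    def close_session(self, session_id):
        """Simulate client termination: all of its ephemeral nodes vanish."""
        self._owners = {p: s for p, s in self._owners.items() if s != session_id}


def lock_path(kind, node_id):
    """Build /batches/BID/lock or /jobs/JID/lock."""
    return f"/{kind}/{node_id}/lock"


zk = InMemoryZk()
path = lock_path("jobs", "jid0001")
first = zk.create_ephemeral(path, session_id="daemon-a")   # lock acquired
second = zk.create_ephemeral(path, session_id="daemon-b")  # refused, already held
zk.close_session("daemon-a")                               # holder terminates
third = zk.create_ephemeral(path, session_id="daemon-b")   # lock is free again
```

A real consumer daemon would issue the equivalent create and delete calls through a Zookeeper client library; the model only captures the ownership rules.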

Consumer Daemons to Create

Batch: API Call Triggers the Creation and Queuing of a Batch

User submits a submission payload.
A batch is created using the payload URL. Regardless of the type of submission, the payload should be represented as a URL. This step should be as lightweight as possible.


submitter: submitter
type: file # container or file in the case of a zip deposit
profile: profile
payload_url: payload_url
- file1.checkm loc001
- file2.checkm loc002 
- file3.checkm loc003 ark123

ZK Nodes

  profile_name: profile_name
  submitter: submitter
  payload_url: payload_url
  erc_what: title
  erc_who: author
  erc_when: date
  type: file
  submission_mode: add
  status: pending
  last_modified: now

Batch: Acquire Pending Batch

Identifying Pending Batches

Create Lock

/batches/bid0001/lock: #ephemeral

State Description

If a Collection Hold is in place, change status to Held and stop processing.

The differences in batch submission types (single file, object manifest, manifest of manifests, manifest of containers) should be handled at this phase. One job will be spawned for each object that needs to be created for the payload.

If configured in the profile, a summary email should be sent to the depositor confirming the queueing of the batch of jobs.
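
As an illustration of "one job per object", the sketch below turns manifest lines shaped like the example payload (`file3.checkm loc003 ark123`) into job records. The function name, the sequential `jidNNNN` numbering, and the reduced set of fields are assumptions made for this example; the real job id assignment is not specified here.

```python
def jobs_from_manifest(lines, batch_id):
    """Build one pending job record per non-empty manifest line.

    Each line is assumed to be: payload_url [local_id [primary_id]],
    optionally prefixed with "- " as in the example payload.
    """
    jobs = []
    for line in lines:
        text = line.strip()
        if text.startswith("- "):
            text = text[2:]
        fields = text.split()
        if not fields:
            continue
        seq = len(jobs) + 1  # hypothetical numbering scheme
        job = {
            "jid": f"jid{seq:04d}",
            "batch_id": batch_id,
            "payload_url": fields[0],
            "local_id": fields[1:2],  # list with zero or one entry
            "status": "pending",
            "retry_count": 0,
        }
        if len(fields) > 2:
            job["primary"] = fields[2]
        jobs.append(job)
    return jobs


manifest = [
    "- file1.checkm loc001",
    "- file2.checkm loc002 ",
    "- file3.checkm loc003 ark123",
]
jobs = jobs_from_manifest(manifest, "bid0001")
```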

Batch: Pending to Held

If the collection is in a held state, the batch should move to a held status. An administrative action is necessary to release the hold.

  status: held 
  last_modified: now
# DELETE /batches/bid0001/lock

Batch: Pending –> Processing

  status: processing
  last_modified: now
# DELETE /batches/bid0001/lock
Output

Job Details
  batch_id: bid0001
  profile_name: profile_name
  submitter: submitter
  payload_url: file1.checkm
  payload_type: object_manifest
  response_type: response_type
  response_type: tbd 
  submission_mode: add
  working_dir: /zfs/queue/bid0001/jid0001
  local_id: [loc001]
  status: pending
  last_successful_status: #nil
  last_modification_date: now
  retry_count: 0
/jobs/jid0001/priority: 5
  batch_id: bid0001
  profile_name: profile_name
  submitter: submitter
  payload_url: file2.checkm
  payload_type: object_manifest
  response_type: response_type
  response_type: tbd
  submission_mode: add
  working_dir: /zfs/queue/bid0001/jid0002
  local_id: loc002
  local_id: [loc002]
/jobs/jid0002/status:
  status: pending
  last_successful_status: #nil
  last_modification_date: now
  retry_count: 0
/jobs/jid0002/priority: 5
  batch_id: bid0001
  profile_name: profile_name
  submitter: submitter
  payload_url: file3.checkm
  payload_type: object_manifest
  response_type: response_type
  response_type: tbd 
  submission_mode: add
  working_dir: /zfs/queue/bid0001/jid0003
  primary: ark123
  local_id: [loc003]
/jobs/jid0003/status:
  status: pending
  last_successful_status: #nil
  last_modification_date: now
  retry_count: 0
/jobs/jid0003/priority: 5

Place jobs in job queue, allowing sorting by priority

/jobs/states/pending/05-jid0001: #no data - acts as a reference
/jobs/states/pending/05-jid0002: #no data - acts as a reference
/jobs/states/pending/05-jid0003: #no data - acts as a reference
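
The zero-padded priority prefix is what makes these reference nodes sortable: a plain lexicographic sort of the child names orders them by priority, then by job id. The sketch below shows the naming and selection logic. The helper names are hypothetical, and it assumes that a lower priority number is served first, which this document implies but does not state outright.

```python
def queue_node(state, priority, jid):
    """Path of the reference node for a job in a given state queue."""
    return f"/jobs/states/{state}/{priority:02d}-{jid}"


def next_job(children):
    """Pick the child name that sorts first, or None for an empty queue."""
    return min(children) if children else None


node = queue_node("pending", 5, "jid0001")
chosen = next_job(["10-jid0001", "05-jid0003", "05-jid0002"])
```

Note that the two-digit padding only sorts correctly for priorities from 0 to 99.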

Place job references in batch queue

/batches/bid0001/states/batch-processing/jid0001: #no data - acts as a reference
/batches/bid0001/states/batch-processing/jid0002: #no data - acts as a reference
/batches/bid0001/states/batch-processing/jid0003: #no data - acts as a reference

Batch: Held –> Pending (Admin Action)

An administrative action is performed to release a “Held” batch.
After confirming that the target collection is no longer “Held”, proceed to the Processing step.

  status: pending 
  last_modified: now

The Job Queue

The Job Queue runs independently from the Batch Queue

Job: Acquire Pending Job

Create Lock

/jobs/jid0001/lock: #ephemeral

Job: Pending –> Failed

A job will immediately fail under the following conditions

Recovery is not possible under these conditions. A new submission will be required.

  status: failed
  last_successful_status: #nil
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/pending/05-jid0001:
/jobs/states/failed/05-jid0001: #no data - acts as a reference
# DELETE /batches/bid0001/states/batch-processing/jid0001:
/batches/bid0001/states/batch-failed/jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Pending –> Held

The job will be kept in a Held state until an administrative action releases the job.

  status: held
  last_successful_status: #nil
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/pending/05-jid0001:
/jobs/states/held/05-jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Pending –> Estimating

Once a job is acquired, it will move to an Estimating step.


  status: estimating
  last_successful_status: #nil
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/pending/05-jid0001:
/jobs/states/estimating/05-jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Held –> Pending (Admin Action)

Job is administratively released back to a Pending status.


  status: pending
  last_successful_status: #nil
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/held/05-jid0001:
/jobs/states/pending/05-jid0001: #no data - acts as a reference

Job: Acquire Estimating Job

Create Lock

/jobs/jid0001/lock: #ephemeral

State Description

The first step of a job is to estimate the resources that will be needed to process the job. This will be accomplished by running HEAD requests for the content to be ingested and calculating a size estimate for the object. If a job is excessively large, the job priority may be adjusted.

The estimating step does not fail. If a proper size calculation cannot be made for a job, the space_needed should be set to 0 and job priority may be adjusted.


/jobs/jid0002/space_needed: 1000000000
/jobs/jid0002/priority: 10
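
The estimating rules can be sketched as two small functions. This is an illustration only: the inputs stand for the Content-Length values returned by the HEAD requests (with `None` for a missing value), and the `large_job_bytes` threshold and `large_job_priority` default are hypothetical parameters, since the actual criteria for "excessively large" are not defined here.

```python
from typing import Iterable, Optional


def estimate_space_needed(content_lengths: Iterable[Optional[int]]) -> int:
    """Sum the reported sizes; fall back to 0 when any size is unknown.

    Returning 0 instead of raising mirrors the rule that estimating never fails.
    """
    total = 0
    for length in content_lengths:
        if length is None or length < 0:
            return 0
        total += length
    return total


def adjusted_priority(priority, space_needed, large_job_bytes, large_job_priority=10):
    """Demote large jobs to a higher priority number; leave others unchanged."""
    if space_needed >= large_job_bytes:
        return max(priority, large_job_priority)
    return priority


known = estimate_space_needed([400, 600])
unknown = estimate_space_needed([400, None])
```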

Job: Estimating –> Provisioning

  status: provisioning
  last_successful_status: estimating
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/estimating/05-jid0001:
/jobs/states/provisioning/10-jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock
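
Every job transition in this document follows the same pattern: update the status node, move the queue reference from the old state to the new one (renaming it if the priority changed), then release the lock. A minimal sketch of that sequence follows. The function name and the `(action, path)` tuple format are assumptions for illustration, and the status payload itself is left out.

```python
def transition_ops(jid, old_state, new_state, old_priority, new_priority=None):
    """List the Zookeeper operations for one job state change, in order."""
    if new_priority is None:
        new_priority = old_priority
    return [
        ("set", f"/jobs/{jid}/status"),
        ("delete", f"/jobs/states/{old_state}/{old_priority:02d}-{jid}"),
        ("create", f"/jobs/states/{new_state}/{new_priority:02d}-{jid}"),
        ("delete", f"/jobs/{jid}/lock"),
    ]


ops = transition_ops("jid0001", "estimating", "provisioning", 5, 10)
```

Releasing the lock last means a crash partway through leaves the job locked only until the session expires, rather than exposing a half-moved job to other consumers.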

Job: Acquire Provisioning Job

Create Lock

/jobs/jid0001/lock: #ephemeral

State Description

The Provisioning state will be used to determine if there are sufficient system resources for a job to proceed. At the simplest level, this state would allow us to throttle all subsequent ingests if our ZFS capacity is insufficient to support a specific download. Unestimated jobs should be held in this state if the ZFS capacity is below a specific threshold.

Additionally, this state could be used to hold a job while resources are dynamically provisioned from AWS. This will not be a feature of the initial release.

Jobs that fail the provisioning test will remain in this state, so it is important that ALL jobs in this state get evaluated. If some jobs are retained in the provisioning state, it might make sense for the provisioning thread to sleep between tests.
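
One possible form of the provisioning test is sketched below. The rule used here (a job may proceed if the free space remaining after its download stays at or above a reserve) is an assumption, as are the function and parameter names. For an unestimated job, `space_needed` is 0, so the check reduces to comparing current free space against the threshold.

```python
def can_provision(space_needed, free_bytes, min_free_bytes):
    """True if the job fits while keeping min_free_bytes in reserve."""
    return free_bytes - space_needed >= min_free_bytes


def select_provisionable(jobs, free_bytes, min_free_bytes):
    """Evaluate ALL jobs in the provisioning state, not just the first.

    jobs is an iterable of (jid, space_needed) in queue order. Space for
    each released job is deducted before the next job is tested.
    """
    released = []
    for jid, space_needed in jobs:
        if can_provision(space_needed, free_bytes, min_free_bytes):
            released.append(jid)
            free_bytes -= space_needed
    return released


waiting = [("jid0001", 950), ("jid0002", 0), ("jid0003", 50)]
released = select_provisionable(waiting, free_bytes=1000, min_free_bytes=100)
```

In this example the first job is too large and stays in the provisioning state, while the two jobs behind it are still evaluated and released.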

Job: Provisioning –> Downloading

  status: downloading
  last_successful_status: provisioning
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/provisioning/10-jid0001:
/jobs/states/downloading/10-jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Acquire Downloading Job

Create Lock

/jobs/jid0001/lock: #ephemeral

State Description

The Downloading step performs the following actions

Output (if changes detected)

/jobs/jid0001/space_needed: 1000000000
/jobs/jid0001/priority: 10

Job: Downloading –> Processing

  status: processing
  last_successful_status: downloading
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/downloading/10-jid0001:
/jobs/states/processing/10-jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Downloading –> Failed (downloading)

If any individual download does not succeed (after a set number of retries), the job will go to a failed state.

  status: failed
  last_successful_status: provisioning # retain_prior_value
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/downloading/10-jid0001:
/jobs/states/failed/10-jid0001: #no data - acts as a reference
# DELETE /batches/bid0001/states/batch-processing/jid0001:
/batches/bid0001/states/batch-failed/jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Acquire Processing Job

Create Lock

/jobs/jid0001/lock: #ephemeral

State Description

The processing step is where the bulk of Merritt Ingest processing takes place


  primary: 555
  local_id: [loc002]

Job: Processing –> Recording

  status: recording
  last_successful_status: processing
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/processing/10-jid0001:
/jobs/states/recording/10-jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Processing –> Failed (processing)

Jobs may fail processing due to minting failure or storage failures.

  status: failed
  last_successful_status: downloading # retain_prior_value
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/processing/10-jid0001:
/jobs/states/failed/10-jid0001: #no data - acts as a reference
# DELETE /batches/bid0001/states/batch-processing/jid0001:
/batches/bid0001/states/batch-failed/jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Acquire Recording Job

Create Lock

/jobs/jid0001/lock: #ephemeral

State Description

This step will be processed by the Merritt Inventory service.

This will satisfy one of the key motivations for the queue redesign effort.
Because the inventory step is processed from the ingest queue, depositors are notified only after content has been recorded in inventory and is accessible from Merritt. Previously, it was possible that depositors were notified of a successful ingest BEFORE content had been recorded in inventory.

Job: Recording –> Notify

  status: notify
  last_successful_status: recording
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/recording/10-jid0001:
/jobs/states/notify/10-jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Recording –> Failed (recording)

This status change indicates that an error occurred while recording an object change in the inventory database.

  status: failed
  last_successful_status: processing # retain_prior_value
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/recording/10-jid0001:
/jobs/states/failed/10-jid0001: #no data - acts as a reference
# DELETE /batches/bid0001/states/batch-processing/jid0001:
/batches/bid0001/states/batch-failed/jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Acquire Notify Job

Create Lock

/jobs/jid0001/lock: #ephemeral

State Description

If a callback has been configured in a collection profile, the callback will be invoked for the job. As the status of the job is changed to “completed”, the batch object for the job will be notified of the update (potentially via a Zookeeper “Watcher”). This will allow the batch to determine if the entire batch has been completed.

Job: Notify –> Completed

  status: completed
  last_successful_status: notify
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/notify/10-jid0001:
/jobs/states/completed/10-jid0001: #no data - acts as a reference
# DELETE /batches/bid0001/states/batch-processing/jid0001:
/batches/bid0001/states/batch-completed/jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Job: Notify –> Failed

In the event of a callback failure, the job will go to a Failed state.

  status: failed
  last_successful_status: recording # retain_prior_value
  last_modification_date: now
  retry_count: 0 # no change
# DELETE /jobs/states/notify/10-jid0001:
/jobs/states/failed/10-jid0001: #no data - acts as a reference
# DELETE /batches/bid0001/states/batch-processing/jid0001:
/batches/bid0001/states/batch-failed/jid0001: #no data - acts as a reference
# DELETE /jobs/jid0001/lock

Resuming failed jobs

The failed job will be resumed via an admin action. The resumed job will restart at an appropriate state based on the “last_successful_status”.
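
The resume rule can be expressed as a lookup from the last successful status to the state that follows it. The sketch below is derived from the failure transitions in this document; the function name and the dict-based status record are assumptions for illustration. A job that failed directly from Pending has no last successful status and is rejected, matching the rule that such failures are not recoverable.

```python
# State to re-enter, keyed by the last status that completed successfully.
RESUME_STATE = {
    "provisioning": "downloading",
    "downloading": "processing",
    "processing": "recording",
    "recording": "notify",
}


def resume_failed_job(status):
    """Return the updated status record for an administratively resumed job."""
    if status["status"] != "failed":
        raise ValueError("only failed jobs can be resumed")
    last_ok = status["last_successful_status"]
    if last_ok not in RESUME_STATE:
        raise ValueError("job is not recoverable; a new submission is required")
    resumed = dict(status)
    resumed["status"] = RESUME_STATE[last_ok]
    resumed["retry_count"] = status["retry_count"] + 1
    return resumed


resumed = resume_failed_job(
    {"status": "failed", "last_successful_status": "provisioning", "retry_count": 0}
)
```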

Job: Failed –> Downloading (Admin Action)

  status: downloading
  last_successful_status: provisioning # no change
  last_modification_date: now
  retry_count: 1 # increment by 1
# DELETE /jobs/states/failed/10-jid0001:
/jobs/states/downloading/10-jid0001: #no data - acts as a reference
# DELETE /batches/bid0001/states/batch-failed/jid0001:
/batches/bid0001/states/batch-processing/jid0001: #no data - acts as a reference

Job: Failed –> Processing (Admin Action)

  status: processing
  last_successful_status: downloading # no change
  last_modification_date: now
  retry_count: 1 # increment by 1
# DELETE /jobs/states/failed/10-jid0001:
/jobs/states/processing/10-jid0001: #no data - acts as a reference
# DELETE /batches/bid0001/states/batch-failed/jid0001:
/batches/bid0001/states/batch-processing/jid0001: #no data - acts as a reference

Job: Failed –> Recording (Admin Action)

  status: recording
  last_successful_status: processing # no change
  last_modification_date: now
  retry_count: 1 # increment by 1
# DELETE /jobs/states/failed/10-jid0001:
/jobs/states/recording/10-jid0001: #no data - acts as a reference
# DELETE /batches/bid0001/states/batch-failed/jid0001:
/batches/bid0001/states/batch-processing/jid0001: #no data - acts as a reference

Job: Failed –> Notify (Admin Action)

  status: notify
  last_successful_status: recording # no change
  last_modification_date: now
  retry_count: 1 # increment by 1
# DELETE /jobs/states/failed/10-jid0001:
/jobs/states/notify/10-jid0001: #no data - acts as a reference
# DELETE /batches/bid0001/states/batch-failed/jid0001:
/batches/bid0001/states/batch-processing/jid0001: #no data - acts as a reference

Job: Completed –> DELETED (Automated Task)

Upon completion of the job, the job’s ZFS working directory (producer AND system) can be deleted.

Other job-related data will be retained in zookeeper to facilitate reporting.

# DELETE /jobs/states/completed/10-jid0001:

Job: Failed –> DELETED (Admin Action)

If the batch is not yet completed, confirm that the user understands that job deletion will prevent notification of job-related information.

Upon deletion of a failed job, the job’s zookeeper nodes and the ZFS working directory can be deleted.

# DELETE /jobs/jid0001/configuration:
# DELETE /jobs/jid0001/status: 
# DELETE /jobs/jid0001/priority: 
# DELETE /jobs/jid0001/ark: 
# DELETE /jobs/states/failed/10-jid0001:
# DELETE /batches/bid0001/states/batch-failed/jid0001:

Job: Held –> DELETED (Admin Action)

If the batch is not yet completed, confirm that the user understands that job deletion will prevent notification of job-related information.

Upon deletion of a held job, the job’s zookeeper nodes and the ZFS working directory can be deleted.

# DELETE /jobs/jid0001/configuration:
# DELETE /jobs/jid0001/status: 
# DELETE /jobs/jid0001/priority: 
# DELETE /jobs/jid0001/ark: 
# DELETE /jobs/states/held/05-jid0001:
# DELETE /batches/bid0001/states/batch-processing/jid0001:

Batch: Processing –> Reporting (Automated by event)

Once the last job for a batch has either failed or completed, the batch will move to a reporting step.

# NOTE the absence of /batches/bid0001/states/batch-processing/*:
# NOTE check for the presence of /batches/bid0001/states/batch-failed/*:
# NOTE check for the presence of /batches/bid0001/states/batch-completed/*:
  status: reporting
  last_modified: now
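
The three checks noted above amount to inspecting the children of the batch's state nodes. A minimal sketch follows, assuming the caller has already listed those children into a dict keyed by state name; the function name and return shape are hypothetical.

```python
def batch_job_summary(children_by_state):
    """Summarize a batch from the children of /batches/BID/states/*.

    The batch may move to reporting only when batch-processing is empty.
    """
    processing = children_by_state.get("batch-processing", [])
    return {
        "ready_for_reporting": not processing,
        "failed": sorted(children_by_state.get("batch-failed", [])),
        "completed": sorted(children_by_state.get("batch-completed", [])),
    }


still_running = batch_job_summary({
    "batch-processing": ["jid0003"],
    "batch-completed": ["jid0001"],
    "batch-failed": ["jid0002"],
})
finished = batch_job_summary({
    "batch-processing": [],
    "batch-completed": ["jid0003", "jid0001"],
    "batch-failed": ["jid0002"],
})
```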

Batch: Acquire Reporting Batch

Create Lock

/batches/bid0001/lock: #ephemeral

State Description

The reporting phase will gather a list of completed jobs for a batch and failed jobs for a batch.
This will be compiled into a report for the depositor.

The list of failed jobs should be saved to a zookeeper node so their status can be re-evaluated for a subsequent report.


  last_modified: now
  # array of jids
  # array of jids

Batch: Reporting –> Completed

  status: completed
  last_modified: now
# DELETE /batches/bid0001/lock

Batch: Reporting –> Failed

The reporting phase will gather a list of completed jobs for a batch and failed jobs for a batch.
This will be compiled into a report for the depositor.

  status: failed
  last_modified: now
# DELETE /batches/bid0001/lock

Batch: Failed –> UpdateReporting (Admin Action)

This status change will be triggered by an administrative action. This action indicates that attempts to troubleshoot failed jobs for a batch have concluded.

  status: update-reporting
  last_modified: now

Batch: Acquire Update Reporting Batch

Create Lock

/batches/bid0001/lock: #ephemeral

State Description

A subsequent report will be sent to the depositor indicating jobs that succeeded since the last report was sent.

It might make sense to also indicate the jobs that were not resolved since the prior report was sent.


  last_modified: now
  # array of jids
  # array of jids

Batch: UpdateReporting –> Completed

A subsequent report will be sent to the depositor indicating jobs that succeeded since the last report was sent.

  status: completed
  last_modified: now
# DELETE /batches/bid0001/lock

Batch: UpdateReporting –> Failed

  status: failed
  last_modified: now
# DELETE /batches/bid0001/lock

Batch: Failed –> DELETED (Admin Action)

An administrative action will trigger the deletion of a failed batch (and any outstanding jobs for that batch).

This action should only be taken once all attempts at job recovery have been exhausted.

# DELETE /batches/bid0001/status: 
# DELETE /batches/bid0001/status-report: 
# DELETE /batches/bid0001/submission:
# for every JID in /batches/bid0001/states/batch-*/*:
#   DELETE /batches/bid0001/states/batch-*/JID:
#   DELETE /jobs/states/STATE/*-JID if present
#   DELETE /jobs/JID/configuration:
#   DELETE /jobs/JID/status: 
#   DELETE /jobs/JID/priority: 
#   DELETE /jobs/JID/ark: 

Batch: Held –> DELETED (Admin Action)

An administrative action will trigger the deletion of a held batch.

Execute this step with caution since the depositor will not be notified of this action.

# DELETE /batches/bid0001/status: 
# DELETE /batches/bid0001/status-report: 
# DELETE /batches/bid0001/submission:
# for every JID in /batches/bid0001/states/batch-*/*:
#   DELETE /batches/bid0001/states/batch-*/JID:
#   DELETE /jobs/states/STATE/*-JID if present
#   DELETE /jobs/JID/configuration:
#   DELETE /jobs/JID/status: 
#   DELETE /jobs/JID/priority: 
#   DELETE /jobs/JID/ark: 

Batch: Completed –> Automatic Cleanup

Clean up the remnants of a properly completed batch.

# DELETE /batches/bid0001/status: 
# DELETE /batches/bid0001/status-report: 
# DELETE /batches/bid0001/submission:
# for every JID in /batches/bid0001/states/batch-completed/*:
#   DELETE /batches/bid0001/states/batch-completed/JID:
#   DELETE /jobs/states/STATE/*-JID if present
#   DELETE /jobs/JID/configuration:
#   DELETE /jobs/JID/status: 
#   DELETE /jobs/JID/priority: 
#   DELETE /jobs/JID/ark:
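
The cleanup listing above can be generated rather than written by hand. The sketch below builds the list of node paths to delete for a completed batch. The function name is hypothetical, and the `/jobs/states/STATE/*-JID` reference is left out because finding it requires a lookup against the live queue.

```python
def batch_cleanup_paths(batch_id, completed_jids):
    """Paths to delete when cleaning up a properly completed batch."""
    paths = [
        f"/batches/{batch_id}/{leaf}"
        for leaf in ("status", "status-report", "submission")
    ]
    for jid in completed_jids:
        paths.append(f"/batches/{batch_id}/states/batch-completed/{jid}")
        paths.extend(
            f"/jobs/{jid}/{leaf}"
            for leaf in ("configuration", "status", "priority", "ark")
        )
    return paths


paths = batch_cleanup_paths("bid0001", ["jid0001"])
```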