Watcher and Restart Job Mechanism
Async job executors typically execute long-running tasks and/or use external services to execute jobs.
If an outage occurs, the whole microservice may stall or stop entirely while jobs are running. These jobs are managed by a watcher. Every job executor runs a watcher unless the watcher is disabled. This scheduled thread watches all running jobs (across all job executors) to identify the following scenarios:
QUEUED jobs may be considered “stalled” if they have not moved to a RUNNING state before the timeout deadline. A stalled QUEUED job can be taken by another job executor with an available slot:
- Any job executor microservice can take the job and move it to a RUNNING state.
- Any new job executor microservice can take the job and move it to a RUNNING state when it starts up.
RUNNING jobs may be considered “stalled” if they have not updated their progress before the timeout deadline. If a running job stalls, it is moved to a TIMED_OUT state.
TIMED_OUT jobs can be taken by any other job executor microservice with an available slot. If a job is taken by another job executor, it is moved to a RUNNING state.
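The QUEUED and RUNNING checks above can be sketched as a single watcher pass. This is a minimal illustration, not the product's implementation: the names Job, JobState, watch_once, last_update, timeout and free_slots are all hypothetical, and a single timeout value is assumed for both states.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    TIMED_OUT = "TIMED_OUT"


@dataclass
class Job:
    job_id: str
    state: JobState
    last_update: float  # hypothetical: time (seconds) of the last state change or progress update


def watch_once(jobs, now, timeout, free_slots):
    """Run one watcher pass and return the IDs of the jobs this executor claimed."""
    claimed = []
    for job in jobs:
        stalled = now - job.last_update > timeout
        if job.state is JobState.RUNNING and stalled:
            # A RUNNING job with no progress before the deadline is timed out.
            job.state = JobState.TIMED_OUT
            job.last_update = now
        elif job.state is JobState.QUEUED and stalled and free_slots > 0:
            # A stalled QUEUED job is taken by an executor with an available slot.
            job.state = JobState.RUNNING
            job.last_update = now
            free_slots -= 1
            claimed.append(job.job_id)
    return claimed
```

In this sketch a stalled QUEUED job stays queued when no slot is free, so a later pass (or another executor) can still pick it up.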
Timed Out Jobs
TIMED_OUT jobs are moved back to a RUNNING state if the original job execution progresses.
TIMED_OUT jobs may be considered “stalled” if they still have not updated their progress by the second timeout deadline. When this happens, the resume() method of the same plugin is called:
- If resume() is implemented and the job progresses, the job is moved to a RUNNING state.
- If resume() is not implemented or the job has not progressed, the job is marked as FAILED.
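The TIMED_OUT rules above amount to a small decision function. The sketch below is only an illustration under stated assumptions: resolve_timed_out, ResumablePlugin and LegacyPlugin are hypothetical names, and resume() is assumed to return a truthy value when the job has progressed, which the text above does not specify.

```python
class ResumablePlugin:
    """Hypothetical plugin that implements resume()."""

    def resume(self):
        # Assumption: a truthy return value means the job progressed.
        return True


class LegacyPlugin:
    """Hypothetical plugin that does not implement resume()."""


def resolve_timed_out(progressed, stalled_again, plugin):
    """Return the next state name for a job that is currently TIMED_OUT."""
    if progressed:
        # The original execution reported progress, so the job is running again.
        return "RUNNING"
    if not stalled_again:
        # The second deadline has not passed yet; keep waiting.
        return "TIMED_OUT"
    resume = getattr(plugin, "resume", None)
    if callable(resume) and resume():
        # resume() is implemented and the job progressed.
        return "RUNNING"
    # resume() is missing, or the job made no progress.
    return "FAILED"
```

For example, resolve_timed_out(False, True, LegacyPlugin()) yields "FAILED", while the same call with ResumablePlugin() yields "RUNNING".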