
Per-Corpus Schedule Modes: Why Ingestion Settings Belong on the Data Source, Not the Pipeline

A pipeline-wide ingestion mode is a category error. Each corpus has a different cost, cadence, and failure profile, and the schedule setting belongs on the corpus row, surfaced in the audit view and the public dashboard.

Reid Spachman
TL;DR
  • A pipeline-wide ingestion setting forces every source into the same cron, which lies on the dashboard the moment one source is on a different cadence.
  • The four useful modes are scheduled, manual, paused, and on-demand. Each one shapes how the dashboard reads and how a mature alert rule would be written.
  • Store the mode on the corpus row, surface it in the audit view, and gate the scheduler on it. A single SQL join then drives the dashboard view and any downstream report that needs to know per-corpus operating posture.
  • Add an invariant test: every corpus marked scheduled must have a matching scheduler entry. That keeps stuck rows out of the dashboard.

A data-for-AI ingestion pipeline usually starts with one schedule. A cron fires every thirty minutes. Every configured source runs. The dashboard shows the latest run. The pipeline works.

Then the second source arrives, with a different shape. A weekly PDF release that nobody pulls until a human asks for it. A REST API that costs a dollar a call. A streaming feed that is always on. The single cron is now wrong for at least one of them, and the dashboard reflects that mismatch.

The cleanest fix is to move the schedule setting from the pipeline onto the corpus.

Why pipeline-wide modes fail in production

A pipeline-wide schedule is a single setting that applies to every corpus the pipeline knows about. It looks tidy in configuration. It fails in three ways once a system runs more than two corpora:

The cadence is wrong for at least one source. A cron that fires every thirty minutes is too aggressive for a quarterly SEC filing index and too slow for a real-time event stream. Stretched in either direction, the pipeline either burns API quota on a source that never changes or drops events on a source that does.

The dashboard cannot tell paused from broken. A corpus that has not produced a run in two days is either intentionally on hold or quietly failing. With a pipeline-wide schedule there is no place to record which one. The dashboard surfaces a stale timestamp; the operator has to remember.

The alert rule is unwritable. An on-call rule like "page if a corpus has missed its expected interval" needs to know what the expected interval is. A pipeline-wide schedule says they are all the same. The on-call rule then fires on every quarterly source between releases.

The four useful modes

A working set of operating modes for a production data-for-AI pipeline:

| Mode | What it means | How a mature alert rule would read | Typical sources |
| --- | --- | --- | --- |
| scheduled | The scheduler runs this corpus on a fixed cadence. | Last run older than the cadence plus grace. | RSS feeds, daily Fed releases, hourly market data. |
| manual | An operator triggers this corpus when source data is ready. | Never "on time"; errors only. | Quarterly filings, ad-hoc PDF batches, vendor drops. |
| paused | The scheduler skips this corpus until further notice. | Paused for longer than a week. | A source under vendor maintenance, a deprecated feed. |
| on-demand | An external system fires the corpus on its own cadence. | External call has not arrived in its expected window. | Webhook deliveries, CDC slots, agent-driven backfills. |

The right-hand column is design guidance, not a feature claim. Alerts are downstream of the mode; the mode is what makes them writable in the first place.

The four modes cover almost every real source. A corpus is rarely ambiguous. If it is, the right call is to write down the assumption and move on; the audit table will surface the mismatch the first time the source behaves differently.
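In application code, the closed set of modes is worth encoding once rather than scattering string literals. A minimal sketch, assuming the values match the database's CHECK constraint (the class name is an invention here):

```python
from enum import Enum

class ScheduleMode(str, Enum):
    """The four operating modes; values mirror the schedule_mode column."""
    SCHEDULED = "scheduled"
    MANUAL = "manual"
    PAUSED = "paused"
    ON_DEMAND = "on-demand"
```

The database remains the source of truth; the enum just gives the scheduler and dashboard code one place to validate against.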

How to store the mode

The mode lives on the corpus row, not in YAML.

ALTER TABLE corpora
  ADD COLUMN schedule_mode TEXT NOT NULL DEFAULT 'manual'
  CHECK (schedule_mode IN ('scheduled', 'manual', 'paused', 'on-demand'));

Three things follow from putting the mode in the database:

The audit view gets the mode with a join. Every ingestion-run row joins back to the corpus, so the schedule mode is one query away from any historical run without a separate snapshot. The dashboard renders the run alongside the operating mode the corpus is in today.

The dashboard becomes a single SQL view. Per-corpus health rolls up from the audit table, joined to the corpus row for the schedule mode. The page renders something like "Scheduled, last run 27m ago, OK" or "Manual, last run 4d ago, OK" without the dashboard layer needing to know anything about cron.
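The single-view claim can be sketched end to end with an in-memory SQLite database. Table and column names here are invented for illustration; the real schema will differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE corpora (
    name TEXT PRIMARY KEY,
    schedule_mode TEXT NOT NULL DEFAULT 'manual'
        CHECK (schedule_mode IN ('scheduled', 'manual', 'paused', 'on-demand'))
);
CREATE TABLE ingestion_runs (
    corpus TEXT NOT NULL REFERENCES corpora(name),
    finished_at TEXT NOT NULL,
    status TEXT NOT NULL
);
-- The dashboard: latest run per corpus, joined to the operating mode.
-- (SQLite's bare-column rule makes status come from the MAX row.)
CREATE VIEW corpus_health AS
SELECT c.name, c.schedule_mode, MAX(r.finished_at) AS last_run, r.status
FROM corpora c LEFT JOIN ingestion_runs r ON r.corpus = c.name
GROUP BY c.name;
""")

conn.executemany("INSERT INTO corpora VALUES (?, ?)",
                 [("fed_releases", "scheduled"), ("sec_filings", "manual")])
conn.executemany("INSERT INTO ingestion_runs VALUES (?, ?, ?)",
                 [("fed_releases", "2025-01-02T10:00:00Z", "ok"),
                  ("fed_releases", "2025-01-02T10:30:00Z", "ok")])

for row in conn.execute("SELECT * FROM corpus_health ORDER BY name"):
    print(row)
```

A corpus with no runs still appears in the view with a NULL last run, which is exactly the "honest zero" the dashboard should render rather than hide.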

The scheduler reads which corpora are eligible from the database. The cron itself still has to know how to fetch each one, and that wiring lives in code; the database column gates which of those wired corpora actually fire. Pausing a corpus becomes a one-row update with no redeploy.
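The gate itself is a few lines. A hedged sketch, with invented fetcher names and an in-memory table standing in for the real corpus table:

```python
import sqlite3

# Wiring lives in code: the cron knows how to fetch each corpus.
FETCHERS = {
    "fed_releases": lambda: "ran fed_releases",
    "sec_filings": lambda: "ran sec_filings",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpora (name TEXT PRIMARY KEY, schedule_mode TEXT NOT NULL)")
conn.executemany("INSERT INTO corpora VALUES (?, ?)",
                 [("fed_releases", "scheduled"), ("sec_filings", "manual")])

def run_cycle() -> list[str]:
    # Only corpora that are both wired in code AND marked scheduled fire.
    eligible = [name for (name,) in conn.execute(
        "SELECT name FROM corpora WHERE schedule_mode = 'scheduled'")
        if name in FETCHERS]
    return [FETCHERS[name]() for name in eligible]
```

Pausing is then literally `UPDATE corpora SET schedule_mode = 'paused' WHERE name = ?` and the next cycle skips the corpus, with no redeploy.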

Invariant tests keep the modes honest

Two failure modes show up the moment per-corpus schedule modes land in production:

A corpus marked scheduled with no matching scheduler entry. The dashboard shows the corpus as scheduled. The scheduler never picks it up. The last-run timestamp stays NULL forever. Anyone reading the page sees a corpus that the system claims is scheduled and has never run.

A corpus marked manual with a scheduler entry pointing at it. The scheduler fires every cycle. The audit table fills with successful runs. The dashboard says the corpus is operator-triggered and is somehow refreshing every fifteen minutes.

One unit test catches both:

def test_schedule_mode_matches_scheduler_wiring():
    # load_corpora_from_seed() returns {corpus_name: schedule_mode} from the
    # seed data; SCHEDULER_ENTRIES is the set of corpora wired into the cron.
    corpora = load_corpora_from_seed()
    for name, mode in corpora.items():
        if mode == "scheduled":
            assert name in SCHEDULER_ENTRIES, f"{name} is scheduled but has no scheduler entry"
        else:
            assert name not in SCHEDULER_ENTRIES, f"{name} is {mode} but has a scheduler entry"

Run the test in CI. Stuck rows do not reach the dashboard.

What it looks like in public

The public Diagest pipeline at diagest.ixprt.com/pipeline renders the schedule mode next to the last-run timestamp for every corpus. A reader can tell at a glance which mode each corpus is in, and how recently the most recent run completed. The audit-row source is the same join described above. There is no separate "ops view" that disagrees with the public page.

The discipline matters more than the column. A pipeline that surfaces its own operating modes — accurately, without needing translation — is one that the team operates with rather than around.

How this fits

Per-corpus schedule modes are a downstream consequence of two earlier decisions: an ingestion-run audit table that records every cycle, and a public dashboard that renders the audit honestly. With both in place, the schedule mode is a single column that turns every other operational question into a one-line query.

For the audit-table framing see Why Retrieval Drift Goes Undetected. For the dashboard discipline see Honest Zeros: Why a New Data-Pipeline Dashboard Should Launch With Empty Cells. For the trigger-shape conversation underneath all of this see Scheduled vs. Event-Driven Ingestion.

Frequently asked

Isn't per-corpus configuration just YAML in a different shape?

The shape is the same. The audit trail is not. A schedule mode that lives on the corpus row can be joined into ingestion-run history, rendered in the public dashboard, and reused by any downstream report that needs to know per-corpus operating posture. A schedule mode that lives in YAML never reaches any of those.

What's the difference between manual and paused?

Manual is the steady state for a corpus that an operator owns end-to-end. Paused is a temporary state for a corpus that is normally scheduled but is currently held back. Mixing them costs you the ability to alert on a paused corpus that has been paused too long.

Should the schedule mode column drive what the scheduler runs?

Yes, as a gate. The scheduler only runs corpora that are both wired in code and marked scheduled in the database. Anything else gets skipped. That single rule keeps the dashboard consistent with reality and prevents perpetually empty cells under a corpus the scheduler does not actually run.

How do you keep the modes from drifting out of sync?

An invariant test in the codebase. Every corpus marked scheduled must have a matching scheduler entry. Run the test in CI. Stuck rows do not reach the dashboard.

Founder at ixprt. Building Diagest, AssetModel, and DailyWallStreet. Based in New York.

Want to skip the work?

Diagest absorbs the parse / clean / dedup / chunk / embed work and hands your AI exactly what it needs.

Contact us now →