
Scheduled vs. Event-Driven Ingestion: A 2026 Comparison for Data-for-AI Pipelines

The decision is rarely 'which is faster' and almost always 'which one fails more gracefully when the source goes down at 3am.'

Reid Spachman · 6 min read
TL;DR
  • Scheduled ingestion is honest about latency and easy to monitor.
  • Event-driven ingestion is faster and harder to debug when a webhook is lost.
  • Most real pipelines need both. Start scheduled; add event-driven paths for the sources that require it.
  • The hard problem is not 'when does the data move' but 'what does the operator see when it does not.'

Scheduled ingestion is a data-pipeline architecture in which a scheduler (cron, a systemd timer, APScheduler, Airflow) fires every N minutes and pulls from the source. Event-driven ingestion is the inverse: the source pushes new records into a webhook or a queue (Kafka, Redpanda, AWS EventBridge), and the pipeline ingests them in near-real-time. The standard framing in most data-engineering writeups lands in one place: event-driven is "more modern," scheduled is "simpler with higher latency."
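
The mechanics of the scheduled half fit in a few lines. A minimal sketch using APScheduler 3.x's interval trigger, one of the schedulers named above; pull_source is a placeholder connector, not a real one.

```python
# Scheduled ingestion in miniature: an in-process scheduler fires every N minutes
# and pulls from the source. APScheduler 3.x API; pull_source is a placeholder.
from apscheduler.schedulers.blocking import BlockingScheduler

def pull_source():
    print("polling the upstream source for new records...")

scheduler = BlockingScheduler()
scheduler.add_job(pull_source, "interval", minutes=15)
scheduler.start()  # blocks and fires pull_source every 15 minutes
```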

That standard framing is wrong by omission. Scheduled and event-driven are not points on a latency continuum. They are different operational disciplines. Picking between them depends less on how fast the source publishes and more on how the operator wants to see failures.

What is the difference in failure mode?

A scheduled poll has a known failure mode. The job fires on schedule. If the upstream source is down, the job logs an error, the audit row finalizes with status='error', the dashboard shows a red row for that source, and the next cycle picks up whatever is available. Operators see degraded ingestion within one polling interval. The system stays diagnostic-honest. Recovery is automatic on the next fire.
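
To make the audit shape concrete, here is a minimal sketch of one scheduled fire, assuming an ingest_audit table with roughly the columns described later in the FAQ. The table, the connector, and the function names are illustrative, not Diagest's actual schema.

```python
# One scheduled fire: fetch, then finalize exactly one audit row, success or not.
# Illustrative names only; fetch_source stands in for a real connector.
import sqlite3
from datetime import datetime, timezone

def fetch_source(source_id: str) -> list[dict]:
    """Placeholder connector; raises when the upstream source is down."""
    raise ConnectionError(f"{source_id} returned 404")

def run_cycle(db: sqlite3.Connection, source_id: str) -> None:
    started = datetime.now(timezone.utc).isoformat()
    try:
        records = fetch_source(source_id)
        status, error, fetched = "ok", None, len(records)
    except Exception as exc:
        # The known failure mode: log it, finalize the row, wait for the next fire.
        status, error, fetched = "error", str(exc), 0
    db.execute(
        "INSERT INTO ingest_audit "
        "(source_id, started_at, completed_at, status, docs_fetched, error_message) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (source_id, started, datetime.now(timezone.utc).isoformat(), status, fetched, error),
    )
    db.commit()

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ingest_audit (source_id TEXT, started_at TEXT, completed_at TEXT, "
    "status TEXT, docs_fetched INTEGER, error_message TEXT)"
)
run_cycle(db, "press-releases-rss")  # the dashboard now shows one red row for this source
```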

An event-driven feed has a different failure mode. The upstream source publishes. The webhook fires. Something between the publish and the queue drops the event, or the queue accepts it and the worker silently fails to process it. The dashboard shows nothing wrong. The pipeline thinks it is healthy because no events have arrived, which from inside the pipeline is identical to no events being published. The first signal a downstream consumer gets is that the model is answering with stale context. The investigation starts hours or days later, depending on how often someone hand-checks the source.

The latency difference between the two architectures, in the happy path, is one polling interval. The reliability difference, in the unhappy path, is the time it takes a human to notice a system that thinks it is healthy.

How do scheduled and event-driven compare on operating axes?

| Axis | Scheduled | Event-driven |
| --- | --- | --- |
| Latency (happy path) | One polling interval | Sub-second |
| Latency (failure path) | One polling interval to next retry | Time until a human notices |
| Per-fire visibility | One audit row per cycle, always | None unless the worker logs explicitly |
| Blast radius on source failure | Bounded to that source's job | Can poison the queue |
| Retry semantics | Free: next cycle is the retry | Requires DLQ + retry policy |
| Backfill code path | Same connector against date cursor | Separate code path |
| Cost model | Cycles × throughput cap | Worker fleet × event volume |
| Best for | RSS, press releases, daily macro, paper feeds | Exchange data, regulatory tape, signed webhooks |
| Reference tooling | APScheduler, Airflow, Dagster, Prefect | Kafka, Redpanda, EventBridge, Pub/Sub |

What does scheduled ingestion give you?

Scheduled ingestion has five operating properties that matter in production:

Visibility per fire. Every cycle generates exactly one audit row. The dashboard knows when the source last fired, what status it returned, and how many records survived to the sink. The operator never wonders whether the pipeline is running. This is the operating model Airflow and Dagster ship by default; lighter in-process schedulers like APScheduler can reproduce the same audit shape at smaller scale with a few lines of bookkeeping.

Bounded blast radius on source failure. If a source returns a 404, the job catches it, marks the run as errored, and the scheduler moves to the next source. One bad upstream does not poison the pipeline.

Easy retries. The retry for a failed fire is, by definition, the next scheduled fire one cycle later. Retries are free.

Trivial backfill semantics. A backfill is a one-shot fire of the same connector against a date cursor. The pipeline does not need a separate code path for catch-up work.

Predictable cost. The number of cycles per day is known up front. The throughput cap per cycle is configurable. Capacity planning fits on a back-of-envelope spreadsheet.
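
The backfill property is worth seeing in code: the same connector, fired once with an explicit date cursor instead of "since the last successful run." A minimal sketch; connector and load_cursor are illustrative names, not a real API.

```python
# Backfill as a one-shot fire of the same connector against a date cursor.
# All names are illustrative.
from datetime import datetime, timezone

def connector(source_id: str, since: datetime) -> list[dict]:
    """Placeholder: pull everything the source published after `since`."""
    return []

def load_cursor(source_id: str) -> datetime:
    """Placeholder: read the last successful completed_at from the audit table."""
    return datetime(2026, 1, 1, tzinfo=timezone.utc)

def scheduled_fire(source_id: str) -> list[dict]:
    return connector(source_id, since=load_cursor(source_id))

def backfill(source_id: str, since: datetime) -> list[dict]:
    # Same code path as the scheduled fire, explicit cursor: no separate catch-up pipeline.
    return connector(source_id, since=since)
```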

The thing scheduled does not give you is sub-minute latency. If the upstream source publishes at 14:00:00 UTC and the pipeline fires at 14:15:00 UTC, the system has a fifteen-minute window where the data exists upstream but not yet downstream. For most sources, that window is invisible to consumers.

What does event-driven ingestion give you?

Event-driven ingestion is the right choice for a small, identifiable subset of sources. The characteristics that justify it:

Hard publication time. The source publishes at a known moment and downstream consumers act in the same moment. Equity market data, regulatory filings the second they cross the tape, exchange feeds. The fifteen-minute window costs real money.

Reliable push semantics. The source operates a webhook that retries on failure, with delivery receipts and an audit log on their side as well as yours. The SEC's EDGAR push API, Stripe's webhook delivery model, and GitHub's webhook signing are examples of source-side discipline that makes event-driven safe. Without retry-with-receipt, the pipeline relies on hope.
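
On the receiving side, receipt discipline looks like verifying the signature before accepting the event. A sketch of GitHub-style HMAC-SHA256 verification; the signing format follows GitHub's documented scheme, while the surrounding handler is illustrative.

```python
# Verify a GitHub-style webhook signature (X-Hub-Signature-256 header) before
# treating the event as delivered. Reject anything that does not verify.
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """signature_header looks like 'sha256=<hex digest>' computed over the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison so the check itself does not leak the secret.
    return hmac.compare_digest(expected, signature_header)
```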

Idempotent handlers downstream. If the source double-publishes (and it eventually will), the pipeline tolerates duplicates without producing a duplicate downstream artifact. This is a core property of event-streaming platforms; the existence of Kafka's exactly-once semantics documentation and Pub/Sub's idempotent-processing guidance signals how non-trivial the problem is in practice.
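
A minimal sketch of that property, using a unique constraint on the upstream event id so a double-publish becomes a no-op. Table and column names are illustrative.

```python
# Idempotent handling via a primary-key constraint on the upstream event id:
# the second delivery of the same event inserts nothing and produces no artifact.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def handle_event(event_id: str, payload: str) -> bool:
    """Returns True if the event was new, False if it was a duplicate."""
    cur = db.execute(
        "INSERT OR IGNORE INTO events (event_id, payload) VALUES (?, ?)",
        (event_id, payload),
    )
    db.commit()
    return cur.rowcount == 1  # 0 means the id already existed; the duplicate is dropped

handle_event("evt_123", '{"ticker": "ACME"}')  # new: processed
handle_event("evt_123", '{"ticker": "ACME"}')  # duplicate: silently ignored
```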

For everything else, scheduled is the operating-discipline default. RSS feeds, press releases, regulatory release calendars, blog posts, paper feeds, daily macro releases. The fifteen-minute or four-hour latency cost is nothing compared to the operator visibility you give up by going event-driven.

Why do most teams pick the wrong one first?

The default failure mode in a data-engineering org is to design event-driven first. The architecture is "more modern." The latency story sells better. The first version of the pipeline ships with webhooks, queues, and a worker pool sized for peak load. Six weeks in, the team starts adding scheduled fallback jobs because event delivery is unreliable. Six months in, the scheduled fallback jobs are doing most of the work and the webhooks exist as a low-latency optimization for the few sources that matter. The pipeline ends up scheduled-plus-event, and the team paid a six-month tax to land there.

The cleaner starting position: scheduled by default. Tag specific sources as event-driven when the source requires it. Add the queue and the worker pool only for those sources. Let the rest run on a polling interval.
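
In config terms the starting position can be as small as a per-source mode flag. A sketch with illustrative source names and fields:

```python
# Scheduled by default; event-driven only where tagged. Only the tagged sources
# get a queue, a worker pool, and a DLQ. All names and fields are illustrative.
SOURCES = {
    "press-releases-rss": {"mode": "scheduled", "interval_minutes": 15},
    "daily-macro":        {"mode": "scheduled", "interval_minutes": 240},
    "exchange-feed":      {"mode": "event", "queue": "exchange-ticks",
                           "fallback_interval_minutes": 60},
}

def event_sources(config: dict) -> list[str]:
    """The short list of sources that justify the event-driven machinery."""
    return [name for name, cfg in config.items() if cfg["mode"] == "event"]
```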

The shape that tends to age well: a per-source scheduled cadence for the long tail of feeds, plus a small set of explicitly event-driven paths for the handful of sources where sub-minute latency is load-bearing. The scheduled side gets first-class observability. The event-driven side gets a separate code path scoped to its specific source, so a failure in one event-driven feed does not propagate into the rest of the pipeline.

Diagest is the data-for-AI pipeline we are building at ixprt along these lines. The public health surface for the pipeline lives at diagest.ixprt.com/pipeline and is one place to see the discipline applied to real corpora.

What is the minimum honest checklist before picking event-driven?

Before committing to event-driven for a new source, the operator answers four questions:

  1. Does the source publish on a hard timestamp where downstream consumers will act in the same minute? (If no, go scheduled.)
  2. Does the source operate a retry-with-receipt webhook with their own delivery log? (If no, go scheduled, or keep the webhook but back it with a polling fallback.)
  3. Will the worker handle duplicate publishes without producing a duplicate downstream artifact? (If no, fix dedup before going event-driven.)
  4. When the webhook fails silently, will the operator see it within one polling interval? (If no, the system is not observable enough to be event-driven.)

If all four answers are yes, ship event-driven for that source. If any answer is no, ship scheduled and revisit the source's profile after thirty days of operating data.
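
Operationally the checklist reduces to a per-source gate, run before any queue or worker pool is provisioned. A minimal sketch with illustrative parameter names:

```python
# The four-question gate: event-driven only when every answer is yes.
def choose_mode(hard_publication_time: bool,
                retry_with_receipt_webhook: bool,
                idempotent_handler: bool,
                silent_failure_visible: bool) -> str:
    answers = (hard_publication_time, retry_with_receipt_webhook,
               idempotent_handler, silent_failure_visible)
    return "event" if all(answers) else "scheduled"

choose_mode(True, True, True, False)  # -> "scheduled": not observable enough yet
```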

What is the right ingestion default for a 2026 data pipeline?

The choice is rarely between fast and slow. The choice is between two operating disciplines. Scheduled keeps the operator informed at the cost of latency. Event-driven keeps latency low at the cost of an observability surface the operator has to build before the architecture is honest about itself. Most teams overship the event-driven architecture and underbuild that observability surface, then spend two quarters undoing the mistake.

Start scheduled. Tag the events.

Diagest is the data-for-AI pipeline we are building toward at ixprt. The public dashboard is at diagest.ixprt.com/pipeline. For the broader architecture see the Diagest product page, and for the case for publishing pipeline health at all see Why Retrieval Drift Goes Undetected: The Case for Public Pipeline Dashboards.

Frequently asked

Is event-driven ingestion always lower-latency than scheduled polling?

In the happy path, yes. In the failure path, scheduled ingestion catches the next cycle. Event-driven can drop a release silently until someone notices the downstream model is stale.

Which data sources need event-driven ingestion?

Anything with a hard publication timestamp where downstream consumers will act in the same minute: regulatory releases, exchange feeds, market-data updates. Almost nothing on a press-release RSS feed needs second-level latency.

How do you monitor a scheduled ingestion pipeline in production?

One audit table, one row per fire, with status, started-at, completed-at, docs-fetched, docs-after-dedup, and an error-message column. Read it on the dashboard. Alert when status='error' for the same source twice in a row.
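
The two-in-a-row alert can be a single query over that audit table. A sketch against the illustrative ingest_audit schema used in the earlier example:

```python
# Flag any source whose two most recent audit rows are both status='error'.
# Assumes the illustrative ingest_audit table from the earlier sketch.
import sqlite3

def sources_to_alert(db: sqlite3.Connection) -> list[str]:
    rows = db.execute("""
        SELECT source_id FROM (
            SELECT source_id, status,
                   ROW_NUMBER() OVER (PARTITION BY source_id ORDER BY started_at DESC) AS rn
            FROM ingest_audit
        )
        WHERE rn <= 2
        GROUP BY source_id
        HAVING COUNT(*) = 2 AND SUM(status = 'error') = 2
    """).fetchall()
    return [r[0] for r in rows]
```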

Founder at ixprt. Building Diagest, AssetModel, and DailyWallStreet. Based in New York.
