Raw data in.
AI-ready data out.

Name: Diagest
Brand: ixprt

A pipeline that ingests at scale, cleans the noise, organizes what remains, and hands AI systems exactly what they need, in the shape and at the quality they need it.

diagest.ixprt.com / pipeline · v1.07 · LIVE

SYS OPERATIONAL · INGEST 240k /s · PROC LAG 14 ms · WORKERS 128/128 · GPU 14×H100 · MODELS 9 LOADED · QUEUE 1,402 · REGION US-EAST-1 · UTC 14:32:08.117 · EVAL GREEN

THROUGHPUT · 240k /s · 7d avg 215k

VOLUME · 24H · 1.6 TB · +8.2% vs 7d

SCHEMA VALID · 99.42% · drift 0.04

NOISE REMOVED · −38.4% · post-dedup + filter

AI-READY OUT · 982 M vectors / day

P99 LAT · 42 ms · SLO 80

$ / GB · $0.014 · unit cost

Source connectors · 11 active

S3 · stream · 1.2 TB

HTTP API · poll · 340 M

Webhooks · push · 4.1 M

PDF / DOCX · batch · 82 M

Postgres · cdc · 19 M

Kafka · stream · 208 M

RSS / feeds · poll · 880 K

Email IMAP · batch · 14 K

Snowflake · batch · —

Pipeline DAG · 8 stages · live throughput

Live event log · ●

DEDUP · s3-bucket-a · 142,803 near-duplicates merged via MinHash · 98.7% · 0s

PARSE · PDF/86 · extracted 4,210 structured rows from 86 reports · 100% · 3s

CHUNK · corpus-7 · tokenized + chunked 12.4k docs · 512t window · 512t · 8s

EMBED · ix-embed-l · 982M vectors written → vector store · 1024d · 12s

DRIFT · schema/v17 · RSS feed-3 dropped column "publishedTs" → backfill · 0.4σ · 31s

EVAL · retrieval · recall@10 = 0.94 on holdout · drift 0.02 · 0.94 · 1m

Embedding space · UMAP proj · 50k sample · 1024 → 2D

3 clusters · 982M pts · silhouette 0.71 · knn k=8

Models & parsers

IX-EMBED-L · v3.2 · 1024d · R@10 0.94

IX-CHUNK · v1.8 · semantic · 512t

IX-DEDUP · v2.1 · MinHash · 256 perm

IX-PARSE-PDF · v4.0 · table-aware

IX-CLASS · v2.6 · 14 labels · F1 0.88

IX-NER · v1.4 · 22 entities

IX-CLEAN · v3.1 · regex + LM rules
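
The IX-DEDUP entry above names MinHash with 256 permutations. As a rough illustration of that technique (not Diagest's actual implementation), the sketch below estimates the Jaccard similarity of two documents from 256-slot MinHash signatures. The 5-word shingles, the seeded BLAKE2b hashing, and every function name here are assumptions made for the example.

```python
import hashlib

NUM_PERM = 256  # mirrors the "256 perm" figure listed for IX-DEDUP


def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word shingles in text (k=5 is an assumed choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed: int, item: str) -> int:
    # One seeded 64-bit hash stands in for each "permutation" (assumed scheme).
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_perm: int = NUM_PERM) -> list:
    """Keep the minimum hash of the shingle set under each seeded hash."""
    items = shingles(text)
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates and merged; the threshold itself is a tuning decision this page does not state.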

AI-ready outputs · 6 sinks

Vector store · qdrant · 982 M · live

Parquet · s3 · 3.2 TB · live

JSONL · s3 · 1.1 TB · live

HF dataset · hub · 122 k · live

REST API · https · 42 ms · live

Snowflake · batch · — · paused

Inferred schema · v17 · corpus-7

// auto-generated · drift 0.04σ
id            string · uuid      // 100% present
source        enum · 11 levels   // 100% present
timestamp     datetime · ISO8601  // 100% present
content       text · < 32k tokens // 99.8% present
embedding     float32[1024]      // 100% present
labels        array<enum>         // 88% present · 14 levels
entities      array<span>         // 76% present · 22 types
quality       float · [0,1]       // 100% present · μ 0.91
lineage       array<ref>          // 100% present
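
To make the schema above concrete, here is a minimal sketch of how a consumer might model one record and check it on read. The field names and the 1024-dimension embedding come from the listing; the `Record` class, the `validate` function, and the decision to treat `content`, `labels`, and `entities` as optional (since they are not 100% present) are assumptions for illustration, not a published Diagest client API.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

EMBED_DIM = 1024  # from "embedding float32[1024]" in the listing


@dataclass
class Record:
    id: str                      # uuid string
    source: str                  # one of the 11 source levels (names not listed)
    timestamp: str               # ISO 8601 datetime
    embedding: list              # EMBED_DIM floats
    quality: float               # expected in [0, 1]
    lineage: list                # provenance references
    content: Optional[str] = None
    labels: list = field(default_factory=list)
    entities: list = field(default_factory=list)


def validate(rec: Record) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    try:
        uuid.UUID(rec.id)
    except ValueError:
        errors.append("id: not a UUID")
    try:
        datetime.fromisoformat(rec.timestamp)
    except ValueError:
        errors.append("timestamp: not ISO 8601")
    if len(rec.embedding) != EMBED_DIM:
        errors.append(f"embedding: expected {EMBED_DIM} dims, got {len(rec.embedding)}")
    if not 0.0 <= rec.quality <= 1.0:
        errors.append("quality: outside [0, 1]")
    if not rec.lineage:
        errors.append("lineage: empty")
    return errors
```

Returning a list of problems rather than raising on the first one makes it easier to count failures per field, which is the kind of figure a schema-validity metric would be built from.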

Quality scorecard · last 24h

Schema validity · 99.4

Completeness · 96.1

Uniqueness (post-dedup) · 99.0

Freshness P95 · 88.0

Embedding coverage · 100

Retrieval recall@10 · 94.0

Drift (ks-stat avg) · 0.14

VECTORS 982,114,206 · DEDUP RECALL 98.7% · CORPUS corpus-7 · CHUNK MEAN 487t · UNIQUE TOKENS 2.41 B · RETRIEVAL R@10 0.94 · SCHEMA DRIFT 0.04σ · CACHE HIT 87.4% · $ / GB $0.014 · REGION US-EAST-1 · SLA 99.97%

Who it's for

Built for teams burning time on data prep.

Diagest replaces the bespoke ETL + cleaning + chunking + embedding work that sits between raw inputs and a model that's actually useful.

AI Labs · Frontier Research

Train on cleaner data

Deduplicated, schema-aligned, drift-monitored corpora. We absorb the messy work so your team focuses on architecture, not pipelines.

Enterprise Data Teams

One retrieval-ready surface

Bring private documents, databases, and event streams into a single retrieval surface. Vector store, Parquet, JSONL, or REST — your call.

Research Organizations

Convert archives to corpora

PDF/DOCX archives, RSS feeds, and historical APIs become AI-queryable, with provenance, quality scoring, and retrieval-recall measurement built in.

Custom AI Products

Stop burning weeks on prep

If you're building an AI product and burning weeks on data prep, that's exactly the work Diagest replaces.

How it works

From any source to any AI-ready sink.

01 · Connect

Any source

S3, HTTP APIs, webhooks, PDF/DOCX, Postgres CDC, Kafka, RSS, email. New connectors on request.

02 · Process

8-stage pipeline

Ingest → parse → clean → validate → dedup → filter → chunk → embed. Quality + drift tracked end-to-end.
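
The eight stages above can be pictured as a chain of generators, each consuming the previous stage's stream. The sketch below is a toy stand-in under stated assumptions: the record shape, the exact-match dedup (where the platform lists MinHash), the whitespace "tokens", and the two-number placeholder embedding are all hypothetical and only show how the stages compose.

```python
import re

CHUNK_TOKENS = 512  # window size listed for IX-CHUNK; tokens here are words


def parse(docs):
    for d in docs:
        yield {**d, "text": d["raw"].strip()}


def clean(docs):
    # Collapse runs of whitespace; real cleaning rules are not described here.
    for d in docs:
        yield {**d, "text": re.sub(r"\s+", " ", d["text"])}


def validate(docs):
    for d in docs:
        if d["text"]:  # drop records with no usable text
            yield d


def dedup(docs):
    seen = set()  # exact-match stand-in for near-duplicate detection
    for d in docs:
        key = d["text"].lower()
        if key not in seen:
            seen.add(key)
            yield d


def filter_noise(docs, min_words=3):
    for d in docs:
        if len(d["text"].split()) >= min_words:
            yield d


def chunk(docs, size=CHUNK_TOKENS):
    for d in docs:
        words = d["text"].split()
        for start in range(0, len(words), size):
            yield {**d, "text": " ".join(words[start:start + size])}


def embed(docs):
    # Placeholder vector (character and word counts), not a real embedding.
    for d in docs:
        text = d["text"]
        yield {**d, "embedding": [float(len(text)), float(len(text.split()))]}


STAGES = [parse, clean, validate, dedup, filter_noise, chunk, embed]


def run(source, stages=STAGES):
    stream = iter(source)  # stage 1: ingest
    for stage in stages:
        stream = stage(stream)
    return list(stream)
```

Because every stage is lazy, records flow through one at a time rather than being materialized between stages, which is one common way to keep memory flat on large inputs.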

03 · Deliver

AI-ready outputs

Vector store, Parquet, JSONL, HF dataset, REST. One schema, full provenance, query-ready.

Get on the platform.

Tell us about your sources, your model, and what AI-ready output you need. We'll come back with a scoped pilot.

Contact us now

FAQ

Frequently asked.

What is Diagest?

A data-for-AI pipeline. It consumes large data volumes from any source, cleans and deduplicates them, filters out noise, and organizes what remains into AI-ready outputs — vectors, Parquet, JSONL, or REST API.

What sources does Diagest ingest?

Object storage (S3 and equivalents), HTTP APIs, webhooks, PDF and DOCX archives, SQL databases via CDC, Kafka streams, RSS feeds, and email mailboxes. New connectors are added on request.

What output formats does Diagest produce?

Vector store (Qdrant-compatible), Parquet on S3, JSONL on S3, Hugging Face datasets, REST API, and on-demand sinks. All outputs share a single schema with provenance and quality scoring.

How does Diagest handle data quality and drift?

Schema validity, completeness, uniqueness, freshness, embedding coverage, retrieval recall, and KS-stat drift are tracked per corpus per 24h. Drift alerts trigger schema-version bumps and backfills.
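
For readers unfamiliar with the KS statistic mentioned above: it is the largest gap between the empirical cumulative distributions of two samples, so 0 means identical distributions and 1 means no overlap. The sketch below computes the two-sample statistic for one numeric field; the `drifted` helper and its 0.2 threshold are hypothetical, since the alerting rule is not specified here.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_ref(x) - F_cur(x)|."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    # The maximum gap between two step functions occurs at a sample point.
    for x in set(ref) | set(cur):
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def drifted(reference, current, threshold=0.2):
    """Hypothetical alert rule: flag drift when the statistic exceeds threshold."""
    return ks_statistic(reference, current) > threshold
```

In practice `reference` would be a field's values from a baseline window and `current` the values from the latest 24h window, with one statistic per field averaged into a corpus-level figure.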

What is the latency profile?

Streaming sources reach AI-ready output in tens of milliseconds. Batch sources are throughput-limited, but a typical corpus usually completes within the hour.

How is pricing structured?

Per-GB processed plus a flat platform fee. Contact us for quotes calibrated to your sources, output sinks, and SLA needs.

From the blog

Notes on what we're building.

Buying Guide

What is Data-for-AI? A Buyer's Guide to the Modern Stack

Vector Store Choices in 2026: Qdrant vs Pinecone vs pgvector vs Weaviate vs Milvus

Why RAG Pipelines Fail: 5 Common Pitfalls
