Why Flowgen
Capture the change. Stream it anywhere, in milliseconds.
Flowgen is an open-source data activation engine. You define flows in YAML or code, run them on one node or many, and data moves between systems in milliseconds. Retries, leader election, backpressure, and circuit breakers are part of the runtime — not something you wire up per pipeline.
The problem we kept hitting
We started Flowgen while running data pipelines at a consumer business with hundreds of millions of customer profiles. Three things kept getting in our way:
Everything was batch. Schedulers fired jobs once a day. Activation systems saw data that was already stale by the time it arrived. The business felt it; we couldn’t fix it without rewriting the foundations.
Reliability was DIY. Retry with backoff, dead-letter paths, idempotency, deduplication, coordination across replicas — all hand-rolled, all slightly different per pipeline. The team spent more time rebuilding the same primitives than shipping flows.
The modern alternatives weren’t actually open. The features we needed lived behind commercial licenses, and multi-node deployment was either an afterthought or a paid tier.
What Flowgen does
- Streams by default, batches when you want. Event-driven, scheduled, and streaming flows use the same model.
- Distributed from day one. Run one replica or many. Leader election, distributed cache, and coordination are built into the runtime.
- Resiliency you don’t write. Retries with jitter, circuit breakers, backpressure, dead-letter routing, leader election — configured in YAML, applied uniformly.
- Real scripting, not a DSL. Embedded Rhai with direct access to the distributed cache, event metadata, and resource files.
- Columnar data stays columnar. Arrow RecordBatch and Avro are first-class event payloads alongside JSON — no per-task serialization tax.
- Rust under the hood. Memory safety, predictable performance, single binary to deploy.
- OpenTelemetry ready. Traces, metrics, and logs flow into whatever you already run.
- Open source, no tiers. MPL-2.0, built and maintained by CONNVE. Every connector, every feature, open. No paid edition.
Architecture
- Flows — YAML definitions containing a sequence of tasks. Event-driven, scheduled (cron/interval), or streaming.
- Tasks — Subscribers that ingest data, and processors that transform, route, query, or write it. All tasks emit events to the next task in the flow.
- Events — Data records flowing between tasks, carrying JSON, Arrow RecordBatch, or Avro payloads.
- Cache — Distributed key-value store for state management, deduplication, and leader election. Pluggable backend, currently NATS JetStream.
flow:
name: webhook_to_nats
tasks:
- http_webhook:
name: ingest
endpoint: /events
method: POST
credentials_path: /etc/http/credentials.json
- script:
name: transform
code: |
let body = event.data;
#{
id: body.id,
type: body.event_type,
received_at: timestamp_now()
}
- nats_jetstream_publisher:
name: publish
credentials_path: /etc/nats/credentials.json
url: nats://localhost:4222
subject: events.normalized
stream:
name: events
subjects:
- events.>
create_or_update: true Integrations
| Integration | Tasks |
|---|---|
| NATS JetStream | Subscriber, Publisher, KV Store |
| Salesforce | PubSub API, REST API, Bulk API, Tooling API |
| Google Cloud | BigQuery Query, Storage Read, Storage Write, Jobs |
| HTTP | Webhook, Request |
| Object Store | Read, Write, List, Move (S3, GCS, Azure, local) |
| MSSQL | Query |
| AI | Completion (multi-provider), AI Gateway (OpenAI-compatible), MCP Tools |
| Git | Git Sync |
| Core | Script (Rhai), Convert, Iterate, Buffer, Generate, Log |