
Motivation. High-quality LLM datasets are scarce, costly, and often sensitive; teams also need fine-grained control over task structure (SFT/PO, tool use, multi-agent, multimodal). In practice, scaling “notebook pipelines” breaks down: you end up hand-wiring branching/looping flows, juggling multiple inference backends/APIs, and doing ad-hoc validation/schema checks—without resumability, sharding, or streaming. We wanted a unified, reusable graph abstraction that captures how data work actually happens (nodes/edges, subgraphs), automates quality tagging (heuristics + LLM-based scoring), and emits schema-conformant, OASST-style records—so teams can reproduce, audit, and evolve pipelines instead of rewriting glue code.
Design.
- Graph model: reusable subgraphs, branching, loops; deterministic configs
- Execution: pluggable model clients (vLLM/TGI/Azure/Ollama), Triton-compatible
- Data I/O: Hugging Face datasets (streaming), local files; schema & metadata tracking
- Reproducibility: explicit configs, seeds, artifact paths; CLI runs are fully logged
Use cases. Bootstrapping SFT/PO datasets; agent simulation and tool-use evals; multimodal assembly (image→Q&A, audio→text); and more.
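To make the graph model concrete, here is a minimal, generic sketch of the idea — nodes as callables over a record, edges with optional branch predicates. This is an illustration only, not SyGra's actual API; all names (`Node`, `Graph`, `connect`, etc.) are hypothetical.

```python
# Hypothetical sketch of a node/edge pipeline with branching (NOT SyGra's API).
from dataclasses import dataclass, field
from typing import Callable, Optional

Record = dict  # one data sample flowing through the graph

@dataclass
class Node:
    name: str
    fn: Callable[[Record], Record]  # transforms a record (e.g. generate, score)

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    # src name -> list of (predicate, dst name); first matching edge is taken
    edges: dict = field(default_factory=dict)

    def add(self, node: Node) -> "Graph":
        self.nodes[node.name] = node
        return self

    def connect(self, src: str, dst: str,
                when: Optional[Callable[[Record], bool]] = None) -> "Graph":
        self.edges.setdefault(src, []).append((when or (lambda r: True), dst))
        return self

    def run(self, start: str, record: Record) -> Record:
        current = start
        while current is not None:  # walk until no outgoing edge matches
            record = self.nodes[current].fn(record)
            current = next((dst for pred, dst in self.edges.get(current, [])
                            if pred(record)), None)
        return record

# Toy flow: generate -> score -> quality-gate branch (keep/drop).
g = (Graph()
     .add(Node("generate", lambda r: {**r, "answer": r["prompt"].upper()}))
     .add(Node("score", lambda r: {**r, "quality": len(r["answer"])}))
     .add(Node("keep", lambda r: {**r, "kept": True}))
     .add(Node("drop", lambda r: {**r, "kept": False}))
     .connect("generate", "score")
     .connect("score", "keep", when=lambda r: r["quality"] >= 3)
     .connect("score", "drop"))

out = g.run("generate", {"prompt": "hi there"})
```

In a real pipeline the node functions would call a model client (vLLM, TGI, etc.) and the quality gate would use heuristic or LLM-based scoring, but the control flow is the same shape.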
Links:
- Code (Apache-2.0) & README: github.com/ServiceNow/SyGra
- Paper (design rationale, examples): arxiv.org/abs/2508.15432
- PyPI: pypi.org/project/sygra/
Disclosure. I’m part of the team. Feedback, issues, and PRs welcome.