
Motivation. High-quality LLM datasets are scarce, costly, and often sensitive; teams also need fine-grained control over task structure (SFT/PO, tool use, multi-agent, multimodal). In practice, scaling “notebook pipelines” breaks down: you end up hand-wiring branching/looping flows, juggling multiple inference backends/APIs, and doing ad-hoc validation/schema checks—without resumability, sharding, or streaming. We wanted a unified, reusable graph abstraction that captures how data work actually happens (nodes/edges, subgraphs), automates quality tagging (heuristics + LLM-based scoring), and emits schema-conformant, OASST-style records—so teams can reproduce, audit, and evolve pipelines instead of rewriting glue code.
Design.
- Graph model: reusable subgraphs, branching, loops; deterministic configs
- Execution: pluggable model clients (vLLM/TGI/Azure/Ollama), Triton-compatible
- Data I/O: Hugging Face datasets (streaming), local files; schema & metadata tracking
- Reproducibility: explicit configs, seeds, artifact paths; CLI runs are fully logged
Use cases. Bootstrapping SFT/PO datasets; agent simulation and tool-use evals; multimodal assembly (image→Q&A, audio→text); and more.
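To make the graph model concrete, here is a minimal, generic sketch of the idea — nodes as callables over a record, edges with optional branch predicates. This is an illustration only, not SyGra's actual API; all names (`Node`, `Graph`, `connect`, etc.) are hypothetical.

```python
# Hypothetical sketch of a node/edge pipeline with branching (NOT SyGra's API).
from dataclasses import dataclass, field
from typing import Callable, Optional

Record = dict  # one data sample flowing through the graph

@dataclass
class Node:
    name: str
    fn: Callable[[Record], Record]  # transforms a record (e.g. generate, score)

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    # src name -> list of (predicate, dst name); first matching edge is taken
    edges: dict = field(default_factory=dict)

    def add(self, node: Node) -> "Graph":
        self.nodes[node.name] = node
        return self

    def connect(self, src: str, dst: str,
                when: Optional[Callable[[Record], bool]] = None) -> "Graph":
        self.edges.setdefault(src, []).append((when or (lambda r: True), dst))
        return self

    def run(self, start: str, record: Record) -> Record:
        current = start
        while current is not None:  # walk until no outgoing edge matches
            record = self.nodes[current].fn(record)
            current = next((dst for pred, dst in self.edges.get(current, [])
                            if pred(record)), None)
        return record

# Toy flow: generate -> score -> quality-gate branch (keep/drop).
g = (Graph()
     .add(Node("generate", lambda r: {**r, "answer": r["prompt"].upper()}))
     .add(Node("score", lambda r: {**r, "quality": len(r["answer"])}))
     .add(Node("keep", lambda r: {**r, "kept": True}))
     .add(Node("drop", lambda r: {**r, "kept": False}))
     .connect("generate", "score")
     .connect("score", "keep", when=lambda r: r["quality"] >= 3)
     .connect("score", "drop"))

out = g.run("generate", {"prompt": "hi there"})
```

In a real pipeline the node functions would call a model client (vLLM, TGI, etc.) and the quality gate would use heuristic or LLM-based scoring, but the control flow is the same shape.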
Links:
- Code (Apache-2.0) & README: github.com/ServiceNow/SyGra
- Paper (design rationale, examples): arxiv.org/abs/2508.15432
- PyPI: pypi.org/project/sygra/
Disclosure. I’m part of the team. Feedback, issues, and PRs welcome.