menily/schema v1
An open specification for task-level demonstration data for vision-language-action (VLA) models. Six top-level fields. Controlled vocabularies. Apache-2.0. Draft v1.
Repository: github.com/MenilyIntelligence/schema ·
Version: menily.task-demo/1 ·
License: Apache-2.0 ·
Status: Draft
Motivation
VLA models (π0, OpenVLA, GR00T N1, Gemini Robotics, Ψ-Zero, …) consume a specific form of data: a task-level trajectory where a natural-language goal is paired with a visual context and a sequence of actions that, together, constitute one semantic unit of robot behavior.
This is different from:
- Raw video — no semantic boundaries
- Motion capture files — no task annotation, no language
- RLHF datasets — reward signals, not demonstrations
- Teleoperation traces alone — no language grounding
There is no standard format for task-level demonstration data as of April 2026. Every lab invents its own. Transfer between labs is broken. Open datasets can't be pooled. Tooling can't be reused.
menily/schema is one attempt at a common ground.
Core fields (v1 draft)
{
"schema_version": "menily.task-demo/1",
"task_id": "uuid",
"language": {
"instruction": "Pour water from the blue cup into the kettle.",
"language_code": "en",
"variants": ["给水壶加水", "..."]
},
"visual": {
"frames": "path/to/frames/",
"fps": 30,
"camera_intrinsics": { "fx": 1128.5, "fy": 1128.5, "cx": 960, "cy": 540 },
"viewpoint": "ego"
},
"action": {
"space": "ee_6dof",
"trajectory": [ /* N × action_dim */ ],
"timestamps": [ /* N */ ],
"gripper": [ /* N × 1 binary or continuous */ ]
},
"body": {
"morphology": "bimanual_humanoid",
"dof_map": { "right_arm": [0,1,2,3,4,5,6], "left_arm": [7,8,9,10,11,12,13] }
},
"meta": {
"source": "pov_video",
"collection_region": "SEA",
"collection_time": "2026-01-14T08:20:00Z",
"quality_flags": ["no_slip", "no_contact_gap"]
}
}
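The controlled vocabularies above can be checked mechanically. Below is a minimal validation sketch; the vocabularies are copied from this draft, but the helper itself (`validate_task`) is illustrative and not part of the reference toolkit:

```python
# Minimal validation sketch for a menily.task-demo/1 record.
# Vocabularies are taken from the v1 draft; validate_task is a
# hypothetical helper, not an official API.

VIEWPOINTS = {"ego", "third-person", "overhead"}
ACTION_SPACE_PREFIXES = ("ee_6dof", "joint_", "whole_body_")
SOURCES = {"pov_video", "vr_demo", "mocap", "teleop", "sim_generated"}
TOP_LEVEL = {"schema_version", "task_id", "language",
             "visual", "action", "body", "meta"}

def validate_task(task: dict) -> list[str]:
    """Return a list of human-readable schema violations (empty = valid)."""
    errors = []
    if task.get("schema_version") != "menily.task-demo/1":
        errors.append("schema_version must be 'menily.task-demo/1'")
    missing = TOP_LEVEL - task.keys()
    if missing:
        errors.append(f"missing top-level fields: {sorted(missing)}")
        return errors
    if task["visual"].get("viewpoint") not in VIEWPOINTS:
        errors.append("visual.viewpoint not in controlled vocabulary")
    if not task["action"].get("space", "").startswith(ACTION_SPACE_PREFIXES):
        errors.append("action.space not in controlled vocabulary")
    if task["meta"].get("source") not in SOURCES:
        errors.append("meta.source not in controlled vocabulary")
    return errors
```

A real implementation would also check nested required fields (language.instruction, body.dof_map, and so on); the sketch only covers the top-level invariants.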
Field-by-field rationale
language
Natural-language instruction (instruction), ISO language code (language_code), and multilingual / paraphrase variants. The variants field is formally optional but should be treated as required in practice: multi-language paraphrase is near-zero marginal cost (LLM-generated) and critical for deployment robustness. Without paraphrases, a VLA trained on this data will not generalize across user instruction styles.

visual.viewpoint
Controlled vocabulary: ego | third-person | overhead. Ego-view and third-person-view are qualitatively different training signals for visual encoders — mixing them without labels produces a model that underperforms on both.
visual.camera_intrinsics
Required for ego-view data (from Quest, Vision Pro, PICO, GoPro, etc.). Optional for third-person. The 4-parameter intrinsics (fx, fy, cx, cy) are sufficient for most downstream uses.
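To make the 4-parameter claim concrete, here is the standard pinhole projection these intrinsics support, mapping a camera-frame 3D point to pixel coordinates. The function is an illustrative sketch, not a toolkit API:

```python
def project_point(x: float, y: float, z: float,
                  fx: float, fy: float, cx: float, cy: float) -> tuple[float, float]:
    """Project a camera-frame 3D point to pixel coordinates with the
    standard pinhole model: u = fx*x/z + cx, v = fy*y/z + cy."""
    return (fx * x / z + cx, fy * y / z + cy)

# A point on the optical axis projects to the principal point:
# project_point(0, 0, 1.0, 1128.5, 1128.5, 960, 540) -> (960.0, 540.0)
```

Lens distortion coefficients are deliberately absent from v1; if a capture device needs them, they would have to go in an extension field.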
action.space
Controlled vocabulary. v1 supports:
- ee_6dof — end-effector 6 degrees of freedom (position + rotation)
- joint_Ndof — joint space (N depends on the body)
- whole_body_Mdof — whole-body including locomotion
A dataset file may contain only one action space — v1 does not permit mixing action spaces within a single file. This is a deliberate constraint. Implicit mixed spaces are the most common cause of silent training signal corruption.
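The loader-side checks this constraint implies can be sketched as follows (a hypothetical helper, not part of the reference toolkit), enforcing one consistent action_dim and aligned array lengths within an action block:

```python
def check_action_block(action: dict) -> None:
    """Enforce the v1 invariants on an `action` block:
    trajectory / timestamps / gripper all have length N, and every
    trajectory step has the same action_dim (no implicit mixed spaces)."""
    n = len(action["trajectory"])
    assert len(action["timestamps"]) == n, "timestamps must have length N"
    assert len(action["gripper"]) == n, "gripper must have length N"
    dims = {len(step) for step in action["trajectory"]}
    assert len(dims) <= 1, "all trajectory steps must share one action_dim"
```

Because action.space is a single string per file, the schema itself rules out mixed spaces; the runtime check above catches the corresponding corruption in the numeric payload.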
body.morphology
Controlled vocabulary. Represents the physical class of the embodiment:
- single_arm
- bimanual
- bimanual_humanoid
- mobile_manipulator
- quadruped
- humanoid (single whole-body)
body.dof_map
Required. A mapping from named joint groups (right_arm, left_arm, torso, right_leg, ...) to the index positions in action.trajectory. Without this, cross-embodiment transfer is unrecoverable — there is no way to re-index the trajectory for a different morphology.
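The re-indexing step that dof_map enables can be sketched in a few lines; select_group is an illustrative helper, not a toolkit API:

```python
def select_group(trajectory: list[list[float]],
                 dof_map: dict[str, list[int]],
                 group: str) -> list[list[float]]:
    """Slice a trajectory down to one named joint group using body.dof_map.
    This is the re-indexing that cross-embodiment transfer depends on."""
    idx = dof_map[group]
    return [[step[i] for i in idx] for step in trajectory]
```

With the bimanual example above, select_group(traj, dof_map, "right_arm") extracts columns 0-6 of every step, which a retargeting tool can then map onto a different morphology's right-arm indices.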
body.link_lengths
Recommended. A dictionary of link lengths in meters. Required by length-aware retargeting tools (AdaMorph, OmniRetarget, SPARK). Absent this field, retargeting quality degrades sharply.
meta.source
Controlled vocabulary:
- pov_video — first-person video (e.g., recorded from Quest, Vision Pro, GoPro)
- vr_demo — VR hand-tracking sessions
- mocap — optical motion capture (BVH, FBX, C3D)
- teleop — teleoperated robot sessions
- sim_generated — synthesized from simulation
Different sources have qualitatively different noise characteristics. Downstream training pipelines that do not know the source cannot apply source-appropriate cleaning or loss weighting.
meta.collection_region
Geographic region of collection. Values: NA / EU / SEA / EA / SA / AF / OC. Promoted to a first-class field to make geographic distribution analysis and bias auditing a default practice, not an afterthought.
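A distribution audit over this field reduces to a histogram. Below is a minimal sketch of such an audit; region_histogram is a hypothetical helper, and the "OTHER" bucket is a convention chosen here, not part of the spec:

```python
from collections import Counter

REGIONS = {"NA", "EU", "SEA", "EA", "SA", "AF", "OC"}

def region_histogram(tasks: list[dict]) -> Counter:
    """Count tasks per meta.collection_region; any value outside the
    controlled vocabulary is bucketed under 'OTHER'."""
    return Counter(
        t["meta"].get("collection_region")
        if t["meta"].get("collection_region") in REGIONS else "OTHER"
        for t in tasks
    )
```

Comparing this histogram against a target distribution is the "default practice" bias audit the field is meant to enable.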
Out of scope for v1
Items deliberately excluded from v1:
- Reward / return-to-go fields — this is not an RL data format. Use D4RL or RLDS for reinforcement learning.
- Complete scene graphs — visual tokens come from frames; scene parsing is downstream.
- Human biometric metadata — not collected, no field reserved.
- Embedded URDF / MJCF — body morphology is a compact index; physics simulation models are referenced externally.
Interoperability
Downstream: RLDS / Open X-Embodiment
menily/schema tasks can be exported to RLDS-compatible episode bundles via Task.to_rlds(). Semantic annotations (language, viewpoint, morphology) are preserved in episode metadata. This allows menily/schema data to flow into any Open X-Embodiment-compatible training pipeline.
Downstream: HuggingFace Datasets
Task.to_hf_dataset() produces a datasets.Dataset object for use in HF-based training. Drop-in for existing HF pipelines.
Upstream: BONES-SEED / NVIDIA SOMA
The body.morphology and body.dof_map namespaces align with NVIDIA SOMA's canonical topology. BONES-SEED motion data can be consumed directly, with task-level semantic overlay added via menily/schema.
Bidirectional: from RLDS back
from_rlds() converts existing RLDS / Open X-Embodiment datasets into menily/schema format, augmenting them with task-level semantic information extracted from episode-level annotations.
Participation
Schema v1 is a draft, not a final standard. We expect two kinds of feedback:
- Field-level critique — naming, semantics, granularity, splits and merges. GitHub Issues: schema/issues
- Format mapping requests — if your team has existing data pipelines, email [email protected] to discuss mapping and interoperability.
Related
- menily/toolkit — reference Python library with three Adapters (POV / VR / MoCap) that produce menily/schema-compliant output
- research notes — design rationale and longer-form discussion of schema decisions
- Extended technical notes (CSDN, Chinese)
- About Menily Intelligence
Summary (translated from Chinese)
menily/schema is a task-level demonstration data specification for training VLA (vision-language-action) models. It defines six top-level fields (task_id / language / visual / action / body / meta), unifies the formats of heterogeneous data sources, and enables cross-institution data pooling and cross-embodiment transfer.
v1 is a draft, not a final version. Field-design suggestions are welcome via GitHub Issues; to discuss converting existing data formats to and from menily/schema, email [email protected].