menily/schema v1

An open specification for task-level demonstration data for vision-language-action (VLA) models. Six top-level fields. Controlled vocabularies. Apache-2.0. Draft v1.

Repository: github.com/MenilyIntelligence/schema · Version: menily.task-demo/1 · License: Apache-2.0 · Status: Draft

Motivation

VLA models (π0, OpenVLA, GR00T N1, Gemini Robotics, Ψ-Zero, …) consume a specific form of data: a task-level trajectory where a natural-language goal is paired with a visual context and a sequence of actions that, together, constitute one semantic unit of robot behavior.

This is different from:

Raw video — no semantic boundaries
Motion capture files — no task annotation, no language
RLHF datasets — reward signals, not demonstrations
Teleoperation traces alone — no language grounding

There is no standard format for task-level demonstration data as of April 2026. Every lab invents its own. Transfer between labs is broken. Open datasets can't be pooled. Tooling can't be reused.

menily/schema is one attempt at a common ground.

Core fields (v1 draft)

{
  "schema_version": "menily.task-demo/1",
  "task_id": "uuid",
  "language": {
    "instruction": "Pour water from the blue cup into the kettle.",
    "language_code": "en",
    "variants": ["给水壶加水", "..."]
  },
  "visual": {
    "frames": "path/to/frames/",
    "fps": 30,
    "camera_intrinsics": { "fx": 1128.5, "fy": 1128.5, "cx": 960, "cy": 540 },
    "viewpoint": "ego"
  },
  "action": {
    "space": "ee_6dof",
    "trajectory": [ /* N × action_dim */ ],
    "timestamps": [ /* N */ ],
    "gripper": [ /* N × 1 binary or continuous */ ]
  },
  "body": {
    "morphology": "bimanual_humanoid",
    "dof_map": { "right_arm": [0,1,2,3,4,5,6], "left_arm": [7,8,9,10,11,12,13] }
  },
  "meta": {
    "source": "pov_video",
    "collection_region": "SEA",
    "collection_time": "2026-01-14T08:20:00Z",
    "quality_flags": ["no_slip", "no_contact_gap"]
  }
}

Field-by-field rationale

language

Natural-language instruction (instruction), ISO language code (language_code), and optional multilingual / paraphrase variants. The variants field is recommended-required — not optional. Multi-language paraphrase is near-zero marginal cost (LLM-generated) and critical for deployment robustness. Without paraphrases, a VLA trained on this data will not generalize across user instruction styles.

visual.viewpoint

Controlled vocabulary: ego | third-person | overhead. Ego-view and third-person-view are qualitatively different training signals for visual encoders — mixing them without labels produces a model that underperforms on both.

visual.camera_intrinsics

Required for ego-view data (from Quest, Vision Pro, PICO, GoPro, etc.). Optional for third-person. The 4-parameter intrinsics (fx, fy, cx, cy) are sufficient for most downstream uses.

action.space

Controlled vocabulary. v1 supports:

ee_6dof — end-effector 6 degrees of freedom (position + rotation)
joint_Ndof — joint space (N depends on the body)
whole_body_Mdof — whole-body including locomotion

A dataset file may contain only one action space — v1 does not permit mixing action spaces within a single file. This is a deliberate constraint. Implicit mixed spaces are the most common cause of silent training signal corruption.

body.morphology

Controlled vocabulary. Represents the physical class of the embodiment:

single_arm
bimanual
bimanual_humanoid
mobile_manipulator
quadruped
humanoid (single whole-body)

body.dof_map

Required. A mapping from named joint groups (right_arm, left_arm, torso, right_leg, ...) to the index positions in action.trajectory. Without this, cross-embodiment transfer is unrecoverable — there is no way to re-index the trajectory for a different morphology.

body.link_lengths

Recommended. A dictionary of link lengths in meters. Required by length-aware retargeting tools (AdaMorph, OmniRetarget, SPARK). Absent this field, retargeting quality degrades sharply.

meta.source

Controlled vocabulary:

pov_video — first-person video (e.g., recorded from Quest, Vision Pro, GoPro)
vr_demo — VR hand-tracking sessions
mocap — optical motion capture (BVH, FBX, C3D)
teleop — teleoperated robot sessions
sim_generated — synthesized from simulation

Different sources have qualitatively different noise characteristics. Downstream training pipelines that do not know the source cannot apply source-appropriate cleaning or loss weighting.

meta.collection_region

Geographic region of collection. Values: NA / EU / SEA / EA / SA / AF / OC. Promoted to a first-class field to make geographic distribution analysis and bias auditing a default practice, not an afterthought.

Out of scope for v1

Items deliberately excluded from v1:

Reward / return-to-go fields — this is not an RL data format. Use D4RL or RLDS for reinforcement learning.
Complete scene graphs — visual tokens come from frames; scene parsing is downstream.
Human biometric metadata — not collected, no field reserved.
Embedded URDF / MJCF — body morphology is a compact index; physics simulation models are referenced externally.

Interoperability

Downstream: RLDS / Open X-Embodiment

menily/schema tasks can be exported to RLDS-compatible episode bundles via Task.to_rlds(). Semantic annotations (language, viewpoint, morphology) are preserved in episode metadata. This allows menily/schema data to flow into any Open X-Embodiment-compatible training pipeline.

Downstream: HuggingFace Datasets

Task.to_hf_dataset() produces a datasets.Dataset object for use in HF-based training. Drop-in for existing HF pipelines.

Upstream: BONES-SEED / NVIDIA SOMA

The body.morphology and body.dof_map namespaces align with NVIDIA SOMA's canonical topology. BONES-SEED motion data can be consumed directly, with task-level semantic overlay added via menily/schema.

Bidirectional: from RLDS back

from_rlds() converts existing RLDS / Open X-Embodiment datasets into menily/schema format, augmenting them with task-level semantic information extracted from episode-level annotations.

Participation

Schema v1 is a draft, not a final standard. We expect two kinds of feedback:

Field-level critique — naming, semantics, granularity, splits and merges. GitHub Issues: schema/issues
Format mapping requests — if your team has existing data pipelines, email [email protected] to discuss mapping and interoperability.

menily/toolkit — reference Python library with three Adapters (POV / VR / MoCap) that produce menily/schema-compliant output
research notes — design rationale and longer-form discussion of schema decisions
Extended technical notes (CSDN, Chinese)
About Menily Intelligence

中文说明

menily/schema 是一份针对 VLA（视觉-语言-动作）模型训练的任务级示教数据规范。定义 task_id / language / visual / action / body / meta 六大顶层字段，统一异构数据源的格式，便于跨机构数据池化与跨具身迁移。

v1 是草案，不是最终版本。欢迎通过 GitHub Issues 提字段设计建议，或通过邮件 [email protected] 讨论现有数据格式与 menily/schema 的互转方案。