Research
Short, dated working notes and draft papers on embodied AI data infrastructure. Not polished publications — this is the thinking in progress. Updated as the work moves forward.
Draft papers
Task-Level Demonstration Data for Vision-Language-Action Models: A Survey of Schemas, Adapters, and Cross-Embodiment Transfer
Masashi. April 2026. Draft v0.1.
A comprehensive survey of twelve task-level demonstration data systems (2023–2026), spanning trajectory-level datasets (Open X-Embodiment, DROID, BridgeData V2, OXE-AugE), motion-level datasets (BONES-SEED, AMASS, LAFAN1), and end-to-end VLA pipelines (π0, OpenVLA, GR00T N1, Gemini Robotics, Ψ₀). The survey identifies a structural gap at the task-level semantic layer: despite standardization at the layers above and below, no de-facto standard exists at this layer. It proposes menily/schema v1 as a candidate specification with controlled vocabularies for action space, viewpoint, morphology, and data source.
Discusses open problems including long-horizon task decomposition, multi-agent data representation, whole-body loco-manipulation boundaries, quality metrics for data ingestion, synthetic data provenance, and the governance of de-facto standards.
40 references. Covers all major systems 2023–2026. Self-hosted preprint, not yet on arXiv.
Working notes
The data gap in embodied AI, stated precisely
April 2026. The bottleneck for generalist embodied agents in 2026 is not model capacity — it is the shape, resolution, and diversity of demonstration data. Why hour-counts mislead. What task-level data actually means. Where production data shortfalls bite.
Task-level abstraction: why frame-level annotation breaks VLA
April 2026. For VLA training, frame-level annotation is the wrong unit of work. Three failure modes: the action head gets the wrong target, task boundaries become lossy post-hoc, and per-frame language does not match deployment distribution. Task-level labeling is cheaper in absolute terms and produces data VLA can actually consume.
Cross-embodiment transfer in task-level demonstration data
April 2026.
Why cross-embodiment transfer fails today (implicit action space, undocumented morphology, body-relative task representation) and what a transferable demonstration requires (explicit action space, morphology identifier with DoF map, task-relative reference frames, invariant landmarks). How menily/schema and AdaMorph / OmniRetarget / SPARK tools work together.
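As a rough illustration of what "a transferable demonstration requires", the four requirements above can be sketched as a record type. This is a minimal sketch only: the class and field names below are assumptions for illustration, not the menily/schema v1 specification or the menily/toolkit API.

```python
from dataclasses import dataclass, field

# Illustrative sketch only; names are assumptions, not the menily/schema v1 spec.

@dataclass
class Morphology:
    """Explicit embodiment identifier with a DoF map (requirement 2)."""
    identifier: str           # e.g. "franka_panda_7dof"
    dof_map: dict[str, int]   # joint name -> index in the action vector

@dataclass
class Demonstration:
    action_space: str                  # requirement 1: explicit action space
    morphology: Morphology             # requirement 2: morphology + DoF map
    reference_frame: str               # requirement 3: task-relative frame
    landmarks: list[str] = field(default_factory=list)  # requirement 4: invariant landmarks

demo = Demonstration(
    action_space="ee_delta_pose",
    morphology=Morphology("franka_panda_7dof", {f"joint{i}": i for i in range(7)}),
    reference_frame="object_centric",
    landmarks=["mug_handle", "table_edge"],
)
```

The point of the sketch is that each requirement becomes a mandatory, machine-readable field, so a retargeting tool can check compatibility before attempting transfer rather than failing silently on implicit assumptions.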
Technical design notes
Design notes on a task-level demonstration data schema for VLA: the Menily/schema v1 specification and its six fields explained
April 2026 · Written in Chinese.
Long-form technical walkthrough of menily/schema v1: why each field is defined the way it is, what's deliberately out of scope, how the schema interoperates with Open X-Embodiment / RLDS and BONES-SEED / SOMA. Originally published on CSDN.
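The controlled-vocabulary design described above can be sketched as a simple record validator. The vocabulary values below are invented placeholders for illustration, not the actual v1 enumerations; only the four vocabulary names (action space, viewpoint, morphology, data source) come from the survey description.

```python
# Sketch of controlled-vocabulary validation; the allowed values are
# placeholders, not the actual menily/schema v1 enumerations.

VOCABULARIES = {
    "action_space": {"joint_position", "joint_velocity", "ee_delta_pose"},
    "viewpoint": {"wrist", "third_person", "egocentric"},
    "morphology": {"single_arm", "bimanual", "humanoid"},
    "data_source": {"teleoperation", "scripted", "synthetic"},
}

def validate(record: dict) -> list[str]:
    """Return a list of validation errors for a task-level record."""
    errors = []
    for field_name, allowed in VOCABULARIES.items():
        value = record.get(field_name)
        if value is None:
            errors.append(f"missing field: {field_name}")
        elif value not in allowed:
            errors.append(f"{field_name}={value!r} not in vocabulary")
    return errors

record = {"action_space": "ee_delta_pose", "viewpoint": "wrist",
          "morphology": "single_arm", "data_source": "teleoperation"}
```

Closed vocabularies like this are what make cross-dataset aggregation mechanical: a downstream pipeline can filter or group demonstrations by these fields without per-dataset string normalization.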
Citation
@misc{menily2026survey,
  author       = {Masashi},
  title        = {Task-Level Demonstration Data for Vision-Language-Action Models:
                  A Survey of Schemas, Adapters, and Cross-Embodiment Transfer},
  year         = {2026},
  howpublished = {Menily Intelligence Research},
  url          = {https://www.menily.ai/research/}
}
@misc{menily2026notes,
  author = {Menily Intelligence},
  title  = {Research notes: data infrastructure for embodied AI},
  year   = {2026},
  url    = {https://github.com/MenilyIntelligence/research}
}
Contributing
These notes are deliberately early-stage. If you are building a VLA pipeline, a humanoid robotics data operation, or a retargeting toolchain and have spotted a factual error or a missing reference, or you disagree with a judgment, we want to hear from you.
- Issues: GitHub Issues
- Direct feedback: [email protected]
- Schema discussions: menily/schema Issues
Related
- menily/schema v1 — the task-level demonstration data specification discussed in the survey
- menily/toolkit — reference Python implementation for schema encoding/decoding
- About Menily Intelligence — team, founder, and operational structure