Research

Short, dated working notes and draft papers on embodied AI data infrastructure. Not polished publications — this is the thinking in progress. Updated as the work moves forward.

Draft papers

Task-Level Demonstration Data for Vision-Language-Action Models: A Survey of Schemas, Adapters, and Cross-Embodiment Transfer

Masashi. Menily Intelligence, Shenzhen, China. April 2026. Draft v0.1. 12 pages.

📄 Download PDF — Task-Level VLA Data Survey (258 KB, 12 pages)
Self-hosted preprint · BibTeX citation below · CC BY 4.0

Abstract. Training vision–language–action (VLA) models for embodied AI requires task-level demonstration data: records that couple a natural-language instruction, a visual context, an action trajectory, and a body morphology specification into a single semantically closed unit. While trajectory-level datasets (Open X-Embodiment, DROID) and motion-level datasets (BONES-SEED, AMASS) have reached a degree of standardization, the task-level semantic layer that sits between them remains fragmented. This fragmentation is the primary barrier to cross-institutional data pooling and cross-embodiment transfer.

The paper surveys twelve task-level demonstration data systems (2023–2026), spanning trajectory-level datasets (Open X-Embodiment, DROID, BridgeData V2, OXE-AugE), motion-level datasets (BONES-SEED, AMASS, LAFAN1), and end-to-end VLA pipelines (π0, OpenVLA, GR00T N1, Gemini Robotics, Ψ₀). It identifies a structural gap in the task-level semantic layer: no de facto standard exists at this layer, despite standardization at the layers above and below. It proposes menily/schema v1 as a candidate specification with controlled vocabularies for action space, viewpoint, morphology, and data source.

The paper also discusses open problems, including long-horizon task decomposition, multi-agent data representation, whole-body loco-manipulation boundaries, quality metrics for data ingestion, synthetic data provenance, and the governance of de facto standards.

40 references. Covers all major systems 2023–2026. Self-hosted preprint (not on arXiv).
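
The task-level unit described in the abstract can be pictured as a single record with a few controlled vocabularies. The sketch below is only an illustration under stated assumptions: the class names, field names, and vocabulary values are hypothetical and are not taken from menily/schema v1 or menily/toolkit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


# Hypothetical controlled vocabularies; the real lists live in the schema itself.
class ActionSpace(Enum):
    JOINT_POSITION = "joint_position"
    EE_POSE_DELTA = "ee_pose_delta"


class Viewpoint(Enum):
    EGOCENTRIC = "egocentric"
    THIRD_PERSON = "third_person"


class Morphology(Enum):
    SINGLE_ARM = "single_arm"
    HUMANOID = "humanoid"


class DataSource(Enum):
    TELEOPERATION = "teleoperation"
    SIMULATION = "simulation"


@dataclass(frozen=True)
class TaskDemo:
    """One task-level demonstration: instruction + visuals + actions + body."""
    instruction: str
    frame_uris: List[str]        # visual context, one entry per timestep
    viewpoint: Viewpoint
    actions: List[List[float]]   # action trajectory, one vector per timestep
    action_space: ActionSpace
    morphology: Morphology
    source: DataSource

    def __post_init__(self) -> None:
        if not self.instruction.strip():
            raise ValueError("instruction must be non-empty")
        if len(self.frame_uris) != len(self.actions):
            raise ValueError("need exactly one action per frame")
        if len({len(a) for a in self.actions}) > 1:
            raise ValueError("all action vectors must share one dimension")


def parse_demo(raw: dict) -> TaskDemo:
    """Build a TaskDemo from a plain dict; unknown vocabulary values raise ValueError."""
    return TaskDemo(
        instruction=raw["instruction"],
        frame_uris=list(raw["frame_uris"]),
        viewpoint=Viewpoint(raw["viewpoint"]),
        actions=[list(a) for a in raw["actions"]],
        action_space=ActionSpace(raw["action_space"]),
        morphology=Morphology(raw["morphology"]),
        source=DataSource(raw["source"]),
    )
```

Using enums for the vocabularies means an unrecognized value such as `"viewpoint": "sideways"` is rejected at parse time instead of silently entering a pooled dataset.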

Citation (BibTeX)

@misc{masashi2026tasklevel,
  author       = {Masashi},
  title        = {Task-Level Demonstration Data for Vision-Language-Action
                  Models: A Survey of Schemas, Adapters, and
                  Cross-Embodiment Transfer},
  year         = {2026},
  month        = {April},
  howpublished = {Menily Intelligence Research, self-hosted preprint},
  url          = {https://www.menily.ai/research/01-task-level-vla-data-survey.pdf},
  note         = {Draft v0.1}
}

Working notes

The data gap in embodied AI, stated precisely

April 2026. The bottleneck for generalist embodied agents in 2026 is not model capacity but the shape, resolution, and diversity of demonstration data. The note explains why hour counts mislead, what task-level data actually means, and where production data shortfalls bite.

Read on GitHub

Task-level abstraction: why frame-level annotation breaks VLA

April 2026. For VLA training, frame-level annotation is the wrong unit of work. Three failure modes: the action head gets the wrong target, task boundaries reconstructed post hoc are lossy, and per-frame language does not match the deployment distribution. Task-level labeling is cheaper in absolute terms and produces data that VLA models can actually consume.
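
One way the second failure mode can show up is sketched below: when task boundaries are recovered after the fact by grouping per-frame labels, a single noisy frame fragments one task into several. The labels and the `segments` helper are hypothetical and only illustrate the idea; they are not taken from the note.

```python
from itertools import groupby
from typing import List, Tuple


def segments(frame_labels: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse per-frame labels into (label, start, end) runs, end exclusive."""
    out = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        out.append((label, start, start + length))
        start += length
    return out


# Clean per-frame labels recover two tasks.
clean = ["pick"] * 4 + ["place"] * 3
# One mislabeled frame (index 2) splits the same recording into four segments.
noisy = ["pick", "pick", "place", "pick", "place", "place", "place"]

print(segments(clean))
print(segments(noisy))
```

Labeling the task boundary directly avoids this reconstruction step altogether.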

Read on GitHub

Cross-embodiment transfer in task-level demonstration data

April 2026. Why cross-embodiment transfer fails today (implicit action space, undocumented morphology, body-relative task representation) and what a transferable demonstration requires (explicit action space, morphology identifier with DoF map, task-relative reference frames, invariant landmarks). Also covers how menily/schema works together with the AdaMorph, OmniRetarget, and SPARK tools.
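
The difference between a body-relative and a task-relative representation can be shown with a small planar example. This is a minimal sketch assuming 2D poses (x, y, yaw); a real pipeline would use full 3D transforms, and the function names and numbers here are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Pose2D = Tuple[float, float, float]  # x, y, yaw in radians


def to_world(frame: Pose2D, p_local: Point) -> Point:
    """Map a point expressed in `frame` into world coordinates."""
    fx, fy, yaw = frame
    c, s = math.cos(yaw), math.sin(yaw)
    return (fx + c * p_local[0] - s * p_local[1],
            fy + s * p_local[0] + c * p_local[1])


def to_frame(frame: Pose2D, p_world: Point) -> Point:
    """Express a world point in `frame` (inverse of to_world)."""
    fx, fy, yaw = frame
    dx, dy = p_world[0] - fx, p_world[1] - fy
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * dx + s * dy, -s * dx + c * dy)


# A mug on the table defines the task frame; the grasp point is 10 cm along its x-axis.
mug: Pose2D = (1.0, 2.0, math.pi / 2)
grasp_world = to_world(mug, (0.1, 0.0))

# Two robots with different base poses reach for the same grasp point.
base_a: Pose2D = (0.0, 0.0, 0.0)
base_b: Pose2D = (2.0, 2.0, math.pi)

body_relative_a = to_frame(base_a, grasp_world)  # depends on where robot A stands
body_relative_b = to_frame(base_b, grasp_world)  # differs for robot B
task_relative = to_frame(mug, grasp_world)       # same for both robots
```

The body-relative coordinates differ between the two robots, while the task-relative coordinates depend only on the mug, which is what makes the demonstration reusable across embodiments.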

Read on GitHub

Technical design notes

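Design notes on a task-level demonstration data schema for VLA: the menily/schema v1 specification and its six fields (original title: VLA 任务级示教数据 schema 设计笔记：Menily/schema v1 规范与六字段解析)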

April 2026 · Chinese. Long-form technical walkthrough of menily/schema v1: why each field is defined the way it is, what's deliberately out of scope, how the schema interoperates with Open X-Embodiment / RLDS and BONES-SEED / SOMA. Originally published on CSDN.

Read on CSDN

Research notes citation

@misc{menily2026notes,
  author    = {Menily Intelligence},
  title     = {Research notes: data infrastructure for embodied AI},
  year      = {2026},
  url       = {https://github.com/MenilyIntelligence/research}
}

For the survey paper citation, see the BibTeX block in the "Draft papers" section above.

Contributing

These notes are deliberately early-stage. If you are building a VLA pipeline, a humanoid robotics data operation, or a retargeting toolchain and have spotted a factual error, a missing reference, or a disagreement with a judgment — we want to hear about it.

Issues: GitHub Issues
Direct feedback: [email protected]
Schema discussions: menily/schema Issues

Related

menily/schema v1 — the task-level demonstration data specification discussed in the survey
menily/toolkit — reference Python implementation for schema encoding/decoding
About Menily Intelligence — team, founder, and operational structure
