MetaOmics-10T: The Foundational Dataset to Unlock Causal Modeling of Microbial Ecosystems
We propose MetaOmics-10T, an openly shareable, foundational dataset to unlock AI-accelerated discovery in microbial ecosystems. The dataset directly enables three high-impact AI tasks: (1) forecasting ecosystem dynamics, (2) predicting counterfactual outcomes of interventions, and (3) inverse design of microbial therapies under safety constraints. MetaOmics-10T combines 10 trillion base pairs reclaimed from public archives using a Quality-Aware Tokenization (QA-Token) framework with 100,000+ interventional trajectories generated via model-guided experimental design. The result is a first-of-its-kind, probabilistic, intervention-ready corpus that addresses the principal bottleneck for causal modeling in microbiome science and provides an empirical testbed to assess the reach and limits of causal inference at scale.




