What if the binding constraint on the next wave of scientific AI is not compute, not talent, and not architecture but the data underneath it all? I suspect for many reading this, the instinctive response is that data is obviously important. But the gap between acknowledging its importance and actually investing in the infrastructure to make it usable is, as the past decade has taught us, rather wide.
A Genuinely New Class of AI
Over the past few years, something meaningfully different has been happening in scientific AI. Not the familiar story of language models getting larger, but something more specific and, I would argue, more consequential. AlphaFold predicted protein structures with atomic accuracy. Evo2 decoded the regulatory grammar of DNA across all domains of life. GNoME discovered 2.2 million new materials. These are not chatbots. They are not summarising papers or generating plausible-sounding text. They are learning the actual rules that govern natural phenomena.
The Institute for Progress recently gave this class of systems a useful name: Natural Law Models (NLMs) – AI models trained on experimental data or physics-based simulations, explicitly designed to learn underlying natural phenomena. Their architectures often map onto distinct processes within the system they are studying. They are domain-specific by design. And they are producing results that general-purpose language models can’t.
When you trace the lineage of every successful NLM, you do not arrive at a clever architecture or a massive GPU cluster. You arrive at a dataset. AlphaFold was trained on the Protein Data Bank and GenBank, both products of decades of publicly funded data infrastructure. Evo2 was trained on OpenGenome2, a curated assembly of genetic sequences. GNoME and MatterGen used the Materials Project database.
The IFP framework identifies three inputs for building NLMs: large, standardised, high-quality scientific datasets; experts with deep knowledge of both AI and the domain; and substantial compute. Of these three, data is the binding constraint in most fields – not necessarily the most expensive constraint, but the binding one. The one that determines whether a model gets built at all.
Leash Bio’s Hermes model illustrates the point well. Even a relatively simple transformer architecture can compete with state-of-the-art structure-based methods when trained on sufficient high-quality protein-ligand interaction data. Data quality and diversity matter more than model sophistication.
The FAIR Gap
This year marks 10 years since “The FAIR Guiding Principles for scientific data management and stewardship” were published. The State of Open Data report, which has now analysed over 43,000 researcher responses across 10 years, reveals a persistent gap. While 85% of researchers believe open data is important for science, fewer than 30% describe their own data as truly FAIR-compliant.
The pharmaceutical industry offers a particularly instructive case. Billions have been invested in the promise of AI-driven drug discovery. Yet in what the Pistoia Alliance calls the “scientific content crisis” in pharma, critical research data remains trapped in PDFs, siloed in proprietary databases, disconnected from the experimental context that gives it meaning. This fragmentation is the primary barrier to building NLMs for drug-target interaction prediction, toxicology modelling, and clinical outcome forecasting. One cannot learn the natural laws governing drug-protein interactions if half the relevant data is buried in supplementary materials that no machine can read.
Knowledge Graphs: The Connective Tissue
FAIR compliance is necessary but not sufficient. Scientific data also needs to be connected. A gene variant in ClinVar, a drug interaction in ChEMBL, a pathway annotation in Reactome, and a population frequency in gnomAD are individually useful. Linked together in a knowledge graph, they become the substrate for a class of reasoning that neither isolated databases nor language models can achieve alone.
Knowledge graphs provide the semantic layer that transforms FAIR-compliant data into NLM training sets. They encode not just facts but relationships: that a gene encodes a protein that interacts with a drug that treats a disease with a known population prevalence. This relational structure is precisely what NLMs need to learn the “natural laws” governing biological systems.
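To make the relational structure concrete, here is a minimal sketch of a knowledge graph as subject-predicate-object triples with a multi-hop query over them. This uses plain Python rather than a real triple store, and all identifiers (GENE_X, DRUG_Y, and so on) are illustrative placeholders, not entries from ClinVar, ChEMBL, or Reactome.

```python
# A toy knowledge graph: each triple is (subject, predicate, object).
# Placeholder identifiers stand in for real database entries.
triples = [
    ("GENE_X", "encodes", "PROT_X"),
    ("DRUG_Y", "interacts_with", "PROT_X"),
    ("DRUG_Y", "treats", "DISEASE_Z"),
    ("DISEASE_Z", "has_prevalence", "0.004"),
]

def objects(subject, predicate):
    """Return all objects linked to `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def drugs_targeting_gene(gene):
    """Multi-hop query: gene -> encoded protein -> drugs binding that protein."""
    drugs = []
    for protein in objects(gene, "encodes"):
        drugs.extend(s for s, p, o in triples
                     if p == "interacts_with" and o == protein)
    return drugs

print(drugs_targeting_gene("GENE_X"))  # ['DRUG_Y']
```

The point is the traversal: no single triple answers the question “which drugs target this gene?”, but the linked structure does – which is exactly what isolated databases cannot offer.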
Platforms such as metaphacts’ metaphactory make this semantic layer operational. metaphactory is an enterprise knowledge graph platform that supports semantic knowledge modeling and knowledge discovery, enabling organizations to integrate heterogeneous datasets into a unified, ontology-driven graph that becomes the backbone for AI-native applications and exploration workflows. By providing a structured semantic model and an environment for discovery and collaboration, it turns disconnected life sciences data into a coherent, machine-interpretable foundation for reasoning systems.

For organizations that need a high-quality scientific backbone out of the box, the Dimensions Knowledge Graph, powered by metaphactory, provides a ready-made research graph built on more than 32 billion structured statements derived from the Dimensions global research database and public data sources such as STRING and UMLS, enriched with domain ontologies. This gives teams an immediate, large-scale relational substrate that can be extended with proprietary assay data, clinical results, or real-world evidence, accelerating the transition from fragmented resources to a unified knowledge infrastructure.
The most effective AI systems for biology are not the ones with the most parameters. They are the ones with the best data infrastructure. As AI agents become central to biological research, how do we know these systems are actually working?
Phylo’s recent analysis of biology agent benchmarks reveals a challenge: failures in biology are silent. A flawed analysis can produce plausible-looking results that propagate through research pipelines for months before anyone catches the error. This is not like a chatbot hallucinating a historical date; this is a computational analysis producing a biologically plausible but incorrect result that informs real experimental decisions. Once that semantic backbone exists, AI systems must be grounded in it.
Metis, metaphacts’ knowledge-driven AI platform, combines large language models with knowledge graphs to deliver semantically precise, contextualized insights. By explicitly grounding AI outputs in the underlying semantic layer and providing built-in quality control and explainability, it reduces hallucinations and makes it possible to trace how answers were generated. In biology, where failures are often silent and plausible-looking results can propagate through pipelines before errors are detected, this traceability is essential.
The response from the field has been the development of trace-based evaluation – scoring agents on their analytical process, not just their final answers. Phylo’s BiomniBench framework evaluates five facets – data handling, tool selection, statistical rigour, source reliability, and reasoning chain coherence. This mirrors how science itself works: peer review scrutinises methods and reasoning, not just conclusions. An agent’s analytical trace is only auditable if every data source is identified, every transformation is logged, and every inference is traceable to its evidential basis. This is, in effect, a restatement of the FAIR principles applied to AI reasoning itself. Provenance is not a nice-to-have. It is the foundation of trustworthy AI in science.
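The auditability requirement above – every source identified, every transformation logged, every inference traceable – can itself be checked mechanically. The sketch below assumes a simplified trace format of my own devising (a list of step dictionaries with `source`, `tool`, and `evidence` fields); it is not BiomniBench’s actual schema or rubric, just an illustration of provenance auditing on a trace.

```python
def audit_trace(trace):
    """Flag steps whose provenance is incomplete.

    A step is auditable only if it names its data source, the tool or
    transformation applied, and the evidence its inference rests on.
    Returns the indices of steps failing that test.
    """
    required = ("source", "tool", "evidence")
    return [i for i, step in enumerate(trace)
            if not all(key in step for key in required)]

# Hypothetical two-step trace: the second step never logged its inputs.
trace = [
    {"source": "gnomAD", "tool": "allele_freq_lookup", "evidence": "variant record"},
    {"tool": "t_test"},  # missing source and evidence: not auditable
]
print(audit_trace(trace))  # [1]
```

A silent failure of the kind described above would sail through a final-answer check; a provenance audit like this at least guarantees there is a trail to scrutinise.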
An Asymmetric Bet for Pharma
Organisations that invest in FAIR data infrastructure and knowledge graphs today are not merely improving their current research workflows – they are building the training data for tomorrow’s NLMs. Even if general-purpose LLMs eventually supersede domain-specific NLMs (the “bitter lesson” argument), the underlying datasets will still be essential as training data, fine-tuning corpora, or evaluation benchmarks.
The practical implications follow directly. Pharmaceutical companies should invest in connecting their internal experimental data to public knowledge graphs using standardised ontologies and persistent identifiers. Research funders should prioritise data infrastructure alongside experiment funding. Technology providers should build knowledge graph platforms that make FAIR compliance the default, not the exception.
AlphaFold succeeded because the Protein Data Bank existed. Decades of painstaking, publicly funded data curation created the substrate on which a model could learn the rules of protein folding. The question for every scientific field is “what is our Protein Data Bank, and are we building it?” The organisations that answer this question first will define the trajectory of AI-driven discovery for the next decade.
The data infrastructure being built today is not about better workflows or compliance checklists. It is the foundation layer for a class of AI that is already producing genuine scientific breakthroughs.
The models will keep improving. But they can only be as good as the data they are trained on.
The post The Data Foundation for Natural Law Models appeared first on Digital Science.