New Frontiers for Scientific Foundation Models

We describe the motivation for the curated portfolio of 100 foundation model concepts that we are developing.

The prevailing narrative of artificial intelligence has been dominated by foundation models that emulate human cognition, primarily through language and image generation. While transformative, this represents only the first wave of a far more profound technological revolution. This report posits that the next, and arguably more significant, frontier for foundation models lies in their application to fundamental scientific discovery. The emergence of these Scientific Foundation Models (SciFMs) is the catalyst that fully realizes the "Fourth Paradigm" of science—a new era where data-intensive, AI-driven exploration becomes a primary mode of discovery, standing alongside the traditional pillars of theory, experimentation, and simulation.

This analysis identifies four key domains ripe for this transformation: the Physical Sciences, where models can navigate the vast combinatorial space of materials and molecules and even discover new physical laws; the Biological Sciences, where they can unravel the multi-scale complexity of life from genomics to whole-organism behavior; Complex System Simulation, where they can act as high-fidelity surrogates to model intractable systems like climate and turbulence; and Emergent Social Dynamics, where they can simulate in-silico societies to understand the complex interplay between individual actions and collective outcomes.

To chart a course toward this future, this report presents a curated portfolio of 100 high-potential foundation model concepts, each designed to tackle a specific, high-impact scientific challenge. However, realizing this vision is contingent on overcoming the primary bottleneck: the availability of large-scale, structured scientific data. The core strategic recommendation of this report is therefore a concerted, multi-stakeholder effort to build open, large-scale "Data Commons" for key scientific fields. This initiative, coupled with a strategy that creates a virtuous cycle between simulation and experimentation and fosters deeply integrated, cross-disciplinary research teams, forms the critical path to unlocking the unprecedented potential of these models. The successful development of SciFMs will not merely accelerate research; it will fundamentally redefine the scientific method, promising to advance human knowledge at a rate unprecedented in history.


Part I: The New Scientific Revolution - From Data to Discovery

This introductory section establishes the report's conceptual framework, differentiating scientific foundation models (SciFMs) from their well-known generative AI counterparts and positioning them as the primary tool for the new, data-intensive paradigm of science.

1.1 The Dawn of the Fourth Paradigm

The history of science can be understood as a succession of paradigms, each defined by its dominant methodology. The first paradigm was empirical, characterized by the observation and description of natural phenomena. The second was theoretical, marked by the use of models and generalizations, epitomized by Newton's laws and Maxwell's equations. The third paradigm, which emerged in the latter half of the 20th century, was computational, leveraging computer simulations to explore complex phenomena that were analytically intractable.1 Today, we are witnessing the consolidation of a fourth and profoundly different paradigm: data-intensive scientific discovery.1

This Fourth Paradigm is defined by its reliance on advanced computing capabilities to analyze, manage, and explore massive datasets generated from simulations, experiments, and sensors.1 The speed at which any scientific discipline now advances is directly dependent on its ability to harness these vast data streams.1 This shift is not merely about processing more data; it represents a fundamental change in the scientific process itself, one that requires new tools, novel research methodologies, and new modes of collaboration between domain scientists and technologists.1

While this paradigm has been emerging for over a decade, it is the recent maturation of foundation models that provides the technological catalyst to fully realize its potential. Foundation models are, by definition, large-scale models trained on vast datasets that can be adapted to a wide range of downstream tasks.5 Their ability to learn general representations from broad data makes them uniquely suited to the challenges and opportunities of data-intensive science. Therefore, it is most accurate to view foundation models not as just another tool within the Fourth Paradigm, but as the enabling engine that makes its most ambitious goals—from autonomous discovery of physical laws to the high-fidelity simulation of entire biological organisms—achievable. This reframes the conversation from "using AI in science" to "AI as the new platform for science."

1.2 Defining the Scientific Foundation Model (SciFM): Beyond Plausibility to Physicality

A foundation model is formally defined as a large, deep learning model pre-trained on a broad spectrum of data, often through self-supervision, which can then be fine-tuned or adapted to perform a wide variety of specific tasks.5 Prominent examples like GPT-3 and BERT have demonstrated remarkable capabilities in processing and generating human language.6 However, to apply this paradigm to scientific discovery, a critical distinction must be made between these existing models and the next generation of Scientific Foundation Models (SciFMs).

Generative AI models like large language models (LLMs) are optimized for linguistic plausibility and coherence. Their objective is to generate outputs that are statistically likely given the patterns in their training data, which consists primarily of human-generated text and images. In contrast, SciFMs must be optimized for physical validity and empirical verifiability. Their success is not measured by their ability to conduct a human-like conversation but by their capacity to generate novel, valid hypotheses, accurately predict experimental outcomes, and accelerate research and development cycles.9 While an LLM might "hallucinate" plausible but incorrect information with relatively low consequence, a materials science model that hallucinates a physically impossible crystal structure or a medical model that proposes a toxic drug molecule is fundamentally failing its core purpose. SciFMs must be rigorously grounded in the laws of nature.12

This distinction gives rise to a central design tension in the development of SciFMs. On one hand, the naive application of machine learning to scientific data can lead to erroneous, "Aristotelian" conclusions, such as discovering that heavier objects fall faster because the model has not been constrained by the concept of a vacuum.12 This suggests the need for "physics-informed" AI, where known physical laws and constraints are embedded into the model's architecture and training process to ensure its outputs are physically sound. On the other hand, pre-training a model too heavily on the existing body of scientific knowledge may introduce powerful "inductive biases".14 These biases, while ensuring consistency with current theories, could fundamentally limit the model's ability to discover truly novel phenomena or physical laws that lie outside of, or even contradict, our present understanding. Navigating this trade-off—between ensuring physical realism and enabling genuine discovery—is a core research and development challenge that will define the field of scientific AI.
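
To make the physics-informed side of this trade-off concrete, the following is a minimal sketch (assuming PyTorch, with a toy free-fall dataset) of the standard recipe: a data-fitting loss is combined with a residual term that penalizes violations of a known law, here d²y/dt² = −g, so the model cannot learn a mass-dependent fall rate from noisy observations.

```python
# A minimal physics-informed training loop, assuming PyTorch.
# The network maps time t to height y(t); a data term fits noisy observations,
# while a physics residual enforces d2y/dt2 = -g at collocation points.
import torch

g = 9.81
net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Synthetic noisy observations of an object dropped from 10 m: y(t) = 10 - 0.5*g*t^2
t_obs = torch.linspace(0, 1, 20).unsqueeze(1)
y_obs = 10.0 - 0.5 * g * t_obs**2 + 0.05 * torch.randn_like(t_obs)

# Collocation points where the physics residual d2y/dt2 + g = 0 is enforced
t_col = torch.linspace(0, 1, 100).unsqueeze(1).requires_grad_(True)

for step in range(2000):
    opt.zero_grad()
    data_loss = ((net(t_obs) - y_obs) ** 2).mean()

    y = net(t_col)
    dy = torch.autograd.grad(y.sum(), t_col, create_graph=True)[0]
    d2y = torch.autograd.grad(dy.sum(), t_col, create_graph=True)[0]
    physics_loss = ((d2y + g) ** 2).mean()

    (data_loss + 0.1 * physics_loss).backward()
    opt.step()
```

The weighting between the two terms is exactly where the design tension lives: a heavier physics penalty guarantees consistency with known theory at the cost of suppressing signals that might contradict it.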

1.3 A Comparative Framework

To crystallize the unique characteristics of SciFMs, the following framework provides a direct comparison with the generative AI models that currently dominate the public and commercial landscape. Understanding these differences is essential for formulating appropriate R&D strategies, investment theses, and evaluation metrics for this new class of AI.

Table 1: A Comparative Framework for Foundation Models

Dimension | Generative AI (Language/Image) | Scientific Foundation Models (SciFMs)
Primary Objective | Plausibility & Coherence | Verifiability & Discovery
Core Data Modalities | Unstructured Web Text/Images | Structured Experimental/Simulation Data (e.g., genomic sequences, sensor time-series, molecular graphs, simulation outputs)
Validation & Grounding | Human preference & internal consistency | Empirical experimentation & adherence to physical laws
Key Technical Challenges | Hallucination & bias mitigation | Physical constraint satisfaction, causality inference, uncertainty quantification
Measure of Success | User engagement & task completion | Novel discoveries, predictive accuracy, accelerated R&D cycles

Part II: A Sector-by-Sector Analysis of High-Potential Domains

This section provides the analytical core of the report, examining in detail the most promising scientific and engineering fields for the development and application of Scientific Foundation Models. Each chapter outlines the core challenges of the domain, the nature of its data landscape, and the specific opportunities for transformative impact through SciFMs.

Chapter 1: The Physical Universe - FMs for Physics, Materials, and Chemistry

1.1 The Challenge: Navigating Vast Chemical and Physical Possibilities

The fundamental sciences of physics, chemistry, and materials science are confronted by challenges of immense scale and complexity. The primary obstacle in materials and drug discovery is the combinatorially explosive nature of the design space. For example, scientists estimate that the number of potentially stable small molecules relevant for battery electrolytes could be on the order of 10⁶⁰.9 Exploring this "molecular universe" through traditional methods, which rely on a combination of human intuition, trial-and-error experimentation, and computationally expensive simulations, is an inefficient and painstaking process that can take years or even decades to yield a single breakthrough.

Simultaneously, in fundamental physics, our understanding of the universe remains incomplete. The process of deriving governing equations and physical principles from complex, high-dimensional, and often noisy experimental data is a monumental intellectual challenge.12 Human cognition, with its inherent biases and limitations in processing vast datasets, may overlook subtle patterns or correlations that hold the key to new discoveries, while naive machine learning approaches risk latching onto spurious correlations without grasping the underlying causal structure of reality.12

1.2 The Data Landscape: From First Principles to High-Throughput Experiments

The physical sciences benefit from a growing landscape of high-quality, structured data, forming a fertile ground for training SciFMs. A significant portion of this data comes from large-scale computational databases generated using first-principles methods like Density Functional Theory (DFT).16 Publicly accessible repositories such as the Materials Project (MP), the Open Quantum Materials Database (OQMD), and JARVIS-DFT contain hundreds of thousands to millions of calculated structures and their associated properties, providing a clean and consistent training corpus.5

This computational data is complemented by vast archives of experimental results and scientific literature. Chemical databases like PubChem and ChEMBL contain information on tens of millions of molecules.5 Furthermore, modern laboratories are increasingly equipped with high-throughput screening and automated instrumentation that generate streams of experimental and sensor data, providing the real-world grounding necessary to validate and refine computationally derived models.13

1.3 The Foundation Model Opportunity

The application of foundation models to this data landscape opens up three distinct and increasingly ambitious opportunities.

First is predictive modeling. By training on these large databases, SciFMs can learn the intricate and non-linear relationships between the structure of a material or molecule and its emergent properties. Models with architectures like Graph Neural Networks (GNNs) and Transformers, such as GraphCL, MoLFormer, and CrystalLLM, are already being developed to predict a wide range of characteristics, including electronic conductivity, thermodynamic stability, catalytic activity, and toxicity.5 These models can serve as powerful screening tools, allowing researchers to evaluate thousands of candidate compounds in silico, dramatically reducing the time and expense required for physical synthesis and testing.
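
As an illustration of the structure-to-property mapping these models learn, the following is a minimal sketch of graph-based property prediction, assuming PyTorch. The architecture and the random "molecule" are toy stand-ins, not the cited models.

```python
# A toy message-passing network that maps a molecular graph
# (node features + adjacency matrix) to one scalar property.
import torch

class TinyGNN(torch.nn.Module):
    def __init__(self, n_feats: int, hidden: int = 64):
        super().__init__()
        self.embed = torch.nn.Linear(n_feats, hidden)
        self.msg = torch.nn.Linear(hidden, hidden)
        self.readout = torch.nn.Linear(hidden, 1)

    def forward(self, x, adj):
        # x: (n_atoms, n_feats), adj: (n_atoms, n_atoms) bond adjacency
        h = torch.relu(self.embed(x))
        for _ in range(3):                      # 3 rounds of message passing
            h = torch.relu(h + adj @ self.msg(h))
        return self.readout(h.mean(dim=0))      # mean-pool atoms -> property

# Usage on a random 5-atom "molecule" with 8-dimensional atom features
model = TinyGNN(n_feats=8)
x = torch.randn(5, 8)
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.T) > 0).float()               # symmetrize the bond matrix
pred = model(x, adj)                            # predicted scalar property
```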

Second, and more transformative, is inverse design. This approach fundamentally reverses the traditional scientific workflow. Instead of starting with a structure and predicting its properties, an inverse design model starts with a set of desired properties and generates novel, physically viable structures that exhibit them. Generative models, such as the diffusion-based MatterGen for crystal structures and the Transformer-based GP-MoLFormer for molecules, are trained to navigate the vast chemical space and propose new candidates tailored to specific applications, such as designing a material with a target band gap for a semiconductor or a molecule with high binding affinity for a specific protein.5 This shifts the role of the scientist from a manual explorer to the architect of a discovery process.
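
One simple way to realize this inverse direction is to search a learned representation against a frozen property surrogate, as in the hedged sketch below (assuming PyTorch; the surrogate is an untrained placeholder). Generative systems such as MatterGen use diffusion rather than this gradient search, but the workflow direction, from target property to candidate structure, is the same.

```python
# Inverse design as gradient-based search through a frozen property surrogate.
import torch

# Stand-in for a trained structure-to-property model (untrained placeholder here)
surrogate = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
for p in surrogate.parameters():
    p.requires_grad_(False)                     # the surrogate stays fixed

target = torch.tensor([[1.5]])                  # e.g., a desired band gap in eV
z = torch.randn(1, 16, requires_grad=True)      # candidate descriptor vector
opt = torch.optim.Adam([z], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    loss = ((surrogate(z) - target) ** 2).mean()
    loss.backward()
    opt.step()

# z now encodes a candidate whose *predicted* property matches the target;
# it must still be decoded to a structure and verified by simulation or experiment.
```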

The third and most revolutionary opportunity is the autonomous discovery of physical laws. This nascent field aims to create AI systems that can move beyond applying known physics to discovering new physics. Systems like AI-Newton have demonstrated a remarkable proof of concept: by ingesting only raw observational data from simulated experiments (e.g., the positions and times of moving bodies), the system can autonomously formulate fundamental physical concepts like mass, momentum, and kinetic energy, and then use these concepts to rediscover the symbolic, mathematical form of canonical physical laws, including Newton's second law (F=ma) and the law of universal gravitation.17 This represents a paradigm shift. For a scientific discovery to be truly integrated into human knowledge, the explanation is as crucial as the prediction. The ability of these systems to produce interpretable, symbolic outputs—actual equations—rather than just opaque neural network predictions, suggests that the most advanced SciFMs will incorporate a "symbolic grounding" layer. This makes the AI's reasoning legible, its discoveries verifiable, and its output directly integrable into the enduring edifice of scientific theory.
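
The flavor of such symbolic discovery can be illustrated with a far simpler method: sparse regression over a library of candidate terms (assuming NumPy and scikit-learn). Given only noisy observations of masses, accelerations, and forces, the fit recovers the symbolic form F = m·a. This is not AI-Newton's algorithm, merely a minimal demonstration of interpretable, equation-valued output.

```python
# Sparse symbolic regression: a Lasso fit over candidate terms zeroes out
# everything except the true m*a term, yielding a readable equation.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m = rng.uniform(1, 10, 500)                    # masses
a = rng.uniform(-5, 5, 500)                    # accelerations
F = m * a + 0.01 * rng.normal(size=500)        # noisy "measured" forces

# Library of candidate symbolic terms
library = np.column_stack([m, a, m * a, m**2, a**2, m * a**2])
names = ["m", "a", "m*a", "m^2", "a^2", "m*a^2"]

fit = Lasso(alpha=0.1).fit(library, F)
law = " + ".join(
    f"{c:.3f}*{n}" for c, n in zip(fit.coef_, names) if abs(c) > 1e-2
)
print("F =", law)                              # expected: F = ~1.000*m*a
```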

Chapter 2: The Biological Code - FMs for Genomics, Systems Biology, and Neuroethology

2.1 The Challenge: Unraveling the Complexity of Living Systems

Biological systems are arguably the most complex phenomena known to science, characterized by intricate, multi-scale interactions that span from the atomic level of DNA to the emergent behavior of entire organisms. A central challenge in systems biology is understanding metabolism, which is not a series of linear, isolated pathways but a densely interconnected network—a "hairball" of interactions between thousands of genes, proteins, and metabolites that defies simple, rational engineering.21 Predicting the system-wide effects of a single genetic modification remains a formidable task.

In neuroscience, a grand challenge is to understand how neural circuits generate the complex, adaptive behaviors that allow animals to thrive in natural, dynamic environments. This is the core pursuit of neuroethology.23 Traditional laboratory-based neuroscience often relies on highly constrained and simplified tasks, which limits insight into how the brain actually functions in the real world, where it must integrate sensory information, guide motor actions, and learn from experience in an unceasing, complex loop.23

2.2 The Data Landscape: A Multi-Modal Deluge

The life sciences are in the midst of a data explosion, driven by rapid technological advances across multiple fronts. High-throughput sequencing technologies generate petabytes of genomic, transcriptomic, and proteomic data, providing an unprecedented view into the molecular foundations of life.13 In the medical domain, the digitization of healthcare has created vast repositories of electronic health records and clinical trial data, which, when combined with information from consumer wearable devices, offer rich, longitudinal datasets on human health and disease at a population scale.3

In parallel, the field of neuroethology is being transformed by new observational technologies. High-resolution motion capture systems, dense multi-electrode arrays for recording neural activity from freely moving animals, and advanced sensors for environmental monitoring are creating comprehensive, multi-modal datasets that link neural dynamics, motor output, and sensory input with unprecedented fidelity.23 This data provides the raw material for building computational models of the brain in action.

2.3 The Foundation Model Opportunity

Foundation models are uniquely positioned to integrate and learn from this multi-modal biological data, opening new avenues for understanding, engineering, and healing living systems.

In metabolic engineering and synthetic biology, SciFMs can model the entire causal chain from genotype to phenotype. By training on vast datasets of genomic sequences and their corresponding metabolic outputs, these models can learn to predict the complex, system-wide consequences of genetic interventions. This enables a new paradigm of biological design, moving beyond single-gene edits to the rational, multi-variate optimization of entire metabolic networks for the production of valuable chemicals, biofuels, and pharmaceuticals.21 The goal is to transform microbial cells into programmable "factories" that can be engineered with the same predictability as traditional chemical plants.

In precision medicine, SciFMs can serve as powerful engines for integrating diverse patient data streams. A model trained on a combination of a patient's genomics, electronic health records, imaging data, and lifestyle information from wearable devices can generate highly personalized predictions for disease risk, treatment response, and optimal therapeutic strategies.3 Models like NatureLM are being developed with the ambitious goal of creating a unified representation that spans molecules, proteins, and DNA, enabling cross-domain applications such as designing a drug based on a patient's specific genetic makeup.13

A frontier application lies in computational neuroethology. Here, a foundation model can be trained on a comprehensive dataset capturing an animal's complete experience: its sensory inputs, the simultaneous activity of thousands of its neurons, and its resulting motor outputs. Such a model would learn the fundamental "language" of that organism's nervous system. It could be used to predict future behavior from patterns of neural activity, simulate the behavioral effects of specific neural perturbations, and ultimately, uncover the general computational principles that brains—even highly alien, non-mammalian brains like those of cephalopods—use to solve universal problems like navigation, foraging, and social communication.23
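
As a toy illustration of the neural-to-behavior direction, the sketch below (assuming NumPy and scikit-learn, with simulated Poisson spike counts) linearly decodes a two-dimensional movement signal from population activity; real neuroethological models would use far richer multi-modal sequence architectures.

```python
# Linear decoding of simulated behavior from simulated population activity.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_bins, n_neurons = 5000, 120
velocity = rng.normal(size=(n_bins, 2))        # latent behavior: (vx, vy) per time bin
tuning = rng.normal(size=(2, n_neurons))       # each neuron's directional tuning
rates = np.exp(0.5 * velocity @ tuning)        # log-linear firing-rate model
spikes = rng.poisson(rates)                    # observed spike counts

X_tr, X_te, y_tr, y_te = train_test_split(
    spikes, velocity, test_size=0.2, random_state=0
)
decoder = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("held-out decoding R^2:", decoder.score(X_te, y_te))
```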

The convergence of these distinct modeling efforts points toward a more holistic and ambitious long-term goal: the creation of a "digital twin" of an entire organism. By integrating foundation models for genomics, metabolic networks, and neural control, it becomes possible to construct a multi-scale, in-silico simulation of a simple model organism, such as the nematode C. elegans. Such a digital twin would provide an unprecedented experimental platform, allowing scientists to conduct virtual experiments to test hypotheses about everything from the effects of a new drug to the neural basis of a specific behavior. This represents the ultimate synthesis of data-driven biology: a comprehensive, predictive, and executable model of life itself.

Chapter 3: Simulating Complex Systems - FMs for Climate, Turbulence, and Fluid Dynamics

3.1 The Challenge: The Computational Cost of Complexity

Many of the most critical systems in science and engineering—from the flow of air over an aircraft wing to the global climate system—are governed by complex, non-linear partial differential equations (PDEs).10 The Navier-Stokes equations, which describe the motion of fluids, are a prime example. While these equations are known, solving them directly for realistic, three-dimensional, and turbulent scenarios is a task of staggering computational complexity.31

This computational barrier forces practitioners to rely on approximations. In fluid dynamics, methods like Reynolds-Averaged Navier-Stokes (RANS) are used, but they introduce significant modeling errors by simplifying the physics of turbulence.33 In climate science, global models must parameterize crucial small-scale phenomena like cloud formation, leading to uncertainties in long-term projections. Performing a Direct Numerical Simulation (DNS) that resolves all scales of motion is computationally prohibitive for almost all practical engineering and scientific problems, creating a major bottleneck that slows innovation in aerospace design, energy production, and weather forecasting.32

3.2 The Data Landscape: A Firehose of Simulation and Sensor Data

Despite the cost, the scientific community has generated and continues to generate massive datasets that characterize these complex systems. High-fidelity simulations like DNS and Large Eddy Simulation (LES), while too expensive for routine use, can be run for canonical problems to create benchmark datasets of unparalleled accuracy and physical consistency.31 These simulations serve as a form of "computational experiment," providing perfect, complete data that is ideal for training machine learning models.

This simulation data is complemented by vast archives of real-world observational data. Decades of satellite imagery, global networks of weather and climate sensors, and measurements from experimental facilities like wind tunnels provide a continuous stream of information about the Earth's systems and engineering prototypes.33 The development of the Surya heliophysics model, for instance, was made possible by a training dataset comprising nine years of continuous, high-resolution solar observations from NASA's Solar Dynamics Observatory.35 This combination of pristine simulation data and noisy, complex observational data creates a rich and diverse foundation for training SciFMs.

3.3 The Foundation Model Opportunity

Foundation models offer a path to circumventing the computational bottleneck of direct simulation by learning the underlying physics from data.

The primary opportunity is the creation of physics-informed surrogate models. An SciFM can be trained on the input-output pairs from a large number of high-fidelity simulations. Once trained, the model learns a highly accurate mapping from the system's parameters (e.g., an aircraft's shape and speed) to its performance (e.g., the resulting lift and drag). This surrogate model can then make predictions in milliseconds, replacing the need to run a new, hours- or days-long simulation for every design change, thereby enabling real-time analysis and rapid design optimization.13
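
The mechanics of such a surrogate are simple to sketch (assuming PyTorch; the parameters and outputs below are random placeholders standing in for precomputed simulation pairs):

```python
# An MLP surrogate mapping design parameters to simulated performance outputs.
import torch

params = torch.randn(4096, 12)                 # 12 design/flow parameters per case
outputs = torch.randn(4096, 2)                 # e.g., [lift, drag] from a CFD solver

surrogate = torch.nn.Sequential(
    torch.nn.Linear(12, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for epoch in range(100):
    for i in range(0, len(params), 256):       # simple mini-batching
        opt.zero_grad()
        pred = surrogate(params[i:i + 256])
        loss = torch.nn.functional.mse_loss(pred, outputs[i:i + 256])
        loss.backward()
        opt.step()

new_design = torch.randn(1, 12)
lift_drag = surrogate(new_design)              # milliseconds, not hours
```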

A specific and highly impactful application is in turbulence modeling. This has long been considered a "holy grail" of fluid dynamics. An SciFM can be trained on high-fidelity DNS and LES data to learn the complex physics of turbulent eddies. This learned knowledge can then be used to directly correct the known errors of cheaper RANS models or, more ambitiously, to derive entirely new and more accurate turbulence closure models from the data itself.31 This would represent a fundamental breakthrough in our ability to simulate and design systems involving turbulent flows.

In climate and weather forecasting, foundation models are already demonstrating their potential. Models like the Prithvi weather model and the Surya heliophysics model are designed to ingest and process vast amounts of spatio-temporal data from satellites and ground-based sensors.35 By learning complex patterns and long-range dependencies that are difficult for traditional numerical models to capture, these SciFMs can improve the accuracy of forecasts for everything from short-term weather patterns to long-term climate change impacts and space weather events like solar flares, which pose a risk to satellites and power grids.35

A profound implication of this technological shift is the democratization of high-fidelity simulation. Currently, the ability to perform large-scale, high-fidelity simulations is a strategic advantage held by large corporations, government labs, and well-funded research institutions with access to supercomputing resources.9 A foundation model, once trained on such a resource, can be deployed and run (a process called inference) at a small fraction of the computational cost. This means that a small engineering startup could leverage a pre-trained turbulence SciFM to achieve aerodynamic design insights that were previously accessible only to a major aerospace manufacturer. This leveling of the technological playing field could dramatically accelerate the pace of innovation across the entire economy, from renewable energy to advanced manufacturing.

Chapter 4: Emergent Social Dynamics - FMs for Economics and Social Science

4.1 The Challenge: Bridging the Micro-Macro Gap

The social sciences grapple with a fundamental challenge known as the "micro-macro gap".36 This refers to the difficulty of understanding how complex, large-scale social phenomena—such as the formation of social norms, the crash of a stock market, or the mobilization of a protest movement—emerge from the decentralized interactions of millions of individual agents.36 Human societies are complex adaptive systems, and their behavior is often non-linear, unpredictable, and counter-intuitive. Traditional modeling approaches, whether based on aggregate statistical analysis or simplified theoretical models, often fail to capture the rich, dynamic feedback loops between individual behavior and collective outcomes.

4.2 The Data Landscape: Digital Traces of Human Interaction

The data landscape for this domain is unique in that it can leverage the same kind of massive, unstructured textual data from the internet that powers conventional LLMs. This provides a rich source of information on human communication, beliefs, and culture. In addition to text, this field utilizes large, structured datasets from economic transactions, financial markets, demographic surveys, and controlled laboratory experiments designed to probe human decision-making and social behavior.38 This blend of structured and unstructured data provides a comprehensive, though often noisy, record of human social and economic life.

4.3 The Foundation Model Opportunity

The most exciting opportunity in this domain involves a radical new application of foundation models, particularly LLMs. Instead of being used as single, monolithic entities, they can serve as the cognitive "engines" for a large population of autonomous agents, creating sophisticated, large-scale agent-based models (ABMs).38 By endowing each agent in a simulation with a copy of an LLM, researchers can give them distinct goals, memories, behavioral heuristics, and the ability to communicate and reason using natural language. This allows for the creation of in-silico societies that are far more realistic and behaviorally rich than traditional ABMs.

Recent experiments have already demonstrated the power of this approach for simulating social convention formation. When placed in a simulated environment where they are rewarded for coordinating with each other, populations of LLM agents have been shown to spontaneously develop and universally adopt shared linguistic conventions and social norms, purely through decentralized interactions.38 These simulations provide a powerful, controllable, and repeatable experimental testbed for theories of social dynamics that have historically been difficult to verify empirically.
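
The underlying dynamic can be illustrated without LLMs at all: the classic "naming game" below (plain Python, with simple table-based agents standing in for LLM agents) shows a population converging on a single shared name purely through locally rewarded interactions.

```python
# A naming game: paired agents are rewarded for coordinating on a name,
# and a single shared convention typically emerges from local interactions.
import random

random.seed(0)
NAMES = ["zog", "blik", "mura", "tann"]
# Each agent scores every candidate name; small random init breaks ties
agents = [{n: 0.01 * random.random() for n in NAMES} for _ in range(50)]

def pick(agent):
    # Mostly exploit the best-scoring name, occasionally explore
    if random.random() < 0.1:
        return random.choice(NAMES)
    return max(agent, key=agent.get)

for _ in range(20000):
    a, b = random.sample(agents, 2)
    name_a, name_b = pick(a), pick(b)
    reward = 1.0 if name_a == name_b else -1.0  # coordination payoff
    a[name_a] += reward
    b[name_b] += reward

consensus = [max(agent, key=agent.get) for agent in agents]
print({n: consensus.count(n) for n in NAMES})   # one name typically dominates
```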

This methodology can be extended to explore complex economic and political dynamics. Agent-based simulations populated by LLM agents can be used to model the behavior of entire economies, supply chains, or political systems.36 By observing the emergent, macro-level behavior of the agent population—such as the formation of asset bubbles, the propagation of supply chain shocks, or the spread of misinformation—researchers can test the potential impacts of policy interventions and explore the underlying drivers of systemic phenomena in a controlled virtual environment.

Perhaps the most profound potential of this approach is to use AI as a "computational microscope" for society. One of the striking findings from recent research is that populations of individually unbiased LLM agents can, through their interaction dynamics, give rise to strong collective biases and polarized outcomes.38 This is a deeply significant and non-obvious result. It demonstrates that these FM-powered ABMs can be used to study the root causes of societal-level problems, like systemic bias or political polarization, that are not simply reducible to the psychology or intentions of individuals. This new tool allows social scientists to probe how these critical social challenges can emerge from the structure of our interactions and institutions, providing a new path toward understanding and potentially mitigating them.


Part III: A Curated Portfolio of 100 Foundation Model Concepts

This section presents the core deliverable of the report: a curated portfolio of 100 high-potential foundation model concepts. The list has been developed based on the sector-by-sector analysis in Part II, with a focus on concepts that are specific, possess transformative potential, and are feasible given the current trajectory of AI technology and data generation capabilities. This portfolio is intended to serve as a menu of actionable R&D targets and investment opportunities for stakeholders seeking to lead the development of the next generation of AI.

Table 2: Curated Portfolio of 100 Foundation Model Concepts

Physical Universe: Physics, Materials, and Chemistry

Materials Science

  1. ElectrolyteExplorer - A generative foundation model conditioned on properties like ionic conductivity and electrochemical stability. It will be trained on the Materials Project database to propose novel solid-state electrolyte materials for next-generation batteries.

  2. CatalystGen - An inverse design model that takes a target chemical reaction and desired efficiency as input. It will generate novel catalyst surfaces and molecular structures by exploring the vast chemical space of potential catalytic materials.

  3. SuperconductorSeeker - A predictive model trained on experimental data and theoretical calculations for known superconducting materials. It will screen novel compounds to identify candidates with high critical temperatures, accelerating the search for room-temperature superconductors.

  4. PolymerArchitect - A generative model for designing polymers with bespoke mechanical, thermal, and chemical properties. It will predict polymer performance based on monomer composition and chain architecture for applications in advanced manufacturing and sustainable plastics.

  5. ThermoelectricDesigner - An inverse design model focused on discovering materials with high thermoelectric figures of merit (ZT). It will generate novel semiconductor compounds for efficient waste heat recovery and solid-state cooling applications.

  6. PhotovoltaicOptimizer - A model trained on a comprehensive database of organic and perovskite solar cell materials. It will predict power conversion efficiency and long-term stability to guide the synthesis of next-generation photovoltaic technologies.

  7. MOFBuilder - A generative model for designing novel Metal-Organic Frameworks (MOFs) with tailored pore geometries and chemical functionalities. It will be used to create materials optimized for carbon capture, hydrogen storage, and chemical separations.

  8. CrystalPredictor-XRD - A model that learns to solve the phase problem in crystallography. It will take raw X-ray diffraction (XRD) pattern data as input and output the most probable crystal structures, automating a key bottleneck in materials characterization.

  9. AlloyForge - A predictive model trained on metallurgical data to forecast the properties of complex alloys, such as strength, corrosion resistance, and performance at high temperatures. It will accelerate the design of new superalloys for aerospace and energy applications.

  10. QuantumDotDesigner - A generative model for designing colloidal quantum dots with specific photoluminescent properties. It will predict emission spectra based on composition, size, and surface chemistry for advanced display and bio-imaging technologies.

Chemistry

  1. RetroSynthAI - A foundation model for retrosynthesis that deconstructs a target molecule into simpler, commercially available precursors. It will learn from the entire corpus of chemical reaction literature to propose novel and efficient synthesis routes.

  2. ReactionKineticsOracle - A predictive model that estimates the reaction rates and activation energies for chemical transformations. It will be trained on computational chemistry data to help chemists optimize reaction conditions without extensive experimentation.

  3. SolventSelector - A model that predicts the effect of different solvents on reaction yield, selectivity, and rate. It will recommend optimal solvent systems to improve the efficiency and sustainability of chemical processes.

  4. DrugDiscoverer - A generative model for de novo drug design that creates novel molecules with high predicted binding affinity to a specific biological target and favorable ADMET properties. It will integrate structural biology and bioactivity data to accelerate lead optimization.

  5. ToxPredict - A foundation model trained on a massive database of toxicological studies. It will predict the potential toxicity of any given molecule to humans and the environment, enabling early-stage safety assessment in drug and materials development.

  6. SpectraInterpreter - A multi-modal model that interprets complex analytical data from techniques like NMR, Mass Spectrometry, and IR spectroscopy. It will automatically elucidate the chemical structure of unknown compounds from their spectral fingerprints.

  7. ProteinFolder-Alpha - An advanced protein structure prediction model trained on the full Protein Data Bank and metagenomic sequence databases. It will predict the 3D structure of proteins and protein complexes from their amino acid sequences with atomic-level accuracy.

  8. EnzymeEvolver - A model that simulates the process of directed evolution for enzymes. It will predict the functional effects of mutations to guide the engineering of novel biocatalysts with enhanced activity, stability, and specificity.

Physics

  1. QuantumNewton - An extension of the AI-Newton concept trained on experimental data from quantum mechanical systems. Its objective is to autonomously discover novel concepts and symbolic representations of quantum phenomena, potentially identifying patterns that hint at physics beyond the Standard Model.

  2. CosmoSim-AI - A surrogate model trained on large-scale cosmological N-body simulations. It will provide rapid predictions of large-scale structure formation, such as the distribution of dark matter halos, for a given set of cosmological parameters.

  3. ParticleColliderAnalyst - A model trained on petabytes of data from particle colliders like the LHC. It will be designed to perform real-time event classification and anomaly detection to search for new particles and rare physical processes.

  4. PlasmaControl - A reinforcement learning-based foundation model for controlling plasma instabilities in real-time within a tokamak fusion reactor. It will learn control policies from simulation data to maintain stable, high-performance fusion plasmas.

  5. AstroLens - A model that analyzes astronomical survey data to automatically detect and model gravitational lensing events. It will be used to map the distribution of dark matter in the universe and test theories of gravity.

  6. StandardModelValidator - An unsupervised model trained on all known particle interaction data. Its purpose is to identify subtle deviations from the predictions of the Standard Model, pointing physicists toward areas where new physics may be discovered.

  7. FluidMechanica - A general-purpose surrogate model for fluid dynamics, pre-trained on a vast and diverse library of canonical flow problems. It will be fine-tunable for specific engineering applications, from aerodynamics to hydraulics.

The Biological Code: Genomics, Systems Biology, and Neuroethology

Genomics/Proteomics

  1. GeneRegulatorNet - A model that infers complete gene regulatory networks from single-cell RNA sequencing and ATAC-seq data. It will predict how transcription factors and non-coding DNA elements control gene expression in different cell types.

  2. EpiGenomeMapper - A model that predicts the functional consequences of epigenetic modifications like DNA methylation and histone acetylation. It will help decipher how the epigenome regulates cellular identity and contributes to disease.

  3. VariantInterpreter - A foundation model trained on population-scale genomic data and clinical records. It will predict the pathogenicity of novel genetic variants, aiding in the diagnosis of rare genetic diseases.

  4. RNA-Struct - A model that predicts the three-dimensional structure and function of RNA molecules from their sequence. It will be crucial for understanding the roles of non-coding RNAs and for designing RNA-based therapeutics.

  5. Proteome-Interactome - A model that predicts the complete network of protein-protein interactions within a cell. It will use sequence, structure, and expression data to map the cellular machinery underlying biological processes.

  6. CRISPR-GuideDesigner - A model that designs optimal guide RNAs for CRISPR-based gene editing. It will predict both on-target efficiency and off-target effects to improve the safety and efficacy of gene therapies.

  7. VirusEvolve - A foundation model trained on viral genomic sequences and epidemiological data. It will predict the evolutionary trajectories of viruses like influenza and coronaviruses, forecasting the emergence of new, potentially pandemic-causing variants.

  8. Microbiome-Host - A model that learns the complex interactions between the human gut microbiome and host health. It will predict how changes in microbial composition affect metabolism, immunity, and disease risk.

Systems Biology

  1. MetabolomeOracle - A predictive foundation model trained on multi-omics data to simulate the complete metabolic network of E. coli. It will predict the metabolic flux and product yield resulting from specific genetic interventions, accelerating metabolic engineering cycles.

  2. YeastFactory - A digital twin of the Saccharomyces cerevisiae (baker's yeast) cell. It will be used to design and optimize metabolic pathways for the industrial production of pharmaceuticals, chemicals, and biofuels.

  3. CellCycleSim - A dynamic model of the eukaryotic cell cycle. It will predict how perturbations to key regulatory proteins affect cell division, providing insights into cancer biology and regenerative medicine.

  4. SignalingPathwayDecoder - A model that reconstructs cellular signaling pathways from phosphoproteomic and transcriptomic data. It will map how cells process information and make decisions in response to external stimuli.

  5. SyntheticCircuitDesigner - A generative model for designing synthetic genetic circuits with predictable behavior. It will enable the engineering of cells with novel functions, such as biosensors or therapeutic delivery systems.

  6. BiofuelOptimizer - A model focused on the metabolic engineering of photosynthetic organisms like algae and cyanobacteria. It will design genetic modifications to maximize the production of advanced biofuels from sunlight and CO2.

  7. OrganoidGenesis - A model that simulates the self-organization and development of stem cells into organoids. It will help researchers understand tissue formation and create better in-vitro models for disease and drug testing.

  8. Immunome-AI - A comprehensive simulation of the human immune system. It will predict the response to pathogens and vaccines, and model the dynamics of autoimmune diseases and immunotherapies.

  9. TissueEngineer - A model that optimizes the conditions for tissue engineering, including scaffold design, growth factor cocktails, and mechanical stimuli. It will guide the development of lab-grown tissues and organs for transplantation.

Neuroethology

  1. CephaloMind - A foundation model of the cephalopod brain, trained on neural and behavioral data from octopus and cuttlefish. It will aim to understand the principles of their distributed, non-mammalian intelligence and sophisticated camouflage abilities.

  2. AvianNavigate - A model of the neural circuits underlying bird navigation. It will integrate data on head direction cells, grid cells, and magnetoreception to understand how birds perform long-distance migrations.

  3. InsectBrain - A whole-brain emulation of a simpler insect, such as the fruit fly Drosophila. It will serve as a complete, executable model linking genes, neurons, and behavior in a single system.

  4. PrimateSocialCognition - A model trained on neural recordings from primates engaged in social tasks. It will aim to decode the neural basis of complex social behaviors like cooperation, competition, and theory of mind.

Neuroscience

  1. MotorCortex-Decoder - A foundation model for brain-computer interfaces that translates neural activity from the motor cortex into control signals for prosthetic limbs or computers. It will learn a general representation of motor intent that adapts quickly to new users.

  2. MemoryTrace - A model of synaptic plasticity and memory engram formation in the hippocampus. It will simulate how memories are encoded, consolidated, and recalled at the circuit level.

  3. SensoryIntegrator - A model of how the brain integrates information from multiple sensory modalities (e.g., vision, hearing, touch). It will be trained on neural responses to multi-sensory stimuli to understand the principles of perception.

  4. SleepRhythm - A model of the neural circuits in the brainstem and hypothalamus that govern sleep-wake cycles. It will simulate the dynamics of sleep stages and their role in memory consolidation and brain health.

Complex Systems Simulation: Climate, Turbulence, and Engineering

Climate Science

  1. GeoSurrogate-Climate - A high-fidelity surrogate for computationally expensive global climate models. It will provide rapid, ensemble-based projections of key climate variables under different emissions scenarios.

  2. OceanCurrents-AI - A predictive model for global ocean circulation patterns, including phenomena like El Niño-Southern Oscillation. It will be trained on satellite altimetry, ocean buoys, and simulation data to improve seasonal forecasts.

  3. AtmoChem - A surrogate model for complex atmospheric chemistry simulations. It will predict the formation and transport of pollutants like ozone and particulate matter to improve air quality forecasting.

  4. Cryosphere-Melt - A model that predicts the dynamics of ice sheets in Greenland and Antarctica. It will learn from satellite data and physical models to provide more accurate projections of future sea-level rise.

  5. CarbonCycle-AI - A data-driven model of the global carbon cycle. It will assimilate satellite and in-situ measurements to quantify carbon fluxes between the atmosphere, oceans, and land ecosystems.

  6. ExtremeWeatherForecaster - A foundation model specifically trained to predict the genesis, intensity, and track of high-impact weather events like hurricanes, tornadoes, and atmospheric rivers. It will learn from decades of historical weather data and high-resolution simulations.

Earth Science

  1. SeismicPredict - A model that analyzes continuous seismic and geodetic data streams to identify subtle precursor patterns to earthquakes. Its goal is to move beyond statistical forecasting to provide probabilistic, short-term risk assessments.

  2. HydroCycle - A model of the global terrestrial water cycle. It will predict soil moisture, groundwater levels, and river flows to improve drought and flood forecasting.

  3. WildfireSpread - A real-time wildfire behavior model that integrates weather forecasts, fuel maps, and topography. It will predict the rate and direction of fire spread to aid in firefighting and evacuation efforts.

  4. SolarCycle-Surya - A foundation model trained on multi-modal solar observation data. It will predict solar flares and coronal mass ejections to improve space weather forecasting and protect critical infrastructure.

Fluid Dynamics

  1. AeroSurrogate-1 - A physics-informed surrogate model trained on a massive dataset of high-fidelity CFD simulations and wind tunnel data for various airfoil geometries. It will provide real-time prediction of aerodynamic forces and flow fields, replacing expensive simulations in early-stage aircraft design.

  2. TurbulenceClosure-AI - A model designed to discover new, more accurate, and generalizable closure models for RANS simulations. It will learn from DNS data to output symbolic equations that represent the Reynolds stresses, a fundamental challenge in fluid mechanics.

  3. CombustionSim - A surrogate model for detailed chemical kinetics in combustion simulations. It will accelerate the design of more efficient and cleaner engines, gas turbines, and rocket propulsion systems.

  4. MultiphaseFlow - A model for simulating complex multiphase flows, such as oil, water, and gas mixtures in pipelines or bubbly flows in chemical reactors. It will learn the dynamics of phase interfaces from experimental and simulation data.

Solid Dynamics

  1. StructuralIntegrity-AI - A predictive model for material fatigue and fracture mechanics. It will forecast the remaining useful life of mechanical components by learning from sensor data and simulation of crack propagation.

Acoustics

  1. AcousticWave - A model for predicting the generation and propagation of sound in complex environments. It will be used for applications ranging from reducing aircraft noise to designing concert hall acoustics.

Industrial Physics

  1. GranularFlow - A model that simulates the behavior of granular materials like sand, grains, and powders. It will be used to optimize industrial processes in agriculture, pharmaceuticals, and manufacturing.

Geophysics

  1. GeoMechanics - A surrogate model for geomechanical simulations. It will predict subsurface stress, deformation, and fracture propagation for applications in geothermal energy, carbon sequestration, and resource extraction.

Energy Systems

  1. GridStability-AI - A foundation model of a national power grid that predicts grid stability and cascading failure risk in real-time. It will be trained on sensor data from across the grid to manage the integration of intermittent renewable energy sources.

Engineering Systems

  1. SupplyChain-Opt - A digital twin of global supply chains. It will simulate the flow of goods and identify vulnerabilities to disruptions from geopolitical events, climate change, or pandemics.

  2. UrbanMobility - A city-scale agent-based model of traffic and public transit. It will be used by urban planners to test the impact of new infrastructure, transportation policies, and autonomous vehicle deployment.

  3. ManufacturingProcess-Twin - A digital twin for complex manufacturing processes, such as semiconductor fabrication or biopharmaceutical production. It will use sensor data to predict yield, optimize process parameters, and perform predictive maintenance.

  4. BuildingEnergy-Mod - A model that predicts the energy consumption of commercial and residential buildings. It will be used to design more efficient buildings and optimize the operation of HVAC systems.

  5. ReservoirSim - A surrogate model for petroleum reservoir simulations. It will rapidly predict oil and gas production under different operational strategies to maximize resource recovery.

  6. BatteryLifecycle - A model that predicts the degradation and aging of batteries over their lifetime. It will be used to optimize battery management systems for electric vehicles and grid storage, extending their lifespan and performance.

Emergent Social Dynamics: Economics, Social Science, and Human-System Interaction

Economics

  1. MarketSim - An agent-based foundation model populated by millions of LLM agents representing consumers, producers, and investors with distinct goals and behavioral heuristics. It will be used to simulate emergent market phenomena like asset bubbles and crashes.

  2. MacroEcon-AI - A foundation model that simulates the entire economy of a nation or region. It will be used to forecast the impact of fiscal and monetary policy changes on GDP, inflation, and unemployment.

  3. SystemicRisk-Detector - A model of the interbank lending network and financial system. It will identify institutions that are "too connected to fail" and simulate how shocks can propagate through the system, causing financial crises.

  4. ConsumerBehavior-ABM - An agent-based model that simulates consumer purchasing decisions and the adoption of new products. It will be trained on market data to predict how trends and fads emerge and spread through a population.

  5. TradeFlow-AI - A dynamic model of the global trade network. It will predict how tariffs, trade agreements, and geopolitical events alter the flow of goods and impact national economies.

  6. LaborMarket-Dynamics - An agent-based simulation of the labor market, with agents representing workers and firms. It will be used to study the effects of automation, minimum wage laws, and education policies on employment and inequality.

  7. CryptoEcon - A model for simulating the economic dynamics and stability of decentralized finance (DeFi) protocols and cryptocurrency ecosystems. It will be used to stress-test protocols for vulnerabilities and emergent failure modes.

  8. AuctionTheorist - A model that learns to design optimal auction mechanisms for specific environments. It will be used for applications like spectrum auctions and online advertising markets.

  9. FirmEvolution - An agent-based model where agents are firms competing in a market. It will simulate how industries evolve over time through innovation, competition, and strategic interaction.

  10. DevelopmentEcon-ABM - An agent-based model for studying economic development. It will simulate how factors like education, infrastructure, and institutional quality can help or hinder a region's escape from poverty traps.

Social Science

  1. NormFormation-AI - A multi-agent simulation that models how social norms and conventions, from linguistic conventions to moral norms, emerge and stabilize in a population through local interactions. It will be used to test theories of cultural evolution.

  2. OpinionDynamics - A model of how opinions and beliefs spread and evolve within a social network. It will be used to study the drivers of political polarization and the formation of echo chambers.

  3. CollectiveAction-Sim - An agent-based model designed to simulate the conditions under which collective action, such as protests or social movements, emerges. It will explore the roles of social networks, grievances, and critical mass dynamics.

  4. UrbanSegregation-ABM - An advanced simulation of residential segregation in cities, extending classic models like Schelling's. It will incorporate realistic agent behaviors and economic constraints to understand the drivers of and solutions to segregation.

  5. GovernanceAI - A simulation environment for comparing the stability and outcomes of different systems of governance. It will model how different voting rules, institutional structures, and constitutional arrangements affect political outcomes.

  6. InfoWarfare-Detector - A model that simulates the spread of misinformation and disinformation campaigns through social media networks. It will be used to understand their dynamics and test the effectiveness of different mitigation strategies.

  7. CulturalEvolution - A model that simulates the long-term evolution of cultural traits, such as languages, technologies, and social structures. It will explore how demographic and environmental factors shape human cultural diversity.

  8. VoterModel-AI - An agent-based model of voter behavior. It will simulate election outcomes based on demographic data, social influence, and campaign effects to understand the dynamics of democratic elections.

  9. InstitutionalDesign - A model that allows for the in-silico testing of new institutional designs, such as market regulations or international treaties. It will predict the likely behavioral responses and emergent outcomes of different rule sets.

  10. SocialNetwork-Evolve - A model that simulates the co-evolution of individual attributes and social network structure. It will explore how phenomena like friendship formation and social status dynamics unfold over time.

Human-System Interaction

  1. AI-Collaboration-Sim - An agent-based model populated by both human and AI agents. It will be used to study the emergent dynamics of human-AI teams and identify principles for designing effective collaborative intelligence systems.

  2. PlatformEcology - A model of the dynamics of online platforms like social media or e-commerce sites. It will simulate the interactions between users, content creators, and platform algorithms to understand the health and stability of digital ecosystems.

  3. GigEconomy-ABM - An agent-based model of the gig economy, with agents representing workers, consumers, and platform companies. It will be used to study issues of wage dynamics, labor supply, and the impact of algorithmic management.

  4. TrustDynamics - A model that simulates how public trust in institutions, technologies, and media evolves over time. It will explore the factors that lead to the erosion or building of social trust.

  5. EthicalAI-ABM - A multi-agent simulation where AI agents learn and evolve their behaviors. It will be used as a testbed to study how ethical or unethical collective AI behavior can emerge, even from simple individual rules, informing AI safety research.


Part IV: Strategic Imperatives for R&D and Data Acquisition

The successful development of the foundation models outlined in this report is not merely a question of algorithmic innovation; it is fundamentally a challenge of data infrastructure and organizational strategy. The primary limiting factor for the advancement of SciFMs is the availability of large-scale, high-quality, and well-structured scientific data.2 Unlike the web-scale text and image data that fueled the generative AI boom, scientific data is often expensive to generate, difficult to standardize, and siloed within individual labs, institutions, or proprietary databases. To overcome this grand challenge and unlock the future of AI-driven science, a concerted, multi-faceted strategy is required. This strategy rests on three core pillars.

4.1 A Three-Pillar Strategy for Data Dominance

A comprehensive research and development plan must be implemented to collect sufficient high-quality data to begin training these new classes of foundation models. This plan begins with the establishment of large-scale, centralized "Data Commons" for key scientific domains, which will serve as the foundational infrastructure for model training and require extensive public-private partnerships to create standards for data sharing and management.2 Concurrently, a "simulation-experimentation flywheel" must be built, where high-fidelity simulations generate massive, clean datasets for pre-training models, and these models, in turn, guide more efficient physical experiments for validation, creating a virtuous cycle of data generation and discovery. Finally, this entire effort must be driven by a new organizational paradigm of cross-disciplinary "fusion" teams, deeply integrating domain scientists with machine learning and high-performance computing experts to ensure the models are both scientifically rigorous and computationally feasible.1

4.1.1 Pillar 1: Architecting the Scientific Data Commons

The foundational imperative is to treat scientific data as a public good and a core piece of research infrastructure. This requires a strategic shift away from fragmented, project-specific data collection toward the creation of large-scale, centralized, and standardized "Data Commons" for key scientific domains.40 These platforms must go beyond simple data storage; they must be architected as integrated environments that aggregate data from myriad public and private sources and provide the necessary cloud-based, high-performance computing resources for massive-scale model training. In this model, the data itself becomes part of the shared cloud infrastructure, as essential as storage or networking.40 Successfully building these commons will necessitate significant public-private partnerships and international collaboration to establish and enforce robust standards for data quality, metadata, sharing protocols, management, and reuse, ensuring the data is not only accessible but also usable for training the next generation of SciFMs.2
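
What might such standards look like at the level of an individual dataset? The sketch below shows one possible machine-readable record a commons could mandate for every deposited dataset. The schema, field names, and example values are illustrative assumptions (loosely inspired by the FAIR principles of findability, accessibility, interoperability, and reusability), not an existing standard.

```python
# A hypothetical dataset record for a scientific Data Commons; every field
# name and example value here is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DatasetRecord:
    dataset_id: str       # globally unique, resolvable identifier (findable)
    title: str
    domain: str           # e.g. "materials", "genomics", "climate"
    license: str          # explicit reuse terms (reusable)
    access_url: str       # standardized retrieval endpoint (accessible)
    schema_version: str   # versioned shared vocabulary (interoperable)
    provenance: list[str] = field(default_factory=list)  # instruments, codes, papers
    checksum_sha256: str = ""  # integrity check for large binary payloads

record = DatasetRecord(
    dataset_id="doi:10.0000/example-commons.12345",  # hypothetical identifier
    title="DFT relaxations for candidate battery cathode materials",
    domain="materials",
    license="CC-BY-4.0",
    access_url="https://commons.example.org/datasets/12345",  # hypothetical URL
    schema_version="1.0",
    provenance=["plane-wave DFT workflow; code and settings recorded in metadata"],
)
```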

4.1.2 Pillar 2: Building the Simulation-Experimentation Flywheel

The second pillar focuses on creating a self-reinforcing, virtuous cycle that dramatically accelerates the rate of data generation and scientific discovery. This "simulation-experimentation flywheel" leverages the complementary strengths of computational modeling and physical experimentation. The cycle begins with high-fidelity simulations (e.g., direct numerical simulation in fluid dynamics, density functional theory in materials science) generating vast, clean, and physically consistent datasets that are ideal for the initial pre-training of foundation models. The pre-trained models, now imbued with a foundational understanding of the system, are then used to make rapid predictions that guide more efficient and targeted physical experiments. Instead of exploring a vast parameter space blindly, experimental resources are focused on the most promising and informative areas of inquiry identified by the model. The data from these targeted experiments is then used to validate, fine-tune, and further improve the model, which in turn enables even more powerful simulations, completing and accelerating the flywheel.
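
Concretely, each turn of the flywheel can be viewed as an active-learning loop. The sketch below uses a Gaussian-process surrogate and a UCB-style acquisition rule as stand-ins, with `simulate` and `run_experiments` as placeholders for a simulation code and a laboratory workflow; all names and choices are illustrative assumptions rather than a prescribed design.

```python
# A minimal active-learning rendering of the simulation-experimentation
# flywheel. `simulate` and `run_experiments` are user-supplied stand-ins.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def flywheel(candidate_pool, simulate, run_experiments, turns=3, budget=10):
    # candidate_pool: (n_candidates, n_features) array of designs to explore.
    model = GaussianProcessRegressor()
    # 1. Pre-train on cheap, clean, physically consistent simulation data.
    X, y = candidate_pool, simulate(candidate_pool)
    for _ in range(turns):
        model.fit(X, y)
        # 2. Rank candidates; spend the experimental budget only on the most
        #    promising/informative points (optimistic UCB-style acquisition).
        #    (A fuller version would exclude already-measured candidates.)
        mean, std = model.predict(candidate_pool, return_std=True)
        chosen = np.argsort(mean + 1.96 * std)[-budget:]
        # 3. Targeted physical experiments validate the model's picks...
        X_exp = candidate_pool[chosen]
        y_exp = run_experiments(X_exp)
        # 4. ...and their results update the surrogate, closing the loop.
        X, y = np.vstack([X, X_exp]), np.concatenate([y, y_exp])
    return model
```

The uncertainty term in the acquisition rule is what steers scarce experimental resources toward informative regions of the parameter space, rather than toward points the model already predicts confidently.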

4.1.3 Pillar 3: Cultivating Cross-Disciplinary "Fusion" Teams

The third and final pillar is organizational. The complexity of building and validating SciFMs demands a new model of scientific collaboration that breaks down traditional disciplinary silos. Progress will be fastest in organizations that create deeply integrated "fusion" teams, co-locating domain scientists (e.g., physicists, biologists, chemists), machine learning researchers, and high-performance computing engineers.1 In this model, the development process is not a linear handoff from one group to the next. Instead, all three areas of expertise are brought to bear simultaneously. The domain scientist ensures the model's inputs, constraints, and outputs are scientifically meaningful; the machine learning researcher designs the model architecture and training procedures; and the HPC engineer ensures the entire workflow can scale efficiently on modern supercomputing hardware. This collaborative structure is essential for navigating the fundamental trade-offs in SciFM design and for ensuring that the resulting models are not just computationally powerful but also scientifically valid and impactful.


Conclusion: The Future of AI-Driven Science

The advent of Scientific Foundation Models marks a pivotal moment in the history of science and technology. The analysis presented in this report indicates that we are moving beyond an era where AI is merely a tool for data analysis and into one where it becomes a genuine partner in discovery. The opportunities are not incremental; they are transformative. We are on the cusp of developing AI systems that can design novel materials for clean energy, engineer microorganisms to produce life-saving drugs, provide early warnings for extreme weather events, and even discover the fundamental laws that govern our universe.

Realizing this future, however, is not inevitable. It requires a strategic and sustained commitment from leaders across academia, industry, and government. The romantic image of the lone scientific genius is being replaced by a new reality where breakthroughs are achieved by collaborative, cross-disciplinary teams leveraging vast computational resources and shared data infrastructure. The primary bottleneck is no longer a lack of computational power or algorithmic ingenuity, but a deficit of large-scale, high-quality, accessible scientific data.

Therefore, the path forward is clear. The call to action is to make the bold, long-term investments necessary to build the open data commons, foster the simulation-experimentation flywheels, and cultivate the fusion teams that will power this new scientific revolution. By embracing this Fourth Paradigm and harnessing the power of Scientific Foundation Models, we can accelerate the pace of human progress and unlock solutions to some of the most pressing challenges of our time. This is not simply the next chapter in the story of artificial intelligence; it is the beginning of a fundamental redefinition of the scientific method itself.

Works cited

  1. The Fourth Paradigm: Data-Intensive Scientific Discovery - Microsoft Research, accessed August 23, 2025, https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/
  2. The Future of Science Policy: Data-Intensive Research - Number Analytics, accessed August 23, 2025, https://www.numberanalytics.com/blog/future-of-science-policy-data-intensive-research
  3. The data-intensive research paradigm: challenges and responses in clinical professional graduate education - PMC, accessed August 23, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC11842464/
  4. The Future of Data Science: Emerging Technologies and Trends - University of the Cumberlands, accessed August 23, 2025, https://www.ucumberlands.edu/blog/the-future-of-data-science-emerging-technologies-and-trends
  5. A Perspective on Foundation Models in Chemistry | JACS Au, accessed August 23, 2025, https://pubs.acs.org/doi/10.1021/jacsau.4c01160
  6. What are Foundation Models? - Foundation Models in Generative AI Explained - AWS, accessed August 23, 2025, https://aws.amazon.com/what-is/foundation-models/
  7. Foundation model - Wikipedia, accessed August 23, 2025, https://en.wikipedia.org/wiki/Foundation_model
  8. On the Opportunities and Risks of Foundation Models - Stanford CRFM, accessed August 23, 2025, https://crfm.stanford.edu/assets/report.pdf
  9. Building AI Foundation Models to Accelerate the Discovery of New Battery Materials, accessed August 23, 2025, https://www.hpcwire.com/2025/08/19/building-ai-foundation-models-to-accelerate-the-discovery-of-new-battery-materials/
  10. AI and the Language of Mathematics: How Artificial Intelligence is Unlocking the Universe's Most Complex Problems - Leon Tyron, Medium, accessed August 23, 2025, https://medium.com/@leontyron/ai-and-the-language-of-mathematics-how-artificial-intelligence-is-unlocking-the-universes-most-7db2258f9af8
  11. The End of Physics? AI Is Discovering New Laws of the Universe - Without Us - Leximancer, accessed August 23, 2025, https://www.leximancer.com/blog/0lu21hnlp0ho7z1qxvs14jsrpx94op
  12. Discovery of Physics From Data: Universal Laws and Discrepancies - PMC, accessed August 23, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC7861345/
  13. Foundation Models Shift Paradigms for Engineering and Energy - JPT/SPE, accessed August 23, 2025, https://jpt.spe.org/foundation-models-shift-paradigms-for-engineering-and-energy
  14. Archetype AI's Newton Model Masters Physics From Raw Data - HPCwire, accessed August 23, 2025, https://www.hpcwire.com/2024/10/28/archetype-ais-newton-model-masters-physics-from-raw-data/
  15. Can AI Discover New Laws of Physics? A Thought Experiment in Quantum Weirdness - Sevak Avakians, Medium, accessed August 23, 2025, https://medium.com/@sevakavakians/can-ai-discover-new-laws-of-physics-a-thought-experiment-in-quantum-weirdness-a373d369858e
  16. Towards Foundation Models for Materials Science: The Open MatSci ML Toolkit - arXiv, accessed August 23, 2025, https://arxiv.org/pdf/2310.07864
  17. AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge - arXiv, accessed August 23, 2025, https://arxiv.org/html/2504.01538v1
  18. (PDF) AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge - ResearchGate, accessed August 23, 2025, https://www.researchgate.net/publication/390440166_AI-Newton_A_Concept-Driven_Physical_Law_Discovery_System_without_Prior_Physical_Knowledge
  19. AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge - AI Research Paper Details, accessed August 23, 2025, https://www.aimodels.fyi/papers/arxiv/ai-newton-concept-driven-physical-law-discovery
  20. [2504.01538] AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge - arXiv, accessed August 23, 2025, https://arxiv.org/abs/2504.01538
  21. Modeling for understanding and engineering metabolism - QRB Discovery, Cambridge Core, accessed August 23, 2025, https://www.cambridge.org/core/journals/qrb-discovery/article/modeling-for-understanding-and-engineering-metabolism/18553F7A257B68AB6403E5D4551E3B65
  22. The future of metabolic engineering and synthetic biology: Towards a systematic practice - MIT Open Access Articles, accessed August 23, 2025, https://dspace.mit.edu/bitstream/handle/1721.1/99397/Stephanopoulos_Future%20metabolic.pdf?sequence=1&isAllowed=y
  23. Computational Neuroethology: Simulating Natural Behaviors - Frontiers, accessed August 23, 2025, https://www.frontiersin.org/research-topics/71920/computational-neuroethology-simulating-natural-behaviors
  24. Neuroethology - Wikipedia, accessed August 23, 2025, https://en.wikipedia.org/wiki/Neuroethology
  25. Computational Neuroethology: A Call to Action - ResearchGate, accessed August 23, 2025, https://www.researchgate.net/publication/336399298_Computational_Neuroethology_A_Call_to_Action
  26. Computational Neuroethology Unit - Okinawa Institute of Science ..., accessed August 23, 2025, https://www.oist.jp/research/research-units/cne
  27. Parts plus pipes: synthetic biology approaches to metabolic engineering - PubMed Central, accessed August 23, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC3293987/
  28. Session 6: Synthetic Biology and Metabolic Engineering - iBiology, accessed August 23, 2025, https://www.ibiology.org/sessions/session-6-synthetic-biology-metabolic-engineering/
  29. Synthetic biology: A foundation for multi-scale molecular biology - PMC - PubMed Central, accessed August 23, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC3037580/
  30. Advancement of Metabolic Engineering Assisted by Synthetic Biology - MDPI, accessed August 23, 2025, https://www.mdpi.com/2073-4344/8/12/619
  31. Machine learning for turbulence modeling - Monolith AI, accessed August 23, 2025, https://www.monolithai.com/blog/machine-learning-for-turbulence-modeling
  32. Can Artificial Intelligence Accelerate Fluid Mechanics Research? - MDPI, accessed August 23, 2025, https://www.mdpi.com/2311-5521/8/7/212
  33. AI for Fluid Mechanics - TU Delft, accessed August 23, 2025, https://www.tudelft.nl/en/ae/organisation/departments/flow-physics-and-technology/aerodynamics/research/ai-for-fluid-mechanics
  34. How Will AI Impact Computational Fluid Dynamics? - Resolved Analytics, accessed August 23, 2025, https://www.resolvedanalytics.com/ai-in-cfd/how-will-ai-impact-cfd
  35. IBM and NASA Release Groundbreaking Open-Source AI Model on Hugging Face to Predict Solar Weather and Help Protect Critical Technology, accessed August 23, 2025, https://newsroom.ibm.com/2025-08-20-ibm-and-nasa-release-groundbreaking-open-source-ai-model-on-hugging-face-to-predict-solar-weather-and-help-protect-critical-technology
  36. Understanding Emergent Social Phenomena Comparatively: The Need for Computational Simulation - ResearchGate, accessed August 23, 2025, https://www.researchgate.net/publication/255556995_Understanding_Emergent_Social_Phenomena_Comparatively_The_Need_for_Computational_Simulation
  37. System Theoretic Foundations for Emergent Behavior Modeling: The Case of Emergence of Human Language in a Resource-Constrained Complex Intelligent Dynamical System | Request PDF - ResearchGate, accessed August 23, 2025, https://www.researchgate.net/publication/324363181_System_Theoretic_Foundations_for_Emergent_Behavior_Modeling_The_Case_of_Emergence_of_Human_Language_in_a_Resource-Constrained_Complex_Intelligent_Dynamical_System
  38. Emergent social conventions and collective bias in LLM populations - PMC - PubMed Central, accessed August 23, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC12077490/
  39. Emergent Social Conventions and Collective Bias in LLM Populations (preprint of Science Advances 11 (20), eadu9368, 2025) - arXiv, accessed August 23, 2025, https://arxiv.org/html/2410.08948v2
  40. Empirical Modeling of Complex Systems | NSF - National Science Foundation, accessed August 23, 2025, https://www.nsf.gov/events/empirical-modeling-complex-systems-0/2016-03-03