Publications

You can also find my articles on my Google Scholar profile.

Conference Papers

AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing

Published in NeurIPS 2025, 2025

Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work, we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.

Download Paper

NeurIPS ML4PS: Product Manifold Machine Learning for Physics

Published in NeurIPS ML4PS 2024, 2024

Short-form discussion of Product Manifold Machine Learning for Physics, accepted to NeurIPS ML4PS 2024.

Download Paper

Journal Articles

Re-Simulation-based Self-Supervised Learning for Pre-Training Foundation Models

Published in Physical Review D, 2025

Self-supervised learning (SSL) is at the core of training modern large machine learning models, providing a scheme for learning powerful representations that can be used in a variety of downstream tasks. However, SSL strategies must be adapted to the type of training data and downstream tasks required. We propose resimulation-based self-supervised representation learning (RS3L), a novel simulation-based SSL strategy that employs a method of resimulation to drive data augmentation for contrastive learning in the physical sciences, particularly in fields that rely on stochastic simulators. By intervening in the middle of the simulation process and rerunning simulation components downstream of the intervention, we generate multiple realizations of an event, thus producing a set of augmentations covering all physics-driven variations available in the simulator. Using experiments from high-energy physics, we explore how this strategy may enable the development of a foundation model; we show how RS3L pretraining enables powerful performance in downstream tasks such as discrimination of a variety of objects and uncertainty mitigation. In addition to our results, we make the RS3L dataset publicly available for further studies on how to improve SSL strategies.

Download Paper

arXiv Papers

Product Manifold Machine Learning for Physics

Published in arXiv, 2024

Physical data are representations of the fundamental laws governing the Universe, hiding complex compositional structures often well captured by hierarchical graphs. Hyperbolic spaces are endowed with a non-Euclidean geometry that naturally embeds those structures. To leverage the benefits of non-Euclidean geometries in representing natural data, we develop machine learning on product manifold (PM) spaces, Cartesian products of constant curvature Riemannian manifolds. As a use case, we consider the classification of “jets”, sprays of hadrons and other subatomic particles produced by the hadronization of quarks and gluons in collider experiments. We compare the performance of PM-MLP and PM-Transformer models across several possible representations. Our experiments show that PM representations generally perform equal to or better than fully Euclidean models of similar size, with the most significant gains found for highly hierarchical jets and small models. We discover significant correlation between the degree of hierarchical structure at a per-jet level and classification performance with the PM-Transformer in top tagging benchmarks. This is a promising result highlighting a potential direction for further improving machine learning model performance through tailoring geometric representation at a per-sample level in hierarchical datasets. These results reinforce the view of geometric representation as a key parameter in maximizing both performance and efficiency of machine learning on natural data.

Download Paper

Nate S. Woodward
