Machine learning techniques for modelling and designing chemical systems
Designing biological and chemical materials is a longstanding challenge central to many subdisciplines of chemistry and biology. It remains a challenge in large part because the design space is high-dimensional and the number of possible materials is unimaginably large. Furthermore, we lack robust methods that provide cheap and accurate mechanistic insight into arbitrary materials; existing methods are slow, expensive, or limited to low-resolution information. Recently, an explosion of work has brought machine learning techniques into the mainstream, as these methods have demonstrated remarkably human-like capabilities and solved problems long thought to be unsolvable. In this work, we build on these advances in machine learning and develop methods for designing and modelling complex molecular systems, including those out of equilibrium.
While together aimed at the broader goal of chemical design, the thesis contains three independent scientific narratives. The first narrative concerns the post-training of molecular language models, with the goal of generating novel molecules with desired functionalities. Central to this narrative is the fact that large language models have an incredible capacity for representing diverse high-dimensional spaces; we devise an approach for fine-tuning these models to restrict sampling to narrow regions of chemical space that are of interest. The second narrative concerns the rapid sampling of biomolecular conformational ensembles. Free energy landscapes of biomolecules are extremely rugged, with many metastable states, which limits the ability of simulation methods such as molecular dynamics to sample ergodically for large systems. We develop methods that leverage coarse-grained physics-based models to map out this landscape quickly but coarsely, together with generative models that backmap the sampled coarse-grained structures, so that the final samples are Boltzmann-distributed and at atomistic resolution. The third narrative concerns the design of nonequilibrium materials in systems that display rich responses to external stimuli. We develop reinforcement learning-based approaches that can be used to engineer novel metamaterials even with relatively limited control. Of course, this nonequilibrium control is not free and necessarily incurs an energetic cost. We introduce frameworks for understanding the entropic cost of control and conclude the work with approaches for simultaneously targeting desired outcomes and minimizing the dissipative cost of control.