- Journal List
- Sci Adv
- v.7(39); 2021 Sep
- PMC8462901
Sci Adv. 2021 Sep; 7(39): eabj2465.
This article has been cited by other articles in PMC.
Abstract
The rule-based search of chemical space can generate an almost infinite number of molecules, but exploration of known molecules as a function of the minimum number of steps needed to build up the target graphs promises to uncover new motifs and transformations. Assembly theory is an approach to compare the intrinsic complexity and properties of molecules by the minimum number of steps needed to build up the target graphs. Here, we apply this approach to prebiotic chemistry, gene sequences, plasticizers, and opiates. This allows us to explore molecules connected to the assembly tree, rather than the entire space of molecules possible. Last, by developing a reassembly method, based on assembly trees, we found that in the case of the opiates, a new set of drug candidates could be generated that would not be accessible via conventional fragment-based drug design, thereby demonstrating how this approach might find application in drug discovery.
INTRODUCTION
Chemical space is populated by a vast range of compounds, which can be characterized by their molecular composition, formula, graph representation, and reactivity (1, 2). The generation of molecules via their graphs can be enumerated to give an unimaginably vast number of at least 1060 small organic molecules (1, 3, 4), but this is unrealistic, as many of these molecules might be unstable or inaccessible synthetically. However, when exploring biochemistry, only a few hundred different types of “unique” small molecules are needed by the simplest living organisms (1, 5), indicating that the chemical space relevant to biology on Earth is a tiny fraction of chemical space that is possible (6, 7). For example, many of the related structures between these known compounds are undiscovered and unknown (e.g., opiates and cannabinoids are found to occur in related clusters of structures but no cross between these structure types are known). The issue therefore arises about how the space of molecules can be effectively searched (1, 2, 8) and what constrains molecules to be both thermodynamically possible and biologically accessible (2, 9–11) because although many molecules are physically possible, the number of molecules accessible by the current machinery of biology is smaller (9, 12–15).
One way to explore the universe of molecules is to construct a chemical space as a dataset, e.g., GDB-17 database (16, 17), DrugBank (18), and PubChem (19), and then navigate the dataset using molecular descriptors (20–23). Searching these databases is inefficient because it requires exhaustively enumerating [or Bayesian optimization (24)] and screening molecules for desirable properties. Similarly, it is possible to iteratively generate chemical subspaces, followed by filtering unwanted structures, until desired molecules are obtained. This can be done using a genetic algorithm (25, 26), extrapolation techniques (27, 28), or even using human intuition (29). However, given the relative sizes of possible chemical space, compared to the number of interesting molecular structures, it is not clear how comprehensive any of these approaches will be. Recently, machine learning and statistical techniques have been introduced into the navigation of chemical space (30, 31). For example, hundreds of thousands of existing chemical structures were used to train a deep neural network so that each molecule can be assigned to a set of discrete coordinates in the continuous latent space of the neural network. Navigating within this discrete chemical space corresponds to navigating in the continuous latent space, which is much easier computationally (32). Alternatively, a convolutional neural network can be trained directly on graph representations of molecules to infer their molecular features, and these can have a better predictive performance over the existing hand-crafted fingerprints in some applications (33). These approaches offer improvement over raw enumeration and filtering because they compress the search space. However, while neural networks might make the navigation of chemical space more efficient, the space is obscured and important contingent information is not accessible (1, 2). One important question is how the current structure of observable chemistry relates to the space of biology. This is interesting because evolution has selected the machinery of biochemistry over the past 4 billion years on Earth (1, 10, 11, 34). By constructing the assembly tree from molecular structures, we will be able to not only use a molecular-based route to explore the information used to assemble the molecules found in biology but also infer which molecules are more likely to have shared pathways (i.e., infer the presence of new biological pathways). This is another important window not only on how the process of evolution by random selection leads to both conservation and reuse of biochemical pathways but also in the generation of novelty.
RESULTS AND DISCUSSION
Establishing the theory of assembly spaces of molecules
Assembly theory (35, 36) quantifies the constraints required to produce a molecule by measuring the minimum number of steps to produce the molecular graph thereof. Here, we apply this approach to explore the structure of chemical space and suggest a way to generate new compounds from the assembly space (see Fig. 1). In the adenine example, we chose the four chemical bonds that make molecule, namely, C─C, C═C, C─N, and C═N, as the basic building blocks. We call these building blocks and the molecular structures that will be produced therefrom the assembly building blocks and call the set of all assembly building blocks the assembly pool. In an assembly pool, any type of assembly building block is assumed to have infinite instances. One assembly step is precisely defined as the three sequential operations: (i) take two assembly building blocks from the assembly pool, (ii) join the two together in a way based on the particular system in question (in this context, that is, to superimpose certain atoms from the two building blocks to make a larger molecular structure), and (iii) add the new composite building block to the assembly pool. Once a sequence of the assembly steps can successfully construct the target molecule, this is defined as an assembly pathway for this molecule. Figure 1A shows one of many assembly pathways of adenine, which has seven assembly steps after which adenine is constructed and appears in the assembly pool and gives an assembly index of 7. Last, the assembly space of a molecule refers to all the assembly building blocks included in the assembly pathways. The assembly index of the shortest pathway to construct a molecule is referred to as the molecular assembly number (MA) of the molecule. The assembly pathway shown in Fig. 1A is the shortest one, calculated out by the Monte Carlo (MC) algorithm that we have developed (which is guaranteed to give the shortest assembly pathway when the computing time is sufficiently long; see section S3 for details), and adenine’s MA is thus 7.
Representations of an assembly pathway, by taking adenine as an example.
(A) One of the many assembly pathways of adenine (it turns out to be the shortest one, according to our MC algorithm, explained later). The assembly pool (shown inside the dashed boxes) evolves with each assembly step. The colors denote which two assembly building blocks are used to make the new one (note that the color schemes are independent for each step). (B) The key-step representation of the assembly pathway. (C) The joint process for each key assembly step, which is used to work out the multiset representation. (D) The multiset representation of this assembly pathway. Strictly speaking, it should be written as {[1]1, [2]1} where the superscript “1” is the multiplicity of this assembly building block, that is, after canceling out, it appears once on the left-hand side of (C), but for simplicity, we only explicitly write down the multiplicity when it is not 1.
In assembly pathways, the order of some steps can be switched without changing the length of the pathway, which leads to a combinatorically large number of trivial pathways that all have the same number of steps. For instance, as the pathway shown in Fig. 1A starts from building block [3] to make building block [6], we can either add a C─C bond to [3], then a C═C, and lastly add building block [2], just as the figure shows, or we can add [2] first, then a C═C, and, last, a C─C, which leads to another pathway. Taking Fig. 1A as an example, building block [1] must be made before [2] and [3] because [1] is used to make [2] and [3]; by the same logic, building block [2] must be made before [3] and [6] because [2] is used to make them. We can use these properties to represent assembly pathways without ambiguity by focusing on the steps in which order matters. We call those special building blocks that define the hierarchical relationships among the chemical structures the key assembly building blocks (they are also the assembly building blocks that are used more than once in the pathway) and their corresponding steps as the key assembly steps.
Therefore, we can represent a pathway in terms of key steps, which eliminates all the trivial information. For example, Fig. 1B is the key-step representation of the pathway shown in Fig. 1A, and the key building blocks are [1] and [2]. The number of key building blocks can be elucidated from the key-step representation. We can explicitly write down the joint process for each key step and then remove the building blocks that appear on both sides (Fig. 1C). The building blocks left over constitute the target molecule nonrepetitively. Specifically, the target molecule adenine can be made from these chemical bonds and structures nonrepetitively and by the least number of assembly steps. The information of the basic building blocks is trivial and can thus be omitted. So, we lastly obtained the multiset representation of this assembly pathway, as shown in Fig. 1D (see section S2). The multiset representation can be readily determined from the key-step representation without ambiguity and vice versa. The latter emphasizes the hierarchical relationship between the building blocks, while the former emphasizes the information of constituents and provides a compact summary of the assembly space associated with this pathway. An assembly pathway of a molecule does not necessarily correspond to a realistic sequence of chemical reactions that produce this molecule. Instead, the shortest assembly pathway bounds the likelihood of the molecule forming probabilistically (which means that if the shortest assembly pathways of two molecules overlap, then they are likely to have shared synthetic pathways). No matter which methods or synthetic approaches are used, there will be no shorter way than this ideal one, which makes it an intrinsic property of a molecule.
Molecular assembly trees
The concept of assembly pathways and spaces can be naturally applied to two molecules, which allows us to look at the shortest assembly pathways that construct both simultaneously. In general, the shared shortest assembly pathway of A and B is not the union of the individual shortest assembly pathways of A and B. As an example, consider adenine and another nucleobase thymine (Fig. 2A). The shortest pathway of adenine alone is indicated by the blue dashed arrows on the left whose MA is 7 (the same pathway as in Fig. 1).
Two exemplified molecular assembly trees.
(A) The assembly space of adenine and thymine. The shortest assembly pathway for adenine alone is indicated by the blue dashed arrows, while the shortest assembly pathway for thymine alone is indicated by the red dashed arrows. The shortest assembly pathway to make adenine and thymine altogether is the one indicated by the black dashed arrows. (B) A molecular assembly tree for A, G, T, U, and C, which can also be written as {[2, 10, 11, 12, 13]}, whose index is calculated to be 16. Note that, in both (A) and (B), the colors are just used to help the reader recognize the building blocks, and the color schemes are independent; we also omitted the arrows starting from the basic building blocks for a better visualization.
The shortest pathway of thymine alone is indicated by the red dashed arrows on the right, which can be written in multiset representation as {[7]}, and its MA is calculated to be 6. However, the shortest assembly pathway to make adenine and thymine altogether is indicated by the black dashed arrows in the middle, which does not overlap with either of the shortest pathways. It is {[8]2, [9]} in multiset representation (where the superscript “2” is the multiplicity of [8]), and its pathway index is calculated to be 12, which is smaller than 7 + 6, the sum of the two individual MAs. Molecular assembly theory can be extended further to three or more molecules, which allows us to look at the shortest assembly pathways that construct a group of molecules. The multimolecular assembly spaces tend to have a tree-like structure where different branches lead to different molecules (see Fig. 2A), but the number of key building blocks is still relatively small. Therefore, we refer to the shortest assembly pathways to make a group of molecules altogether as the molecular assembly tree (assembly tree for short) thereof and refer to its index as assembly tree MA.
As an example, we built an assembly tree for the five nucleobases: adenine (A), guanine (G), thymine (T), uracil (U), and cytosine (C). We first extended the MC algorithm that we have developed for a single molecule to a group of molecules (see details in section S4). We then use this extended algorithm to compute the assembly tree and then visualize the tree manually, as shown in Fig. 2B (see section S4.3). The MA is 16, but it takes 43 steps to build the molecules bond by bond, and the minimum number of steps to construct them separately using an assembly process is 33. This relatively low MA reflects the fact that they share lots of common structures (even the common structures share quite a few common substructures), resulting in the hierarchy shown by the assembly tree, which represents a highly related subset of chemical space.
Biomolecules
For this study, we picked a dozen vital biomolecules to construct their assembly tree, including the five nucleobases (A, G, T, C, and U), pyruvate [a key intermediate in metabolic pathways across various organisms (37)], and citrate [an intermediate of the vital metabolic pathway, the Krebs cycle, used by all aerobic organisms to release energy (38, 39)]. In addition, we included d-ribose [the carbohydrate that serves as the backbone of RNA among various other functions (40, 41)], nicotinamide adenine dinucleotide [NAD+; a vital cofactor that carries electrons from one reaction to another (42)], adenosine diphosphate (ADP), adenosine triphosphate (ATP), and a symbolic RNA molecule. As we see in Fig. 3, there are lots of structures that are shared. We can imagine that as we include more biomolecules in the tree (e.g., various proteins and RNA and DNA sequences), more structures will be shared, and the tree will grow deeper and deeper but without growing much wider (i.e., having more hierarchical layers but not many nonrelated key blocks being added to the tree). This potentially “narrow” assembly tree is an indication that all of the vital biomolecules involved in extant life on Earth is not arbitrary but a consequence of millions of years of evolution.
The assembly tree of a dozen vital biomolecules, including the five nucleobases (A, G, T, C, and U), pyruvate, citrate, d-ribose, NAD+, ADP, ATP, and a symbolic RNA molecule.
In origin of life studies, one central problem is that most biomolecules are complex and thus seem very unlikely to emerge de novo from prebiotic chemistry, while life requires them altogether to function properly. An important clue that we gain from this study is that lots of chemical substructures are shared among these vital biomolecules. Hence, it can be argued that the set of processes producing these molecules together, as a set, could have been smaller than if they were produced de novo individually because they could have shared common pathways. The closer the biomolecules are in the assembly tree (i.e., they are more related), the easier it is to access these compounds. This is arguably why biomolecules exploited by extant life appear close in the assembly tree; otherwise, they would be too complex to emerge individually. For example, the fact that the five nucleobases AGTCU are closely related in the assembly tree indicates that it is not arbitrary nor a result of a frozen event that they serve as the fundamental units of the genetic code across all life on Earth and that the molecules may have been incorporated into protobiological systems because of their structural relatedness. We tested this idea by building assembly trees for alternated nucleobases (see section S5 for details). We found that MAs of alternated trees are always much higher (ranging from 31 to 38) than the one in reality, which is 16, as mentioned in the previous section. This analysis only indicates that the extant set of nucleobases were selected because of their relatedness, but why nature selects this particular set is another question since there could be many such sets. This intriguing and important question requires a much wider exploration of chemical space, yet we believe our methods provide a useful tool for framing this question.
Gene sequences
So far, we have focused on constructing assembly trees for molecules, but our theory can equally be applied to gene sequences. We take one hypothetical gene sequence X (60 bases; Fig. 4) as an example to illustrate how we can make use of the compressed information carried by X’s assembly tree to reconstruct itself with less efforts. The naïve method is to add one base at one time, and then, 59 steps in total are needed to construct X, but notice that some segments are repeated multiple times. If it is possible to produce these repetitive segments beforehand, then it is possible to take them directly and combine them with other segments in a precise way to obtain X, and this requires less than 59 steps [as producing repetitive segments is relatively easy (43), e.g., polymerase chain reaction, the reconstruction of the original sequence takes less efforts than the naïve method]. As the assembly tree can filter all of the repetitive and redundant information (i.e., record the information of the sequence in the most compressed way), the information of this “precise way” is completely stored in the assembly tree. Then, we build X’s assembly tree, as shown in Fig. 4 (in this example, we only deal with one sequence X, and its assembly tree reduces to its shortest assembly pathway). Note that in this gene sequence case, we use nucleobases as the basic building blocks rather than chemical bonds as in the molecule cases.
The assembly tree of one hypothetical gene sequence X (as, here, we only deal with one sequence X, it reduces to its shortest assembly pathway in the key-step representation).
Here, we use nucleobases as the basic building blocks rather than chemical bonds in the molecule cases. Thus, we explicitly draw those nucleobases at the bottom for clarification.
Before reconstructing, we introduce a way to compress the information hold in the assembly tree into a single string, which would be very useful in future studies, as the assembly tree of genomes would be huge and directly storing the tree structure is neither convenient nor efficient (note that while preserving the assembly tree information is our priority, we are not intended to defeat any sophisticated data compression technique per se). The assembly tree in Fig. 4 can be rewritten as CG_TTG_A1_GAC1G_3CTC4_1T5CAG42342543, denoted as Y. Now, we can reconstruct X based on Y. The first step is to construct CG by simply combining the individual bases C and G (one step) and construct TTG by combining the individual bases T, T, and G (two steps). Second, construct A1, where “1” stands for the first segment in Y, which is CG. Thus, we only need one step to obtain ACG since CG has been constructed before. Third, construct GAC1G, where we can reuse “1” again, and we thus need four steps. Fourth, construct 3CTC4, where we can reuse “3” (A1, the third segment in Y) and “4” (GAC1G, the fourth segment in Y), and we thus need four steps. Last, we can construct the original sequence X based on the last part of Y, namely, “1T5CAG42342543” where the integer stands for the corresponding segment that has been obtained before and can thus be reused. Thus, we need 13 steps here. In total, we need 1 + 2 + 1 + 4 + 4 + 13 = 25 steps, which is much less than the naïve 59 steps. Last, to quantify the increase of information from the original sequence X (60 letters) to the lossless and compressed version Y (36 letters), we can use Shannon entropy (44, 45), a widely used quantity to describe the information of a string. Shannon entropy H of a string X (with n letters) is defined as H(X) = − ∑xp(x) · log2p(x), where p(x) is the probability that the letter x appears in this sequence X, which is set to be equal to the times x appears in X divided by n, and the sum goes through every distinct letter. Therefore, we obtained that Shannon entropy of the original sequence H(X) is 1.851, while H(Y) is 3.251, increased by 1.76 times.
We do think that the application of assembly theory to gene assembly has potential application to not only building new routes to engineer synthetic genomes by taking a series of genes, finding the common parts, and then finding the minimal route to assemble these parts to access all of the desired genes. Not only could this be used to efficiently build new function, but it also has promise to explore how evolution has reused genetic motifs beyond the current modular understanding and perhaps find more complex and conserved routes for the propagation of genetic information across different genes.
Plasticizers
Plasticizers are added to polymers and formulations to make them more plastic, to decrease viscosity and friction, and to increase flexibility (46), but they can leach into the environment. This is a problem since these compounds have been shown to be toxic (47). One big issue is that, in general, the evaluation of specific effects and prevalence of plasticizer molecules in the environment is hard, as so many different types are in use and many degradation pathways exist (47). This means that potentially vast numbers of molecules related to the plasticizer parent are present in the environment, posing similar or even greater health risks than the parent. However, by exploring the assembly tree of plasticizer molecules (see Fig. 5), it might be possible to map the molecules that are potentially contaminated and even identify unknown or unexpected molecules.
The assembly tree of 10 commonly used plasticizers including BBP, DEHP, DEHA, and others.
For a clearer visualization, all plasticizers are made dimmer than other parts of the tree. The most central structures are highlighted green.
Here, we built the assembly tree of 10 commonly used plasticizers including BBP (benzyl butyl phthalate), DEHP (di-2-ethylhexyl phthalate), DEHA [bis(2-ethylhexyl) adipate], and others. As seen from Fig. 5, these 10 seemingly distinct molecules share lots of common structures (i.e., key assembly building blocks), which are also highly related. These 10 relatively large plasticizers are constructed by only a dozen of these key building blocks. In particular, the four central structures highlighted green directly or indirectly connect to most of them (except for tricresyl phosphate) and are highly related themselves. We can thus imagine that if we can detect and identify these central/characteristic chemical structures and fragments in an environment [e.g., using the mass spectrometry technique for assembly theory (36)], then we should be able to track down the parent molecules and substantially narrow down the list of suspected contaminants. Nevertheless, it should be noticed that we only introduced the concept here. The development of a reliable detection method would require a massive tree inclusive of all relevant compounds both in terms of possible pollutants and products expected in the absence of pollution based on the environmental conditions.
Opiates
The search of opiate-based chemical space is an important test case since the family of compounds is highly distinctive with well-defined modular parts. One idea could be to use assembly trees to explore a set of potentially biochemically accessible new structures, as well as deduce what contingent information is present therein, and use them for finding new molecular targets. To do this, we computed the assembly tree of 10 compounds in the family of opiates (Fig. 6): Some of them are found in the opium plant (morphine, codeine, thebaine, and papaverine); some are synthetic opioids (fentanyl, remifentanil, methadone, pethidine, and diamorphine, also known as heroin), and the last one is salvinorin A, which is a κ-opioid receptor agonist (48) found in the Salvia divinorum plant but might not be properly considered as an opioid. The assembly tree captures some known features of the chemical space associated with these compounds: (i) Morphine, codeine, thebaine, and diamorphine are clustered in one place, with a major structure shared. (ii) Fentanyl and remifentanil are close in the tree, as they share a large substructure. (iii) In contrast, methadone, pethidine, and papaverine are not as closely related, sharing just a relatively small benzyl moiety. (iv) Salvinorin A is distinct from others, as it only connects with other compounds via very small structures.
The assembly tree of nine compounds in the family of opiates and one κ-opioid receptor agonist (salvinorin A).
Some of these opiates are natural (morphine, codeine, thebaine, and papaverine), while others are synthetic (fentanyl, remifentanil, methadone, pethidine, and diamorphine, also known as heroin). For a clearer visualization, all opioids are made dimmer than other parts of the tree.
By computing the assembly tree of these compounds, we obtained