The Lawrence Livermore National Laboratory / American Heart Association CoViD-19 Data Portal

The Lawrence Livermore National Laboratory / American Heart Association CoViD-19 Data Portal is a database of therapeutic leads for SARS-CoV-2. The portal contains in silico predictions and selected experimental validation for public small molecule libraries and designed synthetic antibodies. See Discovery of Small-Molecule Inhibitors of SARS-CoV-2 Proteins Using a Computational and Experimental Pipeline.

By providing a large set of predicted human protein-ligand interactions – with a focus on the SARS-COV-2 proteins – this portal is intended to provide insight into drug-candidate molecules with regard to both efficacy (on-target interactions) and safety (off-target interactions). 

Interaction calculations – which are in progress, and thus are available with varying degrees of completeness at this time – are provided for an initial set of ligands: Federal Drug Administration (FDA) approved drugs, “Other-world-approved” drugs; that is, drugs not in the FDA-approved set that have been approved for use by the European Union, Canada, and Japan, and compounds from the ChEMBL, eMolecules, and Enamine datasets.  Enamine used for model evaluation in Enabling rapid COVID-19 small molecule drug design through scalable deep learning of generative models (SC20 Technical Program).

Data Release

Molecular docking

We use our in-house ConveyorLC program toolchain to automate the docking and rescoring of compounds against each binding site. The ConveyorLC toolchain comprises four parallel programs for protein preparation (CDT1Receptor), ligand preparation (CDT2Ligand), molecular docking (CDT3Docking), and Molecular Mechanics/Generalized Born-Solvent Accessible Surface Area (MM/GBSA) rescoring (CDT4mmgbsa). The toolchain depends on a number of external libraries, including the Message Passing Interface (MPI) library, the C++ Boost library, the Conduit library, the HDF5 library, and several molecular simulation packages, including Autodock Vina, the AMBER molecular simulation package, and MGLTOOLS. Computational results are aggregated and saved in a series of HDF5 files. A few auxiliary tools are included in the toolchain to query and extract data in the HDF5 files.

Over 76 million compounds were selected from four publicly-available compound libraries for docking.   From the ZINC database both FDA-approved and “world-not-FDA" drugs were assembled into the “World-approved 2018” set.   From ChEMBL, approximately 1.5 million unique compounds (for example, not including counterions) were used.   From eMolecules, approximately 18 M compounds (the “screening” set) were used.   From the Enamine “REAL” database of over 1.2 G enumerated structures of drug-like compounds thought to be synthetically feasible an initial set (limited by time/computer resources) of approximately 56 M compounds was selected. 

Compounds were docked against two binding pockets from the SARS-CoV-2 Spike protein (6m0j), and two conformations of the Main protease, Mpro or 3CLpro (6lu7 and 6y84).  The first Spike protein pocket is stabilized by disulfide Cys480-Cys488; this structure was subject to MD equilibration before its use as the binding target.  The second Spike site is in the proximity of the beta-turn formed by residues 501-505.

The CDT3Docking in ConveyorLC toolchain is based on Autodock Vina (Version 1.1.2) and uses MPI and multithreading hybrid parallel scheme. The docking grids of the binding sites were determined by the protein preparation program in the toolchain.  Compounds were prepared for docking in the following manner. SMILES strings and 2D SDF structures were imported into the Molecular Operating Environment (MOE) for removal of salts and metal-containing ligands, protonation states were set to the dominant form at pH 7, 3D structures were created and minimized, and relevant MOE descriptors were calculated.  The final structures were exported from MOE as SDF files.  These structures were then further processed by the ligand preparation in the toolchain by utilizing antechamber and the GAFF force field from the AMBER13 simulation package.

Single-point GBSA Rescoring

Protein-compound complexes were rescored using CDT4mmgbsa in the ConveyorLC toolchain. A total of ~10 million poses were rescored for each binding site because since each complex typically has10 docking poses. CDT4mmgbsa employs a master-worker parallel scheme, where the master is in charge of job dispatching and each worker receives jobs from the master and performs an MM/GBSA calculation using the AMBER sander program. The AMBER force field (amberff14SB) was used for the proteins; the apo proteins’ MM/GBSA energies were previously determined in the CDT1Receptor step. Partial atomic charges for the compounds were computed by antechamber using the AM1-BCC method; each compound’s charges were previously calculated by the CDT2Ligand step. An energy minimization – 1000 steps of steepest descent and 1000 additional steps of conjugate gradient – was performed on each docked compound-protein complex using the modified generalized Born model of Onufriev, Bashford, and Case with model 2 radii (igb=5) with a nonbonded cutoff of 25 Å. The MM/GBSA energy of the minimized complex structure was calculated using an infinite cutoff (999 Å) and a protein dielectric constant of 4. The binding affinity was computed by MM/GBSA energy of the complex less the sum of the MM/GBSA energies of the apo protein and the compound.

“Fusion” machine learning model

The Fusion models for Atomic and molecular STructures (FAST) predict the binding affinity of a protein and ligand by combining the predictions of a 3D convolution neural network model (3D-CNN) and a spatial-graph convolutional neural network (SG-CNN) model of the protein-ligand complex.  These FAST models were trained on the PDBbind database of protein-ligand solved structures accompanied by experimentally-determined binding affinity values.  See Improved Protein-ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference. SC21 Technical Program High-Throughput Virtual Screening of Small Molecule Inhibitors for SARS-CoV-2 Protein Targets with Deep Fusion Models.

This modeling approach learns binding interaction rules directly from atomic structure representations without the need for hand-curation of features intended to capture the mechanism of binding.  The 3D-CNN models atoms and their features in a 3D voxel representation. The 3D-CNN representation implicitly models pairwise relationships between atoms through the relative positioning of atoms in a 3D voxel grid, but does not pre-determine which atomic interactions to represent other than defining a minimum atomic resolution. This comes at the cost of having to learn a larger number of parameters to represent the grid. In contrast, the SG-CNN uses an explicit distance threshold to determine which pairs of atoms to consider in pairwise interactions, with the potential benefit of using fewer parameters in the model.

Fusion models combine different input modalities and learn their feature representations together.  Several fusion models have been proposed for the task of video activity recognition by bridging the dimension difference among multiple input modalities (e.g., visual and temporal data).  Our fusion model combines the independently trained 3D-CNN and Spatial Graph CNN models.

Data pre-processing and feature extraction.  Protein-ligand complex charges and protonation states are solved using UCSF Chimera with AMBER ff14SB for standard residues and AM1-BCC for non-standard residues. 

Feature extraction.  A common atomic representation was used for input to the structure-based deep learning models. The representation captures the properties of a typical organic molecule (proteins and ligands) and includes:

  • Element type: one-hot encoding of B, C, N, O, P, S, Se, halogen or metal (9 bits)
  • Atom Hybridization (1, 2, or 3) (1 integer)
  • Number of heavy atom bonds (i.e. heavy valence) (1 integer)
  • Number of bonds with other heteroatoms (i.e. hetero valence) (1 integer)
  • Structural properties: bit vector (1 where present) encoding of hydrophobic, aromatic, acceptor, donor, ring (5 bits)
  • Partial Charge (1 float)
  • Molecule type to indicate protein atom versus ligand atom (-1 for protein, 1 for ligand) (1 integer)
  • Van der Waals radius (1 float)

The OpenBabel cheminformatics tool (version 2.4.1) was used to extract the features for all binding complexes. Finally, all coordinates are centered by each ligand to produce the spatial representation of the binding complex.

Data.  The PDBbind 2016 “general” set (13,308 complexes), including the 4057 “refined” set (4,057 complexes), was used as training date.  The “core” set (290 complexes) was held out for testing.  The training set was split for a 90/10 training/validation regimen.

3D-CNN implementation.  The input volume dimension is N × N × N × C where N is the voxel grid size in each axis (48 in our experiment) and C is the number of atomic features. The volume length in each dimension is approximately 48Å where each voxel size is 1Å, which is sufficient to cover the entire pocket region while minimizing the collisions between atoms. Each atom is assigned to at least one voxel or more, depending on its Van der Waals radius. In the case of collisions between atoms, we apply element-wise addition to the atom features. Once all atoms are voxelized, Gaussian blur with σ = 1 is applied in order to populate the atom features into neighboring voxels.

SG-CNN implementation.   A straightforward definition of a molecular graph considers atoms as nodes and bonds as edges between the nodes. Such a 2D representation may be suitable for the modeling of chemical graphs; however it fails to capture non-covalent interactions (e.g., electrostatic interactions between the ligand and the protein binding pocket) that are necessary to model more complex biological structures. Atomic Convolutional Neural Networks (ACNNs) allow for the specification of a “local” neighborhood for each atom that is based on euclidean distance, effectively relaxing the covalency requirement to form edges in the graph representation of a protein-ligand binding complex. We use the PotentialNet architecture, which refines this idea by allowing for an arbitrary number of edge types and applies specialized update rules that are a generalization of Gated Graph Sequence Neural Networks.

Fusion model.  Models to combine multiple input sources or different feature representations have been applied to a number of computer vision applications, especially in the presence of multi-modal images or different image sensors. These fusion models benefit from multiple feature representations which are considered complementary to each other. We use a separate fusion neural network to combine feature representations from two independently trained models (3D-CNN and SG-CNN), each of which has its own strengths and weaknesses. The heterogeneous feature representations that the two models capture can enrich the fusion model’s features. 

Among several ways to fuse models, we adopt mid-level to late fusion approaches and coherent fusion. The mid-level approach uses intermediate-layer output activations from the 3D-CNN and SG-CNN models.  These are treated as input features passed to a fully connected layer, and the outputs are concatenated with the original input features before the next fully connected layer of the fusion model.

The late-fusion approach averages the final predictions of the CNN models. This approach is simple, but effective to combine multiple model predictions.

Whereas the mid-level fusion model trains each of the three constituent models 3D-CNN, SG-CNN and fused layers separately, the coherent fusion trains all layers together. In addition, the coherent-fusion model was trained on PDBBind 2019 (not 2016).

MD-trajectory-average GBSA

For a selected set of lead molecules – WorldApprovedDrugs_2018 with docking scores better than -7.5 kcal/mol – the pose that ranked best by the single-point MM/GBSA calculation was used as a starting point for a molecular dynamics (MD) simulation.  Overall-average and window-average GBSA scores were calculated over a 200-ns simulation trajectory.

The MD simulations were performed using the pmemd_cuda program in AMBER. The catalytic dyad (His41-Cys145) of the main protease was modeled as charged residues.  Charges for the thiolate of Cys145 were obtained from AM1-BCC calculations. The General Amber Force Field (GAFF) was used to model the ligands. The ligand-protein complex was solvated into a truncated octahedron of TIP3 water; 50 Na+ ions with a neutralizing number of Cl- ions were added to the solution. The system was energy minimized with 500 steps of steepest descents and 1500 steps of conjugate gradients. Initial equilibration was performed with NVT dynamics at 300K for 200 ps with positional constraints (K = 1 kcal/mol•Å2) on the CA atoms in residues. Electrostatic interactions were treated using Particle Mesh Ewald (PME) summation. The nonbonded interactions were cut off at 8 Å.  Further equilibration was performed with NPT dynamics for 4.8 ns. The pressure was set at 1 atm using a Monte Carlo barostat. The positional constraints were reduced to 0.5 kcal/mol•Å2). Production dynamics was performed for 200 ns without positional constraints. The MM/GBSA energies were calculated using utilizing the Generalized Born model of Onufriev, Bashford, and Case (igb=5) with a dielectric constant of 4.0 on coordinates saved every 20 ps.  

Safety and pharmacokinetic property predictions

For each molecule safety and pharmacokinetic property values have been predicted with models generated with the ATOM Modeling Pipeline (AMPL) and Maestro workflow manager. The AMPL software framework automates the machine learning process, including a number of molecular featurization and machine learning tools.  Models for a number of different safety indicators and pharmacokinetic properties were developed based on experimentally-determined values from both human and animal assays.  Results from twelve of those models are available here.

The following table describes the models/predictions available.  The “Model/features” column indicates the type of data used in each model to represent the molecule; “moe” indicates molecular descriptors calculated by either the Molecular Operating Environment software suite, or by RDKit; “ecfp" indicates representation by an extended-connectivity fingerprint calculated from each molecule's chemical structure using RDKit;  “graphconv" indicates a spatial-graph convolutional neural network representation derived from each molecule's chemical structure.  The “Model type” column indicates whether the model predicts a continuous value (a real number) – “regression” – or a discontinuous (binary classification) value – “classification”.

Model data/assay Description Model/features Model type
Human microsomal clearance model 2 log10 human microsomal clearance rate, alternate model graphconv regression
Aurora kinase A pIC50 for aurora kinase A inhibition graphconv regression
Aurora kinase B pIC50 for aurora kinase B inhibition ecfp regression
Muscarinic acetylcholine receptor M1 pIC50 for muscarinic acetylcholine receptor M1 inhibition graphconv regression
Muscarinic acetylcholine receptor M2 pIC50 for muscarinic acetylcholine receptor M2 inhibition graphconv regression
Muscarinic acetylcholine receptor M3 pIC50 for muscarinic acetylcholine receptor M3 inhibition graphconv regression
Histamine receptor H1 pIC50 for histamine receptor H1 inhibition graphconv regression
KCNA5 pIC50 for Kv1.5 potassium channel inhibition graphconv regression
SCN5A NaV1.5 model 2 pIC50 for sodium channel NaV1.5 inhibition, alternate model graphconv regression
Human VDss model 2 log10 steady state volume of distribution in humans moe regression
Bile Salt Export Pump model 2 pIC50 for bile salt export pump inhibition moe regression
KCNH2 graph model 2 pIC50 for hERG potassium channel inhibition graphconv regression

Molecular descriptors

Molecular descriptors are numerical calculations based on the structure and physicochemical properties of a molecule, including elements, formal charges, and bonds.  Results from eighty-one of those descriptors are available here. A number of descriptors have been calculated with RDKit.

For detailed definitions see QuaSar-Descriptor. Brief summaries are provided here.

PEOE - partial charges, Gasteiger method   PC+ - total positive
  RPC+ - relative (largest single-atom partial charge compared to PC+)
  VSA - van der Waals surface area
  POS - over positively-charged atoms
  PPOS - over atoms with charge > 0.2
  FHYD - hydrophobic (atoms with abs(charge) < 0.2)
  POL - polar (atoms with abs(charge) > 0.2)

SMR - molar refractivity (polarizability)
  SMR_VSAn - sums of surface area for atoms with SMR in various ranges

SlogP - Log of the octanol/water partition coefficient.
  SlogP_VSAn - sums of surface area for atom with SlogP in various ranges

TPSA - total polar surface area


Results available in the Lawrence Livermore National Laboratory / American Heart Association CoViD-19 Data Portal include protein structural models, protein-ligand complexes showing docking poses, and protein-ligand complexes showing refined GBSA binding poses (as mentioned above, in the GBSA calculations an additional step of energy minimization is applied to the docking poses).  These can be downloaded in PDB-file format (“.pdb”). 

Data Release

1.3 expected 2021

  • Compound Group: Enamine
  • Experimental validation
  • Safety property predictions
1.2 July 2021
  • Compound Group: Enamine
  • Calculation Type: Coherent-Fusion
1.1 July 2020
  • Compound Group: eMolecules
  • Compound Feature: Top Rank
  • Safety and pharmacokinetic property predictions
1.0 May 2020
  • Compound Group: ZINC database, ChEMBL
  • Compound Feature: Functional Groups, Chemical Descriptors
  • Calculation Type: Docking, GBSA, 3D-CNN, SG-CNN, Mid-Fusion, Late-Fusion
  • Homology model protein structures
  • Synthetic antibodies

Models/methods: Molecular docking | Single-point GBSA | “Fusion” machine learning model | MD-trajectory-average GBSA | Safety and pharmacokinetic property predictions