COVID-19 Science & Technology Working Group

Brookhaven Lab's COVID-19 Science & Technology Working Group is coordinating ideas and activities across the Lab to ensure that we bring the Laboratory’s strengths and talents to help fight the pandemic. Working Group members represent the range of research and support operations at the Lab and are working to advance key research areas while also exploring more immediate help.

We are soliciting feedback from the greater research community on how Brookhaven Lab can support COVID-19 research efforts and we invite our colleagues to contact us below.

For all researchers

For Brookhaven staff

Resources available to support COVID-19 research

Structural Biology

A Rapid Access form is available to submit a proposal for COVID-19-related research on NSLS-II structural biology beamlines. See other structural biology resources across the DOE complex.

Nanotechnology

For information about access to Brookhaven’s Center for Functional Nanomaterials or to seek a collaboration with CFN researchers, see the CFN website.

Computation

For information on simulations, machine learning, and other artificial intelligence tools, see Brookhaven's Computational Science Initiative website.

Computational project details

PDBQT Docking Study

The docking programs require molecular structures in the Protein Data Bank with Partial Charges and AutoDock Atom Types (PDBQT) format. Converting simplified molecular-input line-entry system (SMILES) strings into PDBQT files is simple in principle: run Open Babel, then the AutoDock Tools Python script "prepare_ligand4.py." However, converting hundreds of millions of SMILES strings generates terabytes of data, which makes this a non-trivial operation.
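The two-step conversion described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the helper names (`build_conversion_commands`, `batched`), the intermediate MOL2 format, and the use of the MGLTools `pythonsh` interpreter are assumptions, and the command-line flags should be checked against the installed versions of Open Babel and AutoDock Tools. The functions only build the commands; nothing is executed.

```python
from pathlib import Path


def build_conversion_commands(smiles, name, workdir="."):
    """Return the two command lines that turn one SMILES string into a PDBQT file.

    Nothing is run here; each returned list can be passed to
    subprocess.run(cmd, check=True) on a machine with both tools installed.
    """
    mol2_path = Path(workdir) / f"{name}.mol2"    # assumed intermediate format
    pdbqt_path = Path(workdir) / f"{name}.pdbqt"

    # Step 1: Open Babel reads the inline SMILES ("-:") and generates 3D coordinates.
    obabel_cmd = ["obabel", f"-:{smiles}", "-O", str(mol2_path), "--gen3d"]

    # Step 2: AutoDock Tools writes the PDBQT file with charges and atom types.
    # "pythonsh" is the interpreter shipped with MGLTools (assumed to be on PATH).
    prepare_cmd = ["pythonsh", "prepare_ligand4.py",
                   "-l", str(mol2_path), "-o", str(pdbqt_path)]
    return obabel_cmd, prepare_cmd


def batched(items, size):
    """Yield lists of at most `size` items so a huge input can be split into jobs."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

At the scale of hundreds of millions of molecules, each batch would typically be handed to a separate job, with output written to sharded directories so that no single directory holds millions of files.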

Molecules Docking Study

The COVID-19 project at Argonne uses a multi-stage approach to home in on a set of promising drug candidates. The first stage involves RL-DOCK, a docking score predictor trained with reinforcement learning. Another stage involves DeepDriveMD, a machine-learning-driven molecular dynamics simulation engine. The project also expects to use a consensus score to combine outcomes from RL-DOCK and multiple docking methods. Training and tuning these methods can be facilitated by providing a set of examples. For this purpose, we (BNL) are using AutoDock4 to dock 300,000 molecules, which are supplied as a set of simplified molecular-input line-entry system (SMILES) strings.
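The text does not say how the consensus score is computed. One common option, shown here purely as an assumption, is to rank the molecules under each method and average the ranks. The function name, method names, molecule names, and scores below are all hypothetical.

```python
def consensus_rank(scores_by_method):
    """Combine per-method docking scores into a mean-rank consensus.

    scores_by_method maps method name -> {molecule: score}, where a lower
    (more negative) score is better. Only molecules scored by every method
    are ranked. Returns (molecule, mean_rank) pairs, best first.
    """
    molecules = set.intersection(*(set(s) for s in scores_by_method.values()))
    totals = {mol: 0.0 for mol in molecules}
    for scores in scores_by_method.values():
        # Ties are broken by molecule name so the result is deterministic.
        ordered = sorted(molecules, key=lambda mol: (scores[mol], mol))
        for rank, mol in enumerate(ordered, start=1):
            totals[mol] += rank
    n_methods = len(scores_by_method)
    mean_ranks = [(mol, total / n_methods) for mol, total in totals.items()]
    return sorted(mean_ranks, key=lambda pair: (pair[1], pair[0]))


# Hypothetical scores from two methods (illustrative values only).
example = {
    "method_a": {"mol1": -9.1, "mol2": -7.4, "mol3": -8.0},
    "method_b": {"mol1": -8.5, "mol2": -8.9, "mol3": -6.2},
}
ranking = consensus_rank(example)
```

Averaging ranks rather than raw scores avoids having to put scores from different docking programs on a common scale.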

Search and Mining System for COVID-19-related Literature via Natural Language Processing

CSI’s natural language processing (NLP)-driven information extraction work is divided roughly along three timescales:

Immediate rapid effort. Parse and filter sentences from literature data sets, extract features with available pre-trained language models (e.g., BioBERT; SciBERT), and apply sentence classification techniques to identify text containing information and properties of interest (e.g., toxicity or immune response). Note that the available data sets are too small to fully train deep language models, so pre-trained models are used and fine-tuned for specific tasks (sentence classification). This requires labeled data, which takes significant time to annotate manually. Currently, CSI is applying coarse keyword-based automated labels, using terms provided by domain experts and boosted with similar terms found in Word2vec embedding space.

Mid-term. More specialized NLP tasks, such as named entity recognition (NER) and entity linking, are being explored to support higher-level inference, such as automatic relation extraction. We also are establishing a coordinated data annotation effort with Oak Ridge National Laboratory (likely using the open-source Doccano tool).

Long-term. We aim to improve performance by training more sophisticated language models (e.g., XLNet, ERNIE 2.0, and T5) on large-scale scientific text data sets (e.g., PubMed Central [PMC]) and then fine-tuning them on domain-specific article sets.
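The coarse keyword-based labeling mentioned under the immediate effort could look roughly like the sketch below. It is an illustration under stated assumptions, not CSI's code: the function name, the label names, and the keyword lists are hypothetical, the sentence splitter is deliberately naive, and the Word2vec step that expands the expert terms is left out.

```python
import re


def label_sentences(text, keywords_by_label):
    """Assign coarse labels to sentences by whole-word keyword matching.

    keywords_by_label maps a label to a non-empty list of keywords.
    Returns a list of (sentence, sorted_labels) pairs.
    """
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    patterns = {
        label: re.compile(
            r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
            re.IGNORECASE,
        )
        for label, keywords in keywords_by_label.items()
    }
    labeled = []
    for sentence in sentences:
        labels = sorted(lab for lab, pat in patterns.items() if pat.search(sentence))
        labeled.append((sentence, labels))
    return labeled


# Hypothetical keyword lists; real ones would come from domain experts.
keywords = {
    "toxicity": ["toxicity", "hepatotoxicity", "toxic"],
    "immune_response": ["cytokine", "antibody"],
}
sample = ("The compound showed hepatotoxicity in mice. "
          "Cytokine levels rose after dosing. "
          "The sample was stored overnight.")
weak_labels = label_sentences(sample, keywords)
```

Labels produced this way are noisy, so they serve as a starting point for fine-tuning a sentence classifier rather than as a substitute for manual annotation.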

Natural Language Processing: Neural-network-based language model

Gilchan Park, a postdoctoral research associate with CSI, has collected a body of publications related to COVID-19, as well as to similar prior outbreaks such as SARS and MERS. Some of the publications were made available recently through a White House initiative. Other publishers have since followed suit, and there now is free access to more than 21,354 publications (9,069 without replicates from the COVID-19 Open Research Dataset [CORD]). More than 15,000 of these papers were published in the last three months, making it impossible for any one person to digest, evaluate, and prioritize the content for relevance to their research.

This project aims to build a neural-network-based language model, trained on the collection of COVID-19 articles, that extracts text segments relevant to domain-specific queries (e.g., what drugs are in clinical trials for COVID-19?) from the literature. These data can be used to provide annotators with a filtered list of text segments from which to create training samples for question-answering systems and complicated classification tasks.

At BNL, CSI’s new data set features articles that are automatically updated every Saturday on the BNL Provenance server. For now, data from Elsevier and Springer are available to share with other national labs, and PMC data are available to everyone because they are open-access articles.

Natural Language Processing: Keyword searching

Samuel Chen, a senior technology analyst with CSI, is working on Word2vec and BM25 models (both used in machine learning and data mining) for keyword searching. The backend can later be upgraded to support the natural language processing (NLP) model being developed by Carlos Soto and Gilchan Park, both from CSI, which should be more accurate.
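For readers unfamiliar with BM25, the following is a minimal, self-contained sketch of Okapi BM25 ranking in plain Python. It is not the CSI implementation: the tokenizer, the parameter defaults (`k1=1.5`, `b=0.75`), the smoothed non-negative IDF variant, and the example documents are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return (document_index, score) pairs sorted by descending BM25 score."""
    docs = [tokenize(d) for d in documents]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)

    scores = []
    for idx, doc in enumerate(docs):
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1; b controls length normalization.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up titles used only to exercise the function.
titles = [
    "Remdesivir clinical trial results for COVID-19",
    "Bat coronavirus genome sequencing",
    "Clinical features of MERS patients",
]
ranked = bm25_rank("remdesivir trial", titles)
```

BM25 matches only exact tokens, which is one reason to pair it with Word2vec: embedding neighbors of a query term can be added to the query to catch synonyms.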

Neural Fingerprint Method for Chemical Compound Characterization

CSI implemented a neural fingerprint method that can enable finding and comparing similar drug/chemical compounds quickly. It is important to have these capabilities as there are more than 10 billion compounds in the search space. This method currently is in the process of being integrated into an Argonne National Laboratory workflow.
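Once each compound has a fixed-length fingerprint vector, finding similar compounds reduces to a nearest-neighbor search over those vectors. The sketch below assumes cosine similarity as the comparison measure and uses made-up two-dimensional vectors; the text does not state which measure or vector size the CSI method uses, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_fp, library, top_k=3):
    """Return the top_k (name, similarity) pairs from a {name: fingerprint} library."""
    scored = [(name, cosine_similarity(query_fp, fp)) for name, fp in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy fingerprints; real neural fingerprints would have many more dimensions.
library = {
    "compound_a": [1.0, 0.0],
    "compound_b": [0.0, 1.0],
    "compound_c": [1.0, 1.0],
}
hits = most_similar([1.0, 0.1], library, top_k=2)
```

A linear scan like this one would not scale to billions of compounds; at that size an approximate nearest-neighbor index would be the usual choice.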

Simplified Molecular-Input Line-Entry System (SMILES) Searching

Samuel Chen, a senior technology analyst with CSI, developed a tool and user interface that searches for simplified molecular-input line-entry system (SMILES) strings in publications. The service currently is being set up at Brookhaven's Scientific Data Computing Center (SDCC) and will be accessible to collaborators by April 1, 2020. He also aims to link the tool with the neural fingerprint work of Ray Ren (also from CSI) so that an article search also returns related chemicals. In addition, Chen is working on linking the search results with PubMed articles and automatically tagging articles with open data (e.g., DBpedia).

Drug and Vaccine AI/ML Toolkit

This project features a high-throughput pipeline of open-source artificial intelligence/machine learning (AI/ML) tools and conventional physics-based simulations that accelerate drug and vaccine development. The pipeline comprises four distinct stages, each of which is a self-contained workflow. Together, the stages rapidly filter, rank, and search for small molecules across widely available chemical libraries and integrate virtual screening (computational drug discovery) techniques. The pipeline also accelerates adaptive conformational sampling of the viral proteins to identify potential novel binding site pockets that can be targeted by small molecules. The code for these workflows can be found at: https://github.com/2019-ncovgroup/DrugWorkflows.

Collaborators at Argonne National Laboratory have used these pipelines to advance millions of drug candidates through initial stages and have produced an initial draft of promising candidates.

ExaLearn Exascale Computing Project COVID-19 Response

Using state-of-the-art machine learning methods, ExaLearn, an ECP Co-design Center for Exascale Machine Learning Technologies, will develop computationally fast and highly accurate surrogates to emulate the extremely expensive, large-scale epidemiological simulations currently in use. These surrogate models then will be used to accelerate the training of reinforcement learning algorithms that develop policies to help decision makers understand how to apply mitigation strategies over time. ExaLearn also will develop machine learning models to design proposed drug molecules and predict multiple properties for them (e.g., toxicity and solubility). ExaLearn will deploy third-wave artificial intelligence (AI) to rapidly generate libraries of small molecules with the potential to inhibit essential enzymes encoded in the genome of SARS-CoV-2, the virus that causes COVID-19. As part of this ExaLearn effort, Brookhaven will be closely involved in the epidemic modeling and, most likely, molecular design aspects.

KBase (Predictive Biology)

Discussions have started to determine whether KBase, DOE's Systems Biology Knowledgebase, could become a repository for experimental COVID-19 results. The BNL team would contribute artificial intelligence (AI)-based pipelines for data analysis.