Machine Learning Group

Machine-Learning-based Table Extraction for the ENSDF Modernization Project

CSI is developing machine learning (ML) and natural language processing (NLP) methods for automatic table extraction from non-machine-readable documents. Tables are ubiquitous, high-density information resources, yet their contents often cannot be accessed or processed in an automated manner. This is a particular problem for PDFs, which are not machine-readable (as is apparent to anyone who has attempted to copy and paste contents out of a PDF). Current tools for table extraction from PDFs are rule-based, do not scale, and produce many extraction artifacts. CSI's approach is vision- and NLP-based: it automatically detects tables, their contents, and their structure to enable automated extraction at scale.

This work is part of an ongoing modernization effort for the Evaluated Nuclear Structure Data File (ENSDF) database, which is managed by Brookhaven Lab’s National Nuclear Data Center (NNDC) [1]. As part of the nuclear data pipeline, ENSDF is maintained by expert evaluators who collect, evaluate, and disseminate nuclear physics data sourced from published experimental results. Their work includes gathering tabular contents from many published PDFs and, in practice, often means spending significant time fixing table extraction errors before moving on with the evaluation itself. The ML-based table extractor presents a valuable opportunity to accelerate this part of the workflow and may similarly serve to accelerate and scale many other document processing efforts.

Existing tools for table extraction from PDFs work by reading a PDF file’s underlying encoding, filtering page elements contained within a table’s bounding box (often selected manually), and applying a complex series of heuristic rules to reconstruct the table in an editable form (e.g., CSV or XML). This approach does not scale and is prone to extraction errors. For a handful of tables, these errors are only an inconvenience because they can be fixed manually in a few minutes. When many documents (hundreds to millions) must be processed, however, manual correction becomes an insurmountable challenge.
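To make the rule-based pipeline concrete, the following is a minimal sketch of the three steps described above: take positioned text elements, keep those inside a table bounding box, and apply a heuristic rule to rebuild rows. It is not the implementation of any particular tool. The word tuples, the bounding box, the function names, and the `y_tol` threshold are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical input: each word is (text, x0, y0, x1, y1) in page
# coordinates, with y increasing downward.

def words_in_bbox(words, bbox):
    """Keep words whose center lies inside the table bounding box."""
    bx0, by0, bx1, by1 = bbox
    kept = []
    for text, x0, y0, x1, y1 in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if bx0 <= cx <= bx1 and by0 <= cy <= by1:
            kept.append((text, x0, y0, x1, y1))
    return kept

def group_rows(words, y_tol=3.0):
    """Heuristic rule: a word joins the current row if its vertical
    center is within y_tol of that row's first word."""
    rows = []
    for w in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        cy = (w[2] + w[4]) / 2
        if rows and abs(cy - rows[-1]["cy"]) <= y_tol:
            rows[-1]["words"].append(w)
        else:
            rows.append({"cy": cy, "words": [w]})
    # Within each row, order words left to right.
    return [sorted(r["words"], key=lambda w: w[1]) for r in rows]

def table_to_csv(words, bbox, y_tol=3.0):
    """Filter, group into rows, and serialize one word per CSV field."""
    rows = group_rows(words_in_bbox(words, bbox), y_tol)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([w[0] for w in row])
    return buf.getvalue()

# Made-up page content: a two-row table plus a page footer.
example_words = [
    ("Level", 10, 10, 40, 20), ("E(keV)", 60, 10, 100, 20),
    ("2+", 10, 30, 25, 40), ("846.8", 60, 30, 95, 40),
    ("Page 3", 10, 200, 50, 210),
]
example_bbox = (0, 0, 120, 50)

print(table_to_csv(example_words, example_bbox), end="")
```

On this clean input the rules recover both rows. The sketch is fragile in the way the paragraph above describes: with these made-up coordinates, a footnote marker set slightly above the second row's baseline falls outside `y_tol` and is emitted as a spurious row of its own, and each new layout quirk tends to need another hand-written rule.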

Rather than processing a PDF’s underlying encoding, CSI treats table extraction as a visual problem, applying visual ML methods to first detect all tables in a document and then extract and order their contents. This approach uses the Table Transformer (TATR), which is built on the Detection Transformer (DETR) architecture and pre-trained on approximately 1 million machine-annotated tables [2,3]. TATR also casts the critical structure recognition task as a specialized object detection problem via auxiliary annotations, which improves table structure quality. CSI is evaluating the adaptability of several optical character recognition (OCR) models to read the table cell contents, which may include superscripts, subscripts, and non-Latin characters.

Together with NNDC, CSI has developed a web interface and its supporting ML back end for ENSDF evaluators to use in their evaluation workflow. When users drop in a PDF document, all tables are automatically detected and made selectable. Upon selection, the tool generates an easy-to-copy HTML or CSV version of the extracted table contents. The web interface connects to a back-end ML server that processes the PDFs with the trained models and returns extraction results. CSI continues to optimize this ML server and to make fixes and improvements to the PDF processing pipeline and user interface.
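One way to picture structure recognition as object detection is to assume the structure model returns bounding boxes for table rows and table columns, and that an OCR step returns positioned text tokens. Cells are then the intersections of row and column boxes, and each token is placed in the cell containing its center. The sketch below illustrates that post-processing step only. It is not CSI's pipeline or the TATR code, the box and token values are made up, all function names are hypothetical, and spanning cells and header detection are ignored.

```python
import csv
import io

# Hypothetical inputs: boxes are (x0, y0, x1, y1) in image pixels with
# y increasing downward; tokens are (text, x0, y0, x1, y1) from OCR.

def build_grid(row_boxes, col_boxes):
    """Cells are the intersections of detected row and column boxes."""
    rows = sorted(row_boxes, key=lambda b: b[1])  # top to bottom
    cols = sorted(col_boxes, key=lambda b: b[0])  # left to right
    return [[(c[0], r[1], c[2], r[3]) for c in cols] for r in rows]

def locate(grid, x, y):
    """Return (row, col) of the first cell containing the point, or None."""
    for i, row in enumerate(grid):
        for j, (x0, y0, x1, y1) in enumerate(row):
            if x0 <= x <= x1 and y0 <= y <= y1:
                return i, j
    return None

def extract_table(row_boxes, col_boxes, tokens):
    """Assign each token to the cell that contains its center point."""
    grid = build_grid(row_boxes, col_boxes)
    cells = [[[] for _ in row] for row in grid]
    # Naive reading order inside a cell: by top edge, then left edge.
    for text, x0, y0, x1, y1 in sorted(tokens, key=lambda t: (t[2], t[1])):
        hit = locate(grid, (x0 + x1) / 2, (y0 + y1) / 2)
        if hit is not None:  # tokens outside every cell are dropped
            i, j = hit
            cells[i][j].append(text)
    return [[" ".join(cell) for cell in row] for row in cells]

def to_csv(table):
    """Serialize the cell grid as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue()

# Made-up detections, deliberately out of order to show the sorting step.
row_boxes = [(0, 20, 100, 40), (0, 0, 100, 20)]
col_boxes = [(50, 0, 100, 40), (0, 0, 50, 40)]
tokens = [
    ("846.8", 55, 24, 90, 36), ("Level", 5, 4, 40, 16),
    ("E(keV)", 55, 4, 95, 16), ("2+", 5, 24, 20, 36),
]

print(to_csv(extract_table(row_boxes, col_boxes, tokens)), end="")
```

Because the grid comes from detected boxes rather than from the PDF's text encoding, the same assignment step applies to scanned pages as well, provided the detector and OCR outputs are reliable.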

Publications

  1. Hayes, E. McCutchan, S. Yoo, A. Mattera, S. McCorkle, B. Shu, A. Sonzogni, C. Soto, S. Zhu, F. Kondev et al., “Modernization and expansion of the evaluated nuclear structure data file database (ENSDF),” Bulletin of the American Physical Society, vol. 66, 2021.
  2. B. Smock, R. Pesala, and R. Abraham, “PubTables-1M: Towards comprehensive table extraction from unstructured documents,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4634–4642, 2022.
  3. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision, pp. 213–229, 2020.