Posted by: 抗性基因网 Date: 2023-06-08
Abstract
Antibiotic resistance has become a growing worldwide concern as new resistance mechanisms emerge and spread globally, so detecting and collecting its cause, antibiotic resistance genes (ARGs), is more critical than ever. In this work, we aim to automate the curation of ARGs by extracting ARG-related assertive statements from scientific papers. To support research in this direction, we build SciARG, a new benchmark dataset containing 2,000 manually annotated statements as the evaluation set and 12,516 silver-standard training statements that are created automatically from scientific papers by a set of rules. To establish baseline performance on SciARG, we exploit three state-of-the-art neural architectures based on pre-trained language models and prompt tuning, and further ensemble them to attain the highest F-score of 77.0%. To the best of our knowledge, we are the first to leverage natural language processing techniques to curate all validated ARGs from scientific papers. Both the code and data are publicly available at https://github.com/VT-NLP/SciARG.
https://aclanthology.org/2022.bionlp-1.40/
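
The abstract reports that three models are ensembled and scored by F-score, but it does not say how the ensemble is formed. The sketch below is only an illustration of one common option, per-statement majority voting over binary labels, followed by an F-score computation. The function names (`majority_vote`, `f_score`), the label encoding (1 = ARG-related assertion, 0 = other), and the toy predictions are all assumptions made for this example and are not taken from the SciARG code or data.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine several models' label lists by per-item majority vote.

    `predictions` is a list of equal-length label lists, one per model.
    (Hypothetical helper; the paper's actual ensembling may differ.)
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


def f_score(gold, pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
gold = [1, 0, 1, 1, 0]
model_a = [1, 0, 0, 1, 0]
model_b = [1, 1, 1, 0, 0]
model_c = [0, 0, 1, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print("ensemble labels:", ensemble)
print("model_a F-score:", round(f_score(gold, model_a), 3))
print("ensemble F-score:", round(f_score(gold, ensemble), 3))
```

In this toy case each model makes a different mistake, so the vote corrects all of them. That is the usual motivation for ensembling, though real gains depend on how correlated the models' errors are.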