Journal Article

HUNER: Improving Biomedical NER with Pretraining

Authors

Weber, Leon
External Organizations;


Münchmeyer, J.
2.4 Seismology, 2.0 Geophysics, Departments, GFZ Publication Database, Deutsches GeoForschungsZentrum;

Rocktäschel, Tim
External Organizations;

Habibi, Maryam
External Organizations;

Leser, Ulf
External Organizations;

External Resource
No external resources are shared
Fulltext (public)

4362895.pdf
(Postprint), 979 KB

Supplementary Material (public)
There is no public supplementary material available
Citation

Weber, L., Münchmeyer, J., Rocktäschel, T., Habibi, M., Leser, U. (2020): HUNER: Improving Biomedical NER with Pretraining. - Bioinformatics, 36, 1, 295-302.
https://doi.org/10.1093/bioinformatics/btz528


Cite as: https://gfzpublic.gfz-potsdam.de/pubman/item/item_4362895
Abstract
Motivation: Several recent studies showed that the application of deep neural networks advanced the state-of-the-art in named entity recognition (NER), including biomedical NER. However, the impact on performance and the robustness of improvements crucially depend on the availability of sufficiently large training corpora, which is a problem in the biomedical domain with its often rather small gold-standard corpora.

Results: We evaluate different methods for alleviating the data sparsity problem by pretraining a deep neural network (LSTM-CRF), followed by a rather short fine-tuning phase focusing on a particular corpus. Experiments were performed using 34 different corpora covering five different biomedical entity types, yielding an average increase in F1-score of ∼2 pp compared to learning without pretraining. We experimented with both supervised and semi-supervised pretraining, leading to interesting insights into the precision/recall trade-off. Based on our results, we created the stand-alone NER tool HUNER incorporating fully trained models for five entity types. On the independent CRAFT corpus, which was not used for creating HUNER, it outperforms the state-of-the-art tools GNormPlus and tmChem by 5-13 pp on the entity types chemicals, species and genes.

Availability: HUNER is freely available at https://hu-ner.github.io. HUNER comes in containers, making it easy to install and use, and it can be applied off-the-shelf to arbitrary texts. We also provide an integrated tool for obtaining and converting all 34 corpora used in our evaluation, including fixed training, development and test splits to enable fair comparisons in the future.