Wet Lab Protocol Corpus and Tagger

An Annotated Corpus for Machine Reading of Instructions in Wet Lab Protocols
Chaitanya Kulkarni*, Wei Xu, Alan Ritter, Raghu Machiraju*
Proceedings of NAACL 2018 (short paper)

We describe an effort to annotate a corpus of natural language instructions consisting of 622 wet lab protocols to facilitate automatic or semi-automatic conversion of protocols into a machine-readable format and benefit biological research. Experimental results demonstrate the utility of our corpus for developing machine learning approaches to shallow semantic parsing of instructional texts.

What are Wet Lab Protocols?

Wet labs conduct biology and chemistry experiments, typically involving chemicals, drugs, or other materials in liquid solutions or volatile phases. Wet lab protocols, then, are the rules or instructions that guide such experiments. They consist of a sequence of steps, mostly imperative statements that each describe an action. They can also contain declarative sentences describing the results of a previous action, as well as general guidelines or warnings about the materials being used.

An example protocol, as seen in Figure 1 of our paper.

Research groups around the world curate their own repositories of protocols, each adapted from a canonical source and typically published in the Materials and Methods section at the end of a scientific article in the fields of biology and chemistry. Only recently has there been an effort to gather collections of these protocols and make them easily available. Leveraging an openly accessible repository of protocols curated on the platform, we annotated hundreds of academic and commercial protocols maintained by many of the leading bio-science laboratory groups, including Verve Net, Innovative Genomics Institute and New England Biolabs. The protocols cover a large spectrum of experimental biology, including neurology, epigenetics, metabolomics, and cancer and stem cell biology.

600+ Wet Lab Protocol Corpus

We present the Wet Lab Protocol (WLP) dataset, composed of 13,679 sentences from 622 protocols. Each sentence was manually annotated with actions and a rich set of 17 semantic arguments, along with a total of 12 relationships between them. We aim for it to serve as a resource for developing machine learning models for automatic semantic parsing of instructional texts. Additional information regarding the actions, entities and relations can be found here.

An annotated sentence from one of the protocols in our corpus, visualized with the BRAT tool.

The dataset was annotated with the BRAT tool. The annotations for each protocol are stored separately from the protocol text in a .ann file, which follows the same format used by the BRAT tool for easy visualization. The annotation format is described in detail on the BRAT tool website here.
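As a rough illustration of how such a .ann file might be read, the sketch below parses the two most common line types of the BRAT standoff format: text-bound annotations (IDs starting with T) and binary relations (IDs starting with R). The sample sentence, the labels (Action, Amount, Reagent, Acts-on) and the helper names (Entity, Relation, parse_ann) are assumptions made for this example only and are not taken from the corpus; other BRAT line types, such as events or notes, are simply skipped here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Entity:
    id: str
    label: str
    spans: List[Tuple[int, ...]]  # (start, end) character offsets into the text
    text: str


@dataclass
class Relation:
    id: str
    label: str
    arg1: str  # ID of the first argument, e.g. "T1"
    arg2: str  # ID of the second argument, e.g. "T3"


def parse_ann(ann_text: str) -> Tuple[Dict[str, Entity], List[Relation]]:
    """Parse T (text-bound) and R (relation) lines of BRAT standoff text."""
    entities: Dict[str, Entity] = {}
    relations: List[Relation] = []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T") and len(fields) >= 3:
            # Format: ID <TAB> LABEL START END[;START END...] <TAB> TEXT
            label, _, offsets = fields[1].partition(" ")
            spans = [
                tuple(int(x) for x in fragment.split())
                for fragment in offsets.split(";")
            ]
            entities[ann_id] = Entity(ann_id, label, spans, fields[2])
        elif ann_id.startswith("R") and len(fields) >= 2:
            # Format: ID <TAB> LABEL Arg1:Tx Arg2:Ty
            label, arg1, arg2 = fields[1].split()[:3]
            relations.append(
                Relation(ann_id, label, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
        # Any other line type (events, attributes, notes) is ignored in this sketch.
    return entities, relations


# Hypothetical protocol sentence and annotations, invented for illustration.
SAMPLE_TEXT = "Add 5 ml of buffer."
SAMPLE_ANN = (
    "T1\tAction 0 3\tAdd\n"
    "T2\tAmount 4 8\t5 ml\n"
    "T3\tReagent 12 18\tbuffer\n"
    "R1\tActs-on Arg1:T1 Arg2:T3\n"
)

entities, relations = parse_ann(SAMPLE_ANN)
for rel in relations:
    print(entities[rel.arg1].text, f"--{rel.label}-->", entities[rel.arg2].text)
```

Because the annotations live apart from the text, the offsets can be checked against the protocol text itself; for instance, slicing SAMPLE_TEXT with the span stored for T3 should return the annotated string.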


