Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework

Corresponding author: Prof. Fabio Polticelli, Department of Biology, University of Roma Tre, Viale G. Marconi 446, 00146 Rome, Italy. E-mail: polticel@uniroma3.it Phone: +39-06-57336362 Fax: +39-06-57336321

This paper describes a methodology for discovering and resolving protein name abbreviations from the full-text versions of scientific articles, implemented in the PRAISED framework with the ultimate purpose of building up a publicly available abbreviation repository. Three processing steps lie at the core of the framework: i) an abbreviation identification phase, carried out via domain-independent metrics, whose purpose is to identify all possible abbreviations within a scientific text; ii) an abbreviation resolution phase, which takes into account a number of syntactical and semantic criteria in order to match an abbreviation with its potential explanation; and iii) a dictionary-based protein name identification, which is meant to select only those abbreviations belonging to the protein science domain. A local copy of the UniProt database is used as a source repository for all the known proteins. The PRAISED implementation has been tested against several known annotated corpora, such as the Medstract Gold Standard Corpus, the AB3P Corpus, the BioText Corpus and the Ao and Takagi Corpus, obtaining significantly high levels of recall and extremely fast performance, while also keeping promising levels of precision and overall f-measure, in comparison to the most relevant similar methods. This comparison has been carried out up to Phase 2, since those methods stop at expanding abbreviations, without performing any entity recognition. Instead, the entity recognition performed in the last phase provides PRAISED with an effective strategy for protein discovery, thus moving further from existing context-free techniques. 
Furthermore, this implementation also addresses the complexity of full-text papers, instead of the simpler abstracts more generally used. As such, the whole PRAISED process (Phases 1, 2 and 3) has also been tested against a manually annotated subset of full-text papers retrieved from the PubMed repository, with significant results as well.

KEYWORDS: proteins, abbreviations, data mining, extraction, resolution.

Introduction

Abbreviations, in the form of acronyms, aliases or simply short versions of longer names, are commonly scattered throughout scientific publications. Researchers constantly have to deal with abbreviations both when reading and when writing a paper, either to understand and trace back their meaning or to come up with new ones to be included in their next publication. In the biomedical community, consistent nomenclature of proteins and their corresponding abbreviations is of utmost importance for knowledge dissemination and for gene/protein sequence database searching and retrieval. However, there are no generally accepted rules for naming novel proteins and abbreviating the corresponding names, and writing guidelines or suggested best practices are often disregarded. There are indeed many ambiguities, with different proteins with similar names sharing the same abbreviation. Therefore, an additional layer of complexity has been placed on top of the already unstructured data of which any plain text is usually made. This is a major problem both in the biomedical literature and in sequence databases, generating confusion and errors. Furthermore, genomic initiatives have led to the discovery of a tremendous number of novel proteins and, as a consequence, to an explosion of protein name abbreviations whose correct resolution requires a clean and effective strategy. 
Tackling this problem is no easy feat, given the data deluge itself and the inner complexity of the biological domain, which features thousands of often ambiguous short names and acronyms for proteins as well as for small molecules, chemicals and so forth. Further complexity is represented by the increasing use of protein names made up of numbers and letters which have little or no reference to the actual structural and functional class of the protein itself (e.g. p53BP2, p53 binding protein 2). Generalist resources such as the web portals Abbreviations.com or Acronymfinder.com, put to the test with some common protein abbreviations, prove unable to address the complexity and the chaotic character of protein abbreviations, and thus more complex approaches are needed to solve this problem. In addition, static resources are inadequate for a field such as protein science, in which the number of known proteins increases almost exponentially with time. In this area, several research groups have proposed methodologies for discovering acronyms within a source text, ranging from general approaches to more specific techniques. These include the use of regular expressions [1], linguistic cues and pattern-based recognition [2,3,4,5,6,7], as well as machine learning algorithms, natural language processing and mixed methods [8,9,10]. Earlier approaches [2,3,4] limited the identification patterns to all-uppercase words or words with at least one uppercase letter, and the resolution patterns to strictly adjacent words, only the initial letters of which could participate in the abbreviation matching. Others restricted acronyms to parenthesis-enclosed words [1,6], and placed limits on capital letters and word length [5]. The well-known algorithm proposed in [6] achieved 96% precision and 82% recall on the Medstract Gold Standard Corpus. [5] reached 98% precision and 95% recall on a very small testing set. 
[9] focused on single words between brackets using dynamic programming to check for an abbreviation explanation to the left of the bracket-enclosed word, scoring 80% precision and 83% recall on the Medstract corpus. Some subsequent proposals [7,10] shifted their attention more specifically towards the biomedical world. In detail, [7] used pattern-matching rules based on the correspondence between the initial characters of contiguous words and the abbreviation letters, and obtained an average 95% precision and 70% recall on a small set of biomedical papers. [10], instead, took advantage of the method proposed in [11] to first identify proteins within a text, and then, assuming the identified protein names were all correct, to try and map those names to their corresponding abbreviations. They claimed 98% precision and 96% recall on biomedical abstracts, under however the assumption of correctness of the previous protein name identification step. More recently, [12] tried to build an abbreviation repository by using a machine learning approach to extract and resolve short form-long form (also called SF-LF from now on) pairs. The tool they developed (which is still available online, unlike most of the systems just mentioned) focused nevertheless on paper abstracts, and built its abbreviation archive accordingly. Its resolution rules fall short when a complex full-text paper is provided as its input, showing less accurate results. Furthermore, they stopped at matching abbreviations with their corresponding explanations, without trying to resolve the matched explanations as entities of the biological domain. A similar tool, based instead on custom patterns, is mentioned in [13] and is currently available as well. The study in [14], although unpublished at the time of this writing, accurately compared the resolution methods of [6], [12] and [13] on the BioText, AB3P and A&T corpus, being the major currently available approaches to the abbreviation resolution problem. 
This study will be used later in the paper as a basis for comparing the results obtained. Finally, some broader-scoped scientific initiatives in the biomedical domain (like the effort made in [15]) are worth mentioning, embracing a substantial number of information retrieval problems, like text categorization, question-answering and the automatic search for logical/conceptual relationships among biomedical concepts. In comparison, the approach proposed in this paper concentrates on a more specific information extraction task, so that it might bring some order to the chaotic domain of protein abbreviations. To sum up, most of the existing approaches tackling a similar problem, as pointed out so far, impose strong constraints upon the candidate abbreviations, or employ fixed recognition patterns when trying to discover their origin within the text, thus severely limiting recall. As a consequence, their scope is usually restricted to abstracts only, where a narrower variety of abbreviation forms can be detected. Also, performance in terms of execution time is seldom if ever mentioned, and the overall results are not easily comparable, since modified corpora are often used to test the resolution algorithms. The major abbreviation resolution approaches, as said earlier, also stop at the abbreviation expansion and do not match the resulting explanations with biomedical entities. This paper describes a lightweight strategy for identifying and resolving protein abbreviations found in the full text of biological papers, implemented in a framework for the Identification, Disambiguation and StoragE of Protein-Related Abbreviations (PRAISED). 
The PRAISED execution consists in a three-phase process where (i) candidate abbreviations are detected within a full text, based on lexical clues and exclusion rules (Abbreviation Identification); (ii) abbreviations (SF) are matched with their potential explanation (LF), using syntactical and semantic criteria combined with fitting optimization techniques (Abbreviation Resolution); and (iii), resulting SF-LF pairs are sorted out according to the domain of interest (Protein Name Identification). A preliminary outline of the first two phases of this process has been previously introduced in poster paper form in [16]. In this regard, it is important to underline that the PRAISED discovery process has been designed for and built upon the biological domain; even so, as far as the abbreviation identification and resolution steps are concerned, it features strategies which might in principle be applied and extended to other domains of interest, such as medical abbreviations. In fact, only a minimal domain-specific tuning is performed during the first two phases, and most of the criteria used are not strictly tied to the considered domain. The PRAISED framework (up to phase 2) has been tested against four annotated corpora: the Medstract Gold Standard [17], the AB3P [18], the BioText [19] and the A&T Corpus [20], obtaining results comparable to other available abbreviation resolution methods, despite the computationally lightweight character of the PRAISED implementation. Specifically, very high levels of recall have been detected, often higher than similar approaches, along with promising results of precision and consequently f-measure, the latter being a fundamental parameter to assess the quality of information extraction systems. The whole PRAISED execution has been tested against a manually annotated set of randomly chosen full-text articles, where it clearly showed all of its resolution potential. 
From this viewpoint, the uniqueness of PRAISED lies in its ability to approach the problem of abbreviation identification and resolution on full-text articles. In fact, PRAISED is able to achieve results on full-text papers comparable to those obtained with other tools on the much simpler problem of paper abstracts. This is not a matter of a mere difference in length: full-text papers display a far higher level of complexity than their respective abstracts, in terms of the disparate forms of the protein names and abbreviations featured in them.

Methods

As already reported in the Introduction section, the PRAISED execution consists of a three-phase process. These phases, along with an overview of the system architecture, are depicted in Figure 1. In the first phase (Abbreviation Identification) candidate abbreviations are detected within a full text; in the second phase (Abbreviation Resolution) abbreviations (SF) are matched with their potential explanation (LF); in the third phase (Protein Name Identification) the resulting SF-LF pairs are sorted out according to the domain of interest. The methodological details of each phase are described below.

Figure 1. System architecture of the PRAISED framework and high-level overview of the whole abbreviation discovery process.

Phase one: Abbreviation Identification

The first phase of the PRAISED process consists in identifying abbreviations within a text and is mainly based on a series of syntactical and lexical checks. Overall, this phase involves a tokenization of the input text, a ranking assignment for each of the considered words and the selection of candidate abbreviations based on a score threshold.

Exclusion rules

A preliminary cleaning of the text is performed so that a certain set of skippable words is not passed to the actual ranking process. 
This is done first by applying general-purpose rules, which are meant to remove stop-words (and, of, or etc.), a list of known, recurring non-acronym words (Fig., Table etc.), and words matching some known patterns to be excluded (like words containing no letters, or one-character terms). In addition, a collection of domain-oriented exclusion rules is applied, so that the subset of words later passed to the resolution phase is even more accurate. Thus, amino acids (Tyr382, Asp383 etc.), ions (Fe2+ etc.) and nucleotides (GTX, A(G/A)(A/G/T) etc.) are filtered out, since they might otherwise be wrongly captured by the subsequent lexical checks because of their abbreviation-like form. These rules, specifically tuned for biological papers, can be swapped in and out, and other rules for different domains can be defined. After the application of the exclusion rules, the ranking process takes place, via the consecutive checks listed below.

Length check and decomposition of compound words

The first check establishes whether the considered word is reasonably short. If that is the case, the rank is increased accordingly, since an abbreviation is very likely to be short. Here, the presence of special suffixes and compound words, especially those delimited by linguistic elements like slashes, is checked as well. If suffixes are found (the most notable being "like"), they are removed and the remaining word is ranked according to the subsequent metrics. If the system is dealing with a composite abbreviation (e.g. LysoPAFAT/LPCAT2), the latter is split accordingly and its component words are individually ranked; the final rank is the average of the individual ones.

Plain bracket check

The second check tests a word for the presence of plain brackets, either left, right or both. 
Rank is increased upon successful discovery of parentheses, proportionally to their number (a word perfectly enclosed within brackets is very likely to be an abbreviation), and the fully or partially bracket-enclosed words are cleaned of brackets and passed to the final ranking stage.

Multiple lexical checks

The final ranking step consists of a composition of syntactical metrics based on linguistic elements, and is made up of the intertwined checks (where a single word can pass several checks) listed in Table 1. Those words ending up with a rank ≥ 6 after all the aforementioned checks are selected and passed to Phase 2. Along with each candidate abbreviation, a set of contiguous words is stored, to be subsequently used in the resolution phase as the search space for the potential abbreviation origin. The size s of each set of words (either left or right with respect to the candidate abbreviation) depends on the candidate abbreviation's length: s = n + k, where n is the abbreviation's length and k is a configurable factor. Manual testing and tuning have resulted in the current value of k = 2.

Phase 2: Abbreviation Resolution

The second phase of the process is responsible for trying to match a candidate abbreviation with its potential explanation among the contiguous words previously stored.

Pre-processing: detection of the abbreviation's building elements

Before the abbreviation resolution passes, a pre-processing step is performed, where the considered abbreviation is split into its basic sub-elements, roughly corresponding to each of its characters: letters are individually split, but contiguous digits are treated as a single unit. For instance, the elements resulting from Cyp33 will be C, y, p and 33.

Table 1. Syntactical metrics checks used for the final ranking step of the abbreviation identification phase (Phase 1). UL, uppercase letters; LL, lowercase letters. 
Syntactical check | Rank increase
all UL | +4
all UL + at most a trailing lowercase s | +3
all UL + numbers with at least a letter | +4
all UL + dashes or underscores with at least a letter | +1
more UL than dashes or underscores (if the previous check succeeds) | +2
all UL or numbers or dashes or underscores with at least a letter | +2
more UL than numbers or dashes or underscores (if the previous check succeeds) | +2
some LL | +0
more UL than LL (if the previous check succeeds) | +3
initial UL (if "some LL" check succeeds) | +1
some numbers (if the previous check succeeds) | +3

Examples given in the table: ABCD; ABCDs; 1ABCD, A123, 1A32; A{ -ABC; 1A{ -ABC; 1-2; abcdA; aBCDEf); Abcde; Ab12cd3

The purpose of the subsequent resolution phase is to match each sub-element with a term among the contiguous words the system has stored: the resulting match ratio mr is therefore computed as (Em/E) * 100, where Em is the number of matched elements and E is the total number of elements. The search space within which these matches are looked for is first the set of words preceding the abbreviation, properly tokenized by dashes or other relevant connectors. Empirical tests have shown that the likelihood of an abbreviation being explained is greater among the words that immediately precede it, while it is lower among the words that immediately follow it. If mr ≥ 50 is not reached at the end of all the passes, then the whole resolution process is repeated by taking into account the words that follow the abbreviation as the new search space. In this latter case, the threshold for a successful resolution is set at mr ≥ 50 as well.

First pass: matching initial characters

The actual resolution process begins by scanning the various elements of the candidate abbreviation to check whether there are terms starting with those elements within the search space. 
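The pre-processing split, the first pass and the match ratio can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PRAISED implementation: the function names, the left-to-right scanning order, the case-insensitive comparison and the representation of matches as word indices (None for an unmatched element) are all hypothetical choices made for the example.

```python
import re


def split_elements(abbreviation):
    """Split an abbreviation into sub-elements: single letters, runs of digits."""
    return re.findall(r"\d+|[A-Za-z]", abbreviation)


def tokenize(text):
    """Split the stored contiguous words on whitespace and dashes (assumed connectors)."""
    return [token for token in re.split(r"[\s\-]+", text) if token]


def first_pass(elements, words):
    """For each element, find the next word starting with it.

    Returns one entry per element: the index of the matched word, or None.
    Scanning left to right without reusing a word is an assumption of this sketch.
    """
    matches = []
    start = 0
    for element in elements:
        found = None
        for i in range(start, len(words)):
            if words[i].lower().startswith(element.lower()):
                found = i
                break
        matches.append(found)
        if found is not None:
            start = found + 1
    return matches


def match_ratio(matches):
    """mr = (Em / E) * 100, with Em matched elements out of E elements."""
    matched = sum(1 for m in matches if m is not None)
    return matched / len(matches) * 100


print(split_elements("Cyp33"))
yac = first_pass(split_elements("YACs"), tokenize("yeast artificial chromosomes"))
print(yac, match_ratio(yac))
glur = first_pass(split_elements("GluR"), tokenize("glutamate receptor"))
print(glur, match_ratio(glur))
```

In this sketch the trailing s of YACs and the lowercase l and u of GluR stay unmatched after the first pass, which is why the match ratio alone is not enough at this point and further passes are applied.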
The seeming simplicity of this step may hide its effectiveness, as most "standard" abbreviations, from a wide range of different domains, fall into the category "A Beautiful Concept (ABC)," and are therefore correctly resolved right after this check.

Second pass: checking for trailing "s"

Many abbreviations are cited in scientific papers as plural nouns and are consequently explained as such. Dendritic epidermal T cells (DETCs), toll-like receptors (TLRs) and yeast artificial chromosomes (YACs) are all examples in this regard. The trailing lowercase "s", which the first pass obviously cannot match with any initial character within the search space, is checked by the second pass; if the last matched word is actually a plural noun (as cells, receptors and chromosomes are in the previous examples), the match ratio is updated accordingly.

Third pass: checking for spelled-out numbers

The explanation of an abbreviation can also contain spelled-out cardinal or ordinal numbers that might correspond to actual digits featured as elements of the abbreviation itself, in any number of positions (usually at the beginning or at the end of the explanation). Third plant homeodomain (PHD3) is a significant example of these particular cases. The third resolution pass is meant to check for the correspondence between digits as abbreviation sub-elements and their spelled-out version in the search space.

Fourth pass: combining lowercase letters

After the first three passes, there still might be syntactically unmatched elements of the considered abbreviation. An interesting subset of abbreviations is structured in a way that a multi-letter prefix is used instead of a single initial character. This is the case of abbreviations like glutamate receptor (GluR), lidocaine (Lid) or murine leukemia virus (MuLV). 
The fourth pass tries to find unresolved elements of the abbreviation that are actually lowercase letters, and checks whether they can be combined with previously resolved elements (usually uppercase letters) to form the prefix of some word within the search space, generally already matched with another abbreviation element. As in one of the examples shown above, the unresolved elements l and u of GluR will be combined with the already resolved G, and the resulting compound will be associated with glutamate, to which the element G had already been matched. Elements resolved in this fashion are deemed matched as well, increasing the overall match ratio.

Fifth pass: combining uppercase letters

This pass is similar to the previous one, since it takes into consideration unmatched elements of the considered abbreviation which might be combined with other, previously resolved elements in order to form the prefix of a matched word. The difference lies in its focus on unresolved uppercase letters. For instance, in vasodilator-stimulated phosphoprotein (VASP), the unmatched element A will be combined with its previous element V and matched with vasodilator, with which the element V had already been associated. The match ratio is increased accordingly.

Sixth pass: scattered lowercase letters

This pass checks whether there are unresolved lowercase letters of the considered abbreviation which might be found within the word matched with their previously resolved element. Relevant examples are transferrin receptor (TfR) and allatostatin receptor (AlstR), whose unmatched lowercase letters in the middle (f for the former and s, t for the latter) can actually be found within the matched word of the resolved element placed immediately before them. As in the previous passes, the match ratio is increased accordingly if such correspondences are detected. 
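The prefix combination of the fourth and fifth passes and the scattered-letter lookup of the sixth pass can be sketched in Python as follows. This is a hedged illustration, not the PRAISED code: the function names, the representation of matches as a list of word indices (None for an unmatched element) and the rule that scattered letters are searched left to right after the characters already used are assumptions of the sketch. The input match lists are the ones a plain initial-character pass would produce for each example.

```python
def combine_prefix(elements, words, matches, want_upper):
    """Passes 4 (want_upper=False) and 5 (want_upper=True).

    An unmatched letter is joined to the preceding elements already matched
    with one word; if the compound is a prefix of that word, the letter is
    matched with the same word.
    """
    matches = list(matches)
    for i, element in enumerate(elements):
        if matches[i] is not None or i == 0 or not element.isalpha():
            continue
        if element.isupper() != want_upper:
            continue
        word_index = matches[i - 1]
        if word_index is None:
            continue
        # Walk back to the first element matched with the same word.
        start = i - 1
        while start > 0 and matches[start - 1] == word_index:
            start -= 1
        compound = "".join(elements[start:i + 1]).lower()
        if words[word_index].lower().startswith(compound):
            matches[i] = word_index
    return matches


def scattered_lowercase(elements, words, matches):
    """Pass 6: look for an unmatched lowercase letter inside the word
    matched with the element immediately before it."""
    matches = list(matches)
    offset = 0  # characters of the current word already consumed
    for i, element in enumerate(elements):
        if matches[i] is not None:
            same_word = i > 0 and matches[i - 1] == matches[i]
            offset = offset + len(element) if same_word else len(element)
            continue
        if i == 0 or matches[i - 1] is None or not element.islower():
            continue
        word_index = matches[i - 1]
        position = words[word_index].lower().find(element, offset)
        if position != -1:
            matches[i] = word_index
            offset = position + 1
    return matches


# GluR: l and u join G to form the prefix "glu" of glutamate.
print(combine_prefix(list("GluR"), ["glutamate", "receptor"],
                     [0, None, None, 1], want_upper=False))
# VASP: A joins V to form the prefix "va" of vasodilator.
print(combine_prefix(list("VASP"), ["vasodilator", "stimulated", "phosphoprotein"],
                     [0, None, 1, 2], want_upper=True))
# TfR: f is found inside transferrin, the word matched with T.
print(scattered_lowercase(list("TfR"), ["transferrin", "receptor"], [0, None, 1]))
```

For AlstR the two functions chain naturally in this sketch: the prefix combination resolves l (prefix "al" of allatostatin), and the scattered-letter lookup then finds s and t further inside the same word.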
Relevance of bracket-enclosed words for broader matching criteria

The application of some of the aforementioned steps might sometimes not produce an optimal abbreviation resolution, typically due to their coarse-grained assumptions, which could lead to a number of false positives. This is actually the case for passes 5-6, whose "bold" matching criteria might end up incorrectly matching abbreviation elements with words within the search space, thus wrongly increasing the overall match ratio. That is why, at the present time, the passes with the broader matching criteria (like 5-6) are performed on one condition: the considered abbreviation must be enclosed in plain brackets, thus having a higher chance of being an actual abbreviation whose explanation can be found among its contiguous words. Results indicate that, currently, this is the most successful balance between using those criteria indiscriminately (thus increasing the number of wrong matches) and not using them at all (resulting in a loss of matching precision).

Correlation expressions

There can be cases where an explanation, or part of it, does not match any sub-elements of the corresponding abbreviation, because no syntactical bond can be traced between some or all of the abbreviation elements and the explanation terms. This is especially true when the explanation refers to another abbreviation, or more generally when correlation expressions like also known as, also called etc. are used to link theoretically uncorrelated words. When the main passes listed so far have failed to produce a match ratio ≥ 50, some semantic rules are applied in order to detect these correlations between abbreviations and their origin. At the present time, these rules are rather simple and basically check for the presence of a list of widespread linking expressions within the search space. 
If found, the match ratio undergoes an increase proportional to the proximity of the correlation expression to the abbreviation itself, with a maximum value of 51 (so that, in the worst case, it might still end up above the 50 threshold).

Domain-specific passes: periodic table elements and other biological pseudo-patterns

Aside from the general cases mentioned in the previous paragraph, there usually exist a number of noteworthy semantic correlations specific to the biological domain, with which the system can resolve part of otherwise unresolvable abbreviations. One of those is the correspondence between the elements of the periodic table and their respective symbols: for most metals, such a symbol is composed of an uppercase-lowercase letter pair (e.g. Cu = Copper), and is usually employed in biological papers to refer unambiguously to its associated element, with minimal risk of confusion with other terms. By taking advantage of these known correspondences, a specific pass checks for the presence of a periodic table symbol within the abbreviation being considered: if the corresponding periodic table element is detected in the search space, a semantic match is produced and the match ratio is thus increased. Similarly, some protein abbreviations contain recurring "patterns" that cannot be syntactically matched with any word in the search space. This is the case of suffixes having the form digit + p, generally referring to a variant of a known protein (e.g. yeast frataxin homolog (Yfh1p)), for which no explanation can be found within the text. In cases like these, treating the suffix as fully resolved (even though it syntactically matches no word) gives a ranking increase to otherwise correctly resolved abbreviations, which could otherwise have been ranked as low as a truly incompletely matched one. 
The boldness of this criterion, with its consequent ranking increment, places it in the same category as passes 5-6, and it is therefore subject to the bracket-enclosed check described earlier.

Proximity correction

Correctness of the matched terms can decrease when dealing with a large search space, usually resulting from very long abbreviations. In order to adjust the matching precision, a proximity correction is performed after the last syntactical pass. Basically, it tries to detect resolved elements whose matched word is "distant", in terms of position within the search space, from the word matched with the next element of the abbreviation. If such a "distant" element is found, the system looks for another word closer to the next matched word, and tries to match this word with the considered element, employing the criterion used in the first pass. If a match is established, the element's previously matched word is replaced with the newly found, nearer one. An average 30% increase in correctness has been detected for long abbreviations after applying this proximity correction. The match ratio is unaffected by this process.

Compound recurrence

Within a paper, it is frequent to come across partially unresolvable abbreviations whose unmatched, usually contiguous elements are however part of another abbreviation defined earlier in the text. This happens for instance with Hansenula polymorpha amine oxidase (HPAO), Escherichia coli amine oxidase (ECAO) and pea seedling amine oxidase (PSAO), where the recurring element AO may not have its explanation among the contiguous words, but may instead refer to a previously mentioned protein (as in CuAO, copper-containing amine oxidase), fully matched with its corresponding definition. Therefore the system checks whether a series of sub-elements (dubbed a "compound") of a perfectly resolved abbreviation (a "parent" abbreviation) is detected inside other, partially or poorly resolved abbreviations within a single article. 
If found, those compounds are updated by matching them with the resolved words already associated with the corresponding compound of the parent abbreviation, and the match ratio is updated accordingly. The potential ambiguity for abbreviations containing an exact compound of another uncorrelated abbreviation within the same paper is kept to a minimum by the requirement of a perfectly resolved parent abbreviation, along with the lower bound (currently set to 2) placed on the compound's length.

Results production

Finally, after all the aforementioned steps are over, results are produced in terms of a list of successfully matched SF-LF pairs featuring a match ratio above a set threshold. It must be underlined that presenting the potential abbreviation origin as the mere sequence of matched words does not always yield a fully correct explanation, since there might have been words in between that could not be explicitly matched via the steps performed. For instance, PACSINs, resolved as protein casein substrates in neurons, would be incomplete in comparison with its correct explanation protein kinase C and casein kinase 2 substrates in neurons as found in the original input text. That is why the final product of the resolution phase is the original sequence of words, literally reconstructed starting from the delimiting matched words of the abbreviation. This ensures greater accuracy for correctly resolved abbreviation-origin pairs (even though initially missing some in-between words), while it does not significantly alter the result for those pairs that are incorrectly resolved to begin with.

Phase 3: Protein Name Identification

The third and last phase of the process has the purpose of discriminating those resolved abbreviations that actually correspond to known protein names. 
In detail, a dictionary-based matching step is carried out, using as input the original sequence of words building up the explanation of a resolved abbreviation. A local copy of the UniProt database [21] is used as a source repository for all the known proteins, and an indexing step and a subsequent search step are performed in order to match the input sequence with one of the records within the database. The result is a list of candidate protein names, each with a certain score: those scoring higher are more likely to be the actual proteins appearing in the input sequence (a score of 100 means a perfect match). Finally, a refinement is performed by considering the string similarity between the input words and the candidate protein names, so that the score of those closer to the input is increased or maximized. Further details of this phase are described below.

Index building and search step

In order to effectively find a potential protein within an abbreviation explanation produced in Phase two, an index is built upon the data stored in a local copy of UniProt, to be used as a search basis for relevant matches. The open-source Apache Lucene library [22] is employed for this task. The index is built only once for any database release. For each UniProt record, the protein name "as-is" and its potentially expanded form in terms of its spelled-out elements, if any, are stored within the index. Thus, for example, in the case of p-21 activated kinase I, the expanded form "p-twenty one activated kinase first" is stored as well. The actual search is performed against the created index, in the form of two queries: one looks for a match between the input words and the index fields representing the protein names, and the other does the same between the spelled-out versions of the input words and the spelled-out version of each protein name. Case insensitivity, stop-word elimination, word stemming (i.e. 
bringing a word back to its lemma form), and other optimizations related to the number and position of the single words within the input text, are left to Lucene. This search mechanism will assign a score > 0 to those protein names matching (totally or partially) the input words; here a threshold is applied so that only the most relevant matches are returned as candidate protein names. It must be underlined that the two queries are executed as parallel searches, and as such there might be duplicates between their two result sets: such duplicates are then merged by adding up their respective score, so that they could climb some positions in the final ranking (if a candidate appears in both result sets, it is likely to be more relevant than others). Distance-based refinement At the end of the Lucene-based search step, a certain number of the results obtained are already as expected: for those abbreviation explanations that really referred to a protein, the first candidate protein name in the returned list should be the corresponding protein itself; for those coming from other domains, the candidate list should contain low-score values (usually at least < 60 in a normalized scale 0-100) or be empty altogether. However, for specific input texts, an unsatisfactory situation can occur. This is due to the fact that Lucene tends to assign the same score value to any protein name candidate that includes the same number of input words. For example, given the abbreviation GST, which after the Abbreviation Resolution phase has been correctly resolved as Glutathione S-transferase, the queries performed by Lucene produce a list of high-scoring protein name candidates, which are all the variants of this protein stored in UniProt (Glutathione Stransferase APIC, Glutathione S-transferase alpha-1 etc.). However, Lucene will assign the same score to all of the returned candidates containing the three input words Glutathione, S and transferase. 
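The duplicate merging and the scoring tie described in this section can be illustrated with a small, self-contained sketch. It is a toy model under stated assumptions, not the PRAISED implementation: the overlap count merely stands in for the search score (it is not Lucene's actual scoring formula), the token-level Jaccard similarity is one simple way to realise the string-similarity refinement mentioned at the start of this phase, and all function names and the "Protein A"/"Protein B" scores are hypothetical. In this toy setting every GST candidate receives the same overlap score, while only the exact name reaches a Jaccard similarity of 1.0.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lower-case a name and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def merge_results(plain_hits, spelled_hits):
    """Merge two {name: score} result sets, summing the scores of duplicates."""
    merged = defaultdict(float)
    for hits in (plain_hits, spelled_hits):
        for name, score in hits.items():
            merged[name] += score
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


def overlap_score(query, candidate):
    """Toy stand-in for the search score: number of query tokens matched."""
    return len(tokens(query) & tokens(candidate))


def jaccard_similarity(query, candidate):
    """Token-level Jaccard similarity: shared tokens over all distinct tokens."""
    a, b = tokens(query), tokens(candidate)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


query = "Glutathione S-transferase"
candidates = [
    "Glutathione S-transferase APIC",
    "Glutathione S-transferase alpha-1",
    "Glutathione S-transferase",
]

# Every candidate contains the three input words, so the toy score ties.
tie_scores = {name: overlap_score(query, name) for name in candidates}

# Re-ranking by token-level similarity puts the exact match first.
reranked = sorted(candidates, key=lambda name: -jaccard_similarity(query, name))

# Hypothetical scores from the two parallel queries (plain and spelled-out):
# "Protein B" appears in both result sets, so its scores are added up.
merged = merge_results({"Protein A": 40.0, "Protein B": 35.0},
                       {"Protein B": 30.0})
```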
Therefore, all of the candidates, including the perfect match (Glutathione S-transferase), will be returned with tied scores. As a consequence, a protein name exactly matching the input words may not necessarily appear first in the candidate list, ordered by decreasing score, as provided by Lucene. In order to overcome this problem, Lucene's output is post-processed using LingPipe's implementations of the weighted edit distance and the Jaccard distance [23], with which the system checks how similar the input words and the candidates returned by Lucene are. These two distance checks are aimed at two different families of cases, respectively: terms sharing the highest number of consecutive linguistic units (letters, digits, and so forth) and overall length, and terms having entire tokens in common, regardless of their individual position. Distance ranges have been established so that, for a distance value falling inside a given range, the score of the corresponding candidate protein name is adjusted accordingly. After the application of this technique, the correct protein name fails to appear in the first position in less than 3% of the cases.

Results

The PRAISED framework has been tested against four annotated corpora: the Medstract Gold Standard, the AB3P, the BioText and the Ao and Takagi (A&T) corpora. These corpora have been annotated by different research groups and consist of paper abstracts extracted from PubMed, featuring several SF-LF pairs mostly from the biomedical domain. More details are listed in Table 2. The results obtained were compared with three major abbreviation resolution systems, as mentioned earlier in RELATED WORK: specifically, Schwartz and Hearst (S&H) [6], ALICE [13] and BIOADI [12].
The study in [14] was initially used as a basis for comparing these three systems; the same experiments were subsequently repeated by the authors as well, yielding essentially the same results (with slight discrepancies in the number of annotated SF-LF pairs detected in the corpora, which did not alter the final results). For BIOADI and ALICE, their respective online tools were used for the experiments. For S&H, we used the implementation provided in [24].

Table 2. Details of the corpora used to test the PRAISED system

Corpus        MEDSTRACT   AB3P      BioText   A&T
Papers        400         1250      1000      993
SF-LF pairs   409         1221      956       1095
Words         ~78000      ~227000   ~244000   ~205000

In order to make this comparison possible, PRAISED was run only up to Phase 2, thus skipping the entity recognition phase. This was done because of the nature of the compared systems, which stop at expanding the detected abbreviations and do not discriminate them according to a specific context. Precision, recall and f-measure are defined as follows:

Precision = n. of SF-LF pairs resolved / n. of SF-LF pairs retrieved
Recall = n. of SF-LF pairs resolved / n. of SF-LF pairs total
F-measure = 2 × (Precision × Recall) / (Precision + Recall)

With these definitions, PRAISED obtained high recall, outperforming the other systems in most cases; the overall f-measure was promising, while the precision scores stayed below those of the competitors. These results are shown in Figures 2, 3 and 4 and, although interesting, they are by no means surprising. The high level of recall achieved is, in fact, due to the limited variety of abbreviation forms present within the scanned corpora, as already mentioned in RELATED WORK. Since the purpose of the PRAISED implementation was to identify and resolve many variations of SF-LF pairs within the full texts of scientific articles, this was a somewhat expected outcome.
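The precision, recall and f-measure definitions given above translate directly into code. The following is a minimal sketch: the function name is hypothetical, and the example counts are invented for illustration and are not taken from the PRAISED experiments.

```python
def precision_recall_f(resolved, retrieved, total):
    """Compute precision, recall and f-measure from SF-LF pair counts.

    resolved  -- number of SF-LF pairs correctly resolved
    retrieved -- number of SF-LF pairs returned by the system
    total     -- number of SF-LF pairs annotated in the corpus
    """
    precision = resolved / retrieved if retrieved else 0.0
    recall = resolved / total if total else 0.0
    denominator = precision + recall
    # The f-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Hypothetical counts: 80 pairs resolved, 100 retrieved, 160 annotated.
precision, recall, f_measure = precision_recall_f(80, 100, 160)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```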
Moreover, while processing the Medstract, AB3P, BioText and A&T corpora, an additional number of SF-LF pairs that had not been annotated at all was found and resolved (most of them correctly, see Table 3). Of the pairs that PRAISED could not detect and resolve, it must be stressed that less than 7% were actually protein abbreviations, and thus relevant to its ultimate purpose. PRAISED was then tested against biological full-text articles, using a manually annotated subset of PubMed papers as the input source for the abbreviation discovery process. The subset consists of 120 randomly chosen publications from the past twenty years (1990-2010), ranging in topic from structural biochemistry to molecular and cell biology. This subset is not particularly large, but it is significant in terms of the variety of SF-LF pairs featured in it. In this case, only protein abbreviations were selected for annotation. This was a considerable challenge, since the articles were processed as plain, unaltered text (keeping footnotes, figure captions and so on). The results obtained were 75.6% recall and 89.1% precision (f-measure: 82.9) for the whole discovery process (Phases 1, 2 and 3). As far as the protein identification phase (Phase 3) alone is concerned, the results showed 91.4% recall and 79.5% precision (f-measure: 85). Recall did not vary between the first and the second phase, and the precision of the first phase was the highest, since no abbreviation was incorrectly identified as such.

Figure 2. Comparison of precision values between PRAISED and the other methods, as resulting from the testing phase against four well-known annotated corpora.

Figure 3. Comparison of recall values between PRAISED and the other methods, as resulting from the testing phase against four well-known annotated corpora.

Figure 4. Comparison of f-measure values between PRAISED and the other methods, as resulting from the testing phase against four well-known annotated corpora.

Table 3. Additional SF-LF pairs resolved by PRAISED not originally annotated within the corpora

Corpus                   MEDSTRACT    AB3P       BioText     A&T
Additional SF-LF pairs   73           26         22          12
Correct                  69 (94.5%)   19 (73%)   22 (100%)   10 (83.3%)

For the sake of completeness, it must also be stated that, taking into account only those papers actually featuring an abbreviation list at the beginning (less than 30% of the subset collected), recall and precision were 84.4% and 81.6% respectively, in terms of the protein abbreviations appearing in their lists. The lower precision of Phase 3 can be explained by the fact that the full-text article corpus features a number of biomedical non-protein abbreviations, correctly resolved by Phase 2, which contain several terms appearing in many known protein names from the repository used (e.g. most chemical compounds, like "nitric oxide" appearing in the protein name "nitric oxide synthase"). As a consequence, during Phase 3, several of these explanations generated non-empty candidate lists of protein names, with candidates scoring higher than the threshold set (> 60). Refining the protein name identification in this regard will undoubtedly improve the correctness of the whole process. Regarding recall, instead, the major issues lie in the resolution phase, since the < 9% recall lost in Phase 3 is caused only by the presence of malformed or misspelled/contracted protein names: most of the remaining undetected SF-LF pairs were lost during Phase 2. As discussed earlier in the METHODS section, morphologically uncorrelated SF-LF pairs are encountered while processing many full-text papers. In the majority of these cases, the explanation could only be inferred by human insight, since it bore no apparent syntactical (or easy semantic) bond with its respective abbreviation.
For some others, hundreds of words separate the abbreviation from its explanation, thus escaping the contiguity-based approach implemented in PRAISED (including more sophisticated techniques like the compound recurrence). This underlines the need to enhance the resolution criteria employed, in order to resolve a greater number of the more intricately correlated abbreviations. A semantic approach, such as the use of an appropriate ontology (either dynamically constructed by the system or predefined), might help in tracing back and detecting more sophisticated relationships among the terms in scientific papers. Even so, human intervention will still be needed for very loosely bound SF-LF pairs, which indeed amounted to a substantial subset of those that could not be automatically resolved. An output excerpt of the discovery process can be found in Figure 5. In terms of performance, since the approach is light-weight, the execution time for the whole process is considerably short. On an i7 quad-core machine with a medium load from other tasks, an average of 150000 words per minute from the scientific texts used can be processed, resulting in an almost instantaneous processing of most individual papers. As an example, a 20-page paper consisting of 8000 words is processed in about 3 seconds. As already mentioned, scalability and customization have also been taken into great consideration: changing the combination of identification metrics or introducing new ones, modifying the resolution process by adding or removing passes and criteria, or fine-tuning the ranking assignment and threshold settings for other potential application domains (e.g. the medical domain) can all, in principle, be performed. The use of declarative, out-of-code rules to make the customization process even smoother and more productive has also been planned.

Figure 5. A partial screenshot of an execution output of the whole discovery process. For each resolved abbreviation, the system shows, from top to bottom and from left to right: the abbreviation in its pure form (e.g. DISC), the abbreviation as found in the input text (e.g. (DISC)), the Abbreviation Resolution match ratio (AR), the Abbreviation Identification rank (AI), the abbreviation explanation obtained by the PRAISED process (e.g. death inducing signalling complex), the words contiguous to the abbreviation (e.g. forming the death inducing signalling complex), and the candidate proteins (if any) with their corresponding scores in decreasing order, with the best candidate on top (e.g. 63.73 Death-inducing protein).

Let us point out some interesting elements in this output. The lower AI rank for ICE derives from the lack of enclosing brackets in the input text: this abbreviation is nevertheless correctly marked as such, but its rank is lower than that of abbreviations enclosed within brackets. DISC, resolved as death inducing signalling complex, could not be perfectly matched with any known protein in the repository used; the closest match (Death-inducing protein) scores slightly higher than the threshold of 60, and therefore appears as a candidate in the system output, even though it might not exactly correspond to the resolved protein abbreviation (here, the system also shows the other, below-threshold candidates under the highest-scoring one; if none of the candidates had scored over 60, the system would have just returned an empty list and a protein match failure message).
ICE, on the other hand, resolved as interleukin converting enzyme, is correctly matched with the protein Interleukin-1 beta-converting enzyme, although it does not score a perfect 100 because of the in-between terms missing from the explanation; it must also be underlined that, in this particular case, the explanation for ICE was not found among its contiguous words, but instead resulted from the application of more sophisticated criteria (like the compound recurrence and/or the semantic expressions).

Discussion

In this paper, a methodology for the automatic identification and resolution of protein abbreviations extracted from full-text biological papers has been proposed. This methodology is based on a light-weight, three-phase process meant to identify and resolve SF-LF pairs within a scientific article, and match them with their corresponding protein names, if any. Lexical checks and exclusion rules are employed in the identification phase, syntactical and semantic criteria are used in the resolution phase, and a dictionary-based approach is applied to match the abbreviation explanations to their potential protein names. The implementation of this process in the PRAISED framework has achieved significant results when applied to a number of known corpora and compared, up to the abbreviation resolution phase, with some of the most relevant approaches in this area. Recall was in most cases higher than that of the competitors (Figure 3), the f-measure was strong (Figure 4), and only a marginal share of the undetected SF-LF pairs were protein abbreviations. Moreover, PRAISED was able to resolve additional SF-LF pairs not originally annotated within the corpora, correctly in most cases (Table 3). The application of the whole process to the real testing ground of full-text papers predictably yielded lower performance values, but this is an anticipated outcome, given the highly unstructured domain approached and the challenging nature of the task itself.
Nevertheless, a considerable number of protein abbreviations were correctly identified, expanded and matched with their corresponding UniProt records. The ability of PRAISED to approach the problem of abbreviation identification and resolution on full-text articles is probably its most important feature, and the one that makes it relevant for the biomedical community. In fact, full-text papers represent the real source of complex information that researchers face in their everyday activity. Based on the empirical evidence obtained while digging into the disordered context of scattered and non-homogeneous abbreviations seen in scientific publications, as well as on the results of the testing stage of the discovery strategy, several areas of improvement can be identified. To name a few, the correlation criteria in the resolution phase can be extended and enhanced, in order to resolve more intricately correlated abbreviations. The application of a specific annotated ontology could also lead to the detection of complex interrelationships among lexical terms, provided that the system takes proper advantage of it by introducing a deeper semantic approach. Moreover, the accuracy of the protein name matching phase will have to be strengthened, by extending its post-processing step in order to increase its precision when dealing with non-protein abbreviation explanations. Furthermore, the described methodology is meant to be extended from the current paper-by-paper basis to a corpus-based approach, in order to deal with cross-references among scientific articles. Many abbreviations, in fact, are only cited but not explained within a given paper, and thus cannot be treated as resolvable SF-LF pairs, as no definition is explicitly given for them within the context of a single article.
Furthermore, there are situations (some of them preliminarily tackled via correlation expressions) where the explanation of an abbreviation includes other abbreviations as well, often not defined within that context. One such example is BRCT, whose expansion is BRCA1 carboxy terminal domain. Extending the PRAISED approach along these lines will make it possible to resolve these special cases as well and further improve the resolution potential of PRAISED, thus providing users with an improved human-computer interface for extracting protein information from the literature. However, it must be kept in mind that there is a significant number of cases in which the SF-LF pairs are probably simply unresolvable. For example, one abbreviation encountered in the analysis of the annotated corpus used in this study is ABC7/Atm1p, whose correct expansion, Membrane protein believed to be responsible for iron export from mitochondria, has no semantic relationship with the abbreviation itself. Finally, the PRAISED system will be made available shortly, along with the list of papers making up the annotated corpus used for its testing phase and the related human-extracted SF-LF pairs. For its release, a provenance layer is currently under development, so that it will be possible to keep track of the whole discovery process and the information flow behind it. This way, the scientific community will be able to assess the correctness of the abbreviation archive with greater ease and reliability: each resolved abbreviation stored in the archive could be traced back to the paper(s) from which it was derived, along with the processing steps employed for its resolution. In the meantime, a beta version is available upon request to the authors.

References

1. Pustejovsky J, Castaño J, Cochran B, Kotecki M, Morrell M, Rumshisky A. 2001. Automatic Extraction of Acronym-meaning Pairs from MEDLINE Databases. Medinfo 10. 371-375.
2. Taghva K, Gilbreth J. 1999. Recognizing acronyms and their definitions. International Journal on Document Analysis and Recognition 1. 191-198.
3. Yeates S. 1999. Automatic extraction of acronyms from text. In Third New Zealand Computer Science Research Students' Conference. 117-124.
4. Larkey L, Ogilvie P, Price A, Tamilio B. 2000. Acrophile: An Automated Acronym Extractor and Server. In Proceedings of the ACM Digital Libraries conference. 205-214.
5. Park Y, Byrd RJ. 2001. Hybrid Text Mining for Finding Abbreviations and Their Definitions. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing.
6. Schwartz A, Hearst M. 2003. A simple algorithm for identifying abbreviation definitions in biomedical texts. In Proceedings of the Pacific Symposium on Biocomputing (PSB).
7. Yu H, Hripcsak G, Friedman C. 2002. Mapping abbreviations to full forms in biomedical articles. J. Am. Med. Inform. Assoc. 9. 262-272.
8. Nadeau D, Turney PD. 2005. A Supervised Learning Approach to Acronym Identification. In 18th Conference of the Canadian Society for Computational Studies of Intelligence (Canadian AI 2005).
9. Chang JT, Schütze H, Altman RB. 2002. Creating an Online Dictionary of Abbreviations from MEDLINE. J. Am. Med. Inform. Assoc. 9. 612-620.
10. Yoshida M, Fukuda K, Takagi T. 2000. PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary. Bioinformatics 16. 169-175.
11. Fukuda K, Tsunoda T, Tamura A, Takagi T. 1998. Toward Information Extraction: Identifying protein names from biological papers. In Proceedings of the Pacific Symposium on Biocomputing. 705-716.
12. Kuo C, Ling MHT, Lin KT, Hsu CN. 2009. BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature. BMC Bioinformatics 10. p7.
13. Ao H, Takagi T. 2005. ALICE: an algorithm to extract abbreviations from MEDLINE. J. Am. Med. Inform. Assoc. 12. 576-586.
14. Gawlik M, Strable C. Comparison of abbreviation recognition algorithms. http://acm.mscs.mu.edu/wiki-reu/index.php/User:Mgawlik
15. Hersh W, Voorhees E. 2009. TREC genomics special issue overview. Information Retrieval 12. 1-15.
16. Atzeni P, Polticelli F, Toti D. 2011. An Automatic Identification and Resolution System for Protein-related Abbreviations in Scientific Papers. In EvoBio.
17. http://www.medstract.org/index.php?f=gold-standard
18. http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/
19. http://biotext.berkeley.edu/data.html
20. http://3.uvdb.dbcls.jp/ALICE/corpus download.html
21. http://www.uniprot.org/
22. http://lucene.apache.org
23. http://alias-i.com/lingpipe/
24. http://biotext.berkeley.edu/code/abbrev/ExtractAbbrev.java

APPENDIX A

Protein abbreviations list extracted from the full-text corpus of the manuscript by Toti et al., "Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework."

Note that multiple occurrences of the same abbreviation are due to the presence of the abbreviation in more than one paper of the full-text corpus, that a single abbreviation can be undefined in one paper and defined in others, and finally that it can be defined in different ways in different papers.

Abbreviation 4-MD-2 ABC7/Atm1p ABCA1 ABCG5 ABCG5 ABCG8 ABCG8 Abl Abl ACAT2 ActA ActA ActA ActA ActA AE33 AFP AGAO AHNAK Akt Akt Alb AlstR AlstR1 AP2 AP2 AP2 AP2 Apaf-1 APC apoA1 Definition undefined Membrane protein believed to be responsible for iron export from mitochondria ATP-binding cassette transporter Undefined undefined Undefined undefined Abelson tyrosine kinase undefined Undefined Undefined undefined undefined undefined undefined undefined alpha-fetoprotein A. Globiformis CuAO undefined Undefined undefined Plasma albumin Allatostatin receptor undefined Undefined undefined undefined undefined Apoptotic protease activating factor antigen presenting cell Undefined
Apo-AGAO apoE apoL-I APP APPs Arp2/3 ArsR Ash Aso1 ATP7A ATP7B ATPase Axcyt c' BACH1 Bem1 BmrR bRC BRCA1 BRCT BRCT2 Brn-2 BSA BSA BSAO BsHemAt BtuB C/EBP alpha C/EBP C/EBP C/EPB-ER C1R C1S CAII CAM-SLR CB1 Cbp1p c-cbl CCP CCS CD116/CD18 CD14 CD163 CD163 CD163 CD163 CD163 Apo A. Globiformis CuAO Undefined apolipoprotein L-I amyloid precursor protein Acute phase proteins undefined undefined undefined Polyamine oxidase from C. bidinii Menkes disease protein Wilson disease gene product Undefined Alcaligenes xylosodans cytochrome c' undefined undefined undefined Reaction center of photosynthetic purple bacteria undefined BRCA1 carboxy terminal domain undefined undefined undefined undefined Bovine serum amine oxidase Haem based aerotaxis Transducer sensor domain of B. subtilis GCS Vitamin B12 transporter protein undefined member of the C/EBP transcription factor family CCAT enhancer binding protein-epsilon Undefined undefined undefined Human carbonic anhydrase Carasius somatostatin-like receptor Central cannabinoid receptor heterotrimeric GTP-binding protein Corticosteroid binding protein Cellular homologue of Casitas B lineage lymphoma protooncogene product Complement control protein undefined Surface antigens undefined Macrophage cell surface receptor undefined undefined Scavenger receptor CD163 Scavenger receptor CD163 CD22 Surface antigen CD2BP2 CD2 binding protein CD36 Undefined CD44 Raft associated hyaluronate transporter CD6 undefined Cdc25 undefined Cdc42 undefined CED-3-like C. 
elegans protein-like ChEs Cholinesterases C-jun undefined CK Creatine kinase c-kit undefined CoaR undefined COMMD1/MURR1 undefined CooA undefined Cox Cytochrome c oxidase Cp Ceruloplasmin Cp Ceruloplasmin Cp Ceruloplasmin CP43 Undefined CP43 undefined CP43 undefined CP43 undefined CP47 Undefined CP47 undefined CP47 undefined Crk undefined CS Citrate synthase CS Citrate synthase Cu,Zn SOD Copper, zinc superoxide dismutase CuAOs Copper containing amine oxidases CueO Bacterial cupreous oxidase CueR undefined Cyt b559 Cytochrome b559 D1 undefined D1 undefined D1 undefined D2 undefined D2 undefined DAP Drostatin DCT1 Divalent cation transporter Dcytb Duodenal cytochrome b deoxyMb Undefined Diff undefined DISC Death inducing signalling complex Dlar undefined DMT1 undefined DMT1 undefined DMT1/DCT1/Nramp2 Divalent metal transporter DNA Pol V undefined Dorsal undefined DpsA undefined DT40 undefined DtxR Difteria toxin repressor ECAO E. coli CuAO EDH1 undefined EDH2 undefined EDH3 undefined EDH4 undefined EGFP undefined EHBP1 undefined EHD EPS15 homology domain eMAP undefined Ena Drosophila-Enabled protein Ena Enabled adapter protein Ena undefined Ena undefined Ena/VASP Enabled/vasodilator-stimulated phosphoprotein ENA/VASP undefined Endo-H Endoglycosidase Eps8 Epidermal growth factor receptor substrate ER Estrogen receptor (undefined?) 
ERK Extracellular signal-related kinase ERK1 undefined ERK2 undefined ERP60 undefined ERP72 undefined Ess1/Pin1 Peptidyl prolyl cis/trans isomerise EVH1 Ena/Vasp homology 1 EVH1 Enabled/Vasodilator stimulated phosphoprotein homology 1 EVH1 Enabled/VASP homology 1 EVH1 Enabled/VASP homology domain 1 EVH1/2 Ena/VASP homology 1/2 EVH2 undefined EVL Enabled/vasodilator-stimulated phosphoprotein-like protein EVl ENA/VASP like protein EVL Ena/VASP-like FAAH Fatty acid amide hydrolase FE65 undefined FEN2 Plasma membrane H+- pantothenate transporter Fet3 Undefined Fet3p Undefined Fet3p undefined Fet4 Undefined FixL undefined FMS1 Polyamine oxidase from S. cerevisiae Fpn FPR FPRL1 FRE1, 2 FRS2 FSH Ftr1 Ftr1p Fur Fyb/SLAP Fyb/SLAP Fyn Gamma-GCS Gamma-GTP Gap1 GAPDH GAPDH GAPDH GCN5 GCS G-CSF G-CSF GDNF Gel GFP GFP GH GHRH GHS-R GIRK1 GIRK1 Glu-C Glut4 GLUT-4 GM130 GNBP GNBP-1 GNBP-3 Gp340 GPCRs G-protein GPx GR Grb2 Grb2 GST GST GST Ferroportin Formyl peptide receptor FPR related lipoxin A4 receptor Ferrireductase Fibroblast growth factor receptor substrate 2 Follicle-stimulating hormone Membrane permease undefined Ferric uptake regulator Fyn-binding and SLP-76 associated protein T cell signalling Fyn binding protein/SLP-76-associated protein undefined Gamma-glutamyl cysteine synthetase Gamma-glutamyl transpeptidase GTPase-activating protein Glyceraldheyde-3-phosphate dehydrogenaseundefined undefined undefined Undefined Globin coupled sensor granulocyte colony stimulating factor (RESOLVED) granulocyte-colony-stimulating factor Glial cell line-derived neurotrophic factor Gelatinase Green fluorescent protein undefined Growth hormone Growth-hormone-releasing hormone G-protein coupled receptor undefined G protein-gated inwardly rectifying potassium channel undefined Insuline Responsive Glucose transporter Glucose transporter undefined Gram negative binding protein undefined undefined undefined G-protein coupled receptors GTP binding protein Glutathione peroxidise Glutathione 
reductase Growth factor receptor-bound 2 undefined Glutathione transferase Glutathione transferase Gluthatione transferase GTPase undefined GTPase undefined GTPase undefined HasA undefined Hb Hemoglobin Hb Hemoglobin Hb Hemoglobin Hb Hemoglobin Hb Hemoglobin Hb undefined Hck undefined HCS70 undefined HCS73 undefined HDL High density lipoprotein HFE undefined HFE Undefined HIV Tat undefined HIV-1 RT HIV-1 reverse transcriptase HIV-RT HIV reverse transcriptase HLA-H Undefined HMG CoA reductase Undefined HMG-CoA reductase undefined HMGCR HMG-CoA reductase HMGCR undefined HO-1 Heme oxygenase Holo-AGAO holo A. Globiformis CuAO Homer undefined Homer undefined Hp 1 Undefined Hp 1 Variant of the Hp gene Hp 1-1 major phenotype of Hp Hp 1-1 Undefined Hp 2 Undefined Hp 2 Variant of the Hp gene Hp 2-1 major phenotype of Hp Hp 2-1 Undefined Hp 2-2 major phenotype of Hp Hp 2-2 Undefined Hp Haptoglobin Hp Haptoglobin Hp Haptoglobin Hp Haptoglobin Hp Haptoglobin HP Hephaestin Hp Human hephaestin Hp undefined Hp1 undefined Hp1-1 Major phenotypic form of haptoglobin Hp1-1 Hp2 Hp2-1 Hp2-1 Hp2-2 Hp2-2 HpA0 HPAO Hpr Hpr Hs HS7C HSF-1 HSP1 HSP10 Hsp16.3 Hsp26 HSP27 HSP60 HSP60 HSP60 HSP70 HSP70 HSP70 HSP70 HSP70 HSP90 HSP90 HSPA8 hTII hTII HVA ICE-like IdeR IgA1 proteases IgG IL-6 Il-6 iNOS IP3Rs IREF2 IREG1 IREG1 Ireg1 IRP1 IRP1 IRP2 IRP2 IRPs 1 and 2 undefined undefined Major phenotypic form of haptoglobin undefined Major phenotypic form of haptoglobin undefined undefined H. 
polymorpha CuAO Haptoglobin related protein Haptoglobin related protein Haemosiderin undefined Heat shock transcription factor 1 undefined undefined Undefined undefined undefined undefined undefined undefined Heat shock protein undefined undefined undefined undefined undefined undefined undefined Human topoisomerase II Human topoisomerase II High voltage activated Ca++ channels Interleukin-1 converting enzyme-like undefined Undefined Undefined ploinflammatory cytokine NO synthase Inositol-1,4,5-trisphosphate receptors undefined undefined undefined Ferroportin 1 Iron regulatory proteins undefined Iron regulatory proteins undefined Iron regulatory proteins IRSp53 IscA IscS IscU Itk JH/JHs KBD L Lamp 1 Lamp 2 Lamp LasR LasR-LBD LDL LDL LDLR LDLs LEKTI Lf LFA-1 LFA-1 LfN LH LPRs LPS LRP5 LRP6 LSD1 LuxI/LuxR LuxR LVA LXR Lyn Lys M MAE2 MAO A MAO B MAOs MAP2 MaPgb MAPK MAPK MAPKAPK2 MARCO Mb Mb(s) MbCO undefined undefined undefined undefined undefined Juvenile hormone(s) undefined undefined undefined undefined Lysosome associated membrane protein Undefined Undefined Low density lipoprotein Undefined LDL receptor Low density lipoproteins Undefined Lactoferrin lymphocyte function-associated antigen 1 undefined N-terminal half-molecule of human lactoferrin Luteinizing hormone Low-density lipoprotein receptor related proteins undefined undefined undefined Histone lysine specific demetilase Undefined Undefined Low voltage activated Ca++ channels Liver X receptor undefined Lysozime undefined Malonamidase undefined Monoamine oxidase B Monoamine oxidases Microtubule associated protein 2 M. 
Acetivorans protoglobin Mitogen activated protein kinase Mitogen-activated protein kinase undefined undefined Myoglobin Myoglobin(s) Undefined Mb-Xe MDC1 MENA Mena Mena Mena MerR MerR MetAPs MFT MFT mGluR mGluRs MHC MLE MntR MPO MPR MR MRE11 MT MT1 MTP MTP1 Mtp1 MTP1 Myc Myo32 Myo32 Myo32 NafY Nav 1.8 NBS1 Nck Nck NCP Nedd-4 NEX4 NFBD1 NF-E2 NF-kB NF-kB NF-kB NF-kB NF-KB NGAL NifS NifU Undefined undefined Mammalian Ena Mammalian enabled Mammalian enabled adapter protein (Ena) undefined undefined undefined Methionine aminopeptidases Mitochondrial iron importer Undefined Metabotropic glutamate receptor metabotropic glutamate receptors Undefined Muconate lactonizing enzyme undefined myeloperoxidase Mannose 6-phosphate receptor Mandalate racemase undefined metallothionein MT Metallothioneins undefined undefined Metal transporter protein Metal transporter protein Ferroportin 1 undefined undefined undefined undefined undefined Tetrodoxin resistant sodium channel undefined undefined undefined Non collagen protein undefined C. 
elegans annexin undefined undefined Undefined undefined undefined Nuclear Factor kB undefined specific granule protein undefined undefined NikR NiSOD NK N-Mena Nod1 Nod2 NPC1 NPC1 NPC1 NPC1 NPC1l1 NPC1L1 NPC1L1 NPC2 Npw38 NPY-like Nramp2 Nramp2 NS3 NUMB Ovo OxyR P130 cas P34 cdc2 P38 MAPK p38 P53BP2 PAOs PARP1 PbrR PEBP2/CBF Pgb PGLYRP-1 PGLYRP-2 PGLYRP-3 PGLYRPs PGRPI-alpha PGRPI-beta PGRPI-L PGRPI-LB PGRPI-LC PGRPI-LE PGRPI-S PGRPI-SA PGRP-L PGRP-LC PGRP-LE PGRPs Nickel uptake regulator undefined Neurokinin-like receptors undefined undefined undefined Niemann-Pick C1 Undefined undefined undefined Niemann-Pick C1-like 1 Nieman-Pick C1 like 1 intestinal sterol transporter Nieman-Pick C1-like 1 Undefined undefined Undefined undefined undefined undefined undefined ovotransferrin undefined undefined undefined undefined undefined p53 binding protein Polyamine oxidases undefined undefined undefined Protoglobin undefined N-acetylmuramoyl-L-alanine amidase undefined undefined undefined undefined undefined undefined undefined undefined undefined undefined Long PGRPs undefined undefined Peptidoglycan recognition proteins PGRPs PGRP-S PGRP-S1 PGRP-SA PGRP-SD PH PI3K PIKK Pin 1 PKC Plc gamma Plc gamma PmxB PNGase F PNGase-F PPAR PPLO PQBP-1 PRL Prrp PS PsbA PsbB PsbC PsbE PsbF PsbH PsbJ PsbK PsbN PsbO PsbU PsbV PsbV PsbZ PSI PSI PSI PSII PSII PSII PSII PSII PSII PSTPIP PTP1B Rab11 Rab11 Peptidoglycan recognition proteins Short PGRPs Drosophila PGRP Drosophila PGRPs Drosophila PGRP Pleckstrin homology domain Phosphatidyl inositol 3-kinase Phosphoinositide-3-kinase-related protein kinase undefined Protein kinase C undefined undefined Polymyxin B Undefined peptide N-glycosidase F Peroxisome proliferator activated receptor P. 
Pastoris CuAO undefined Prolactin proline-rich RNA-binding protein Photosystem PSAO Pea CuAO undefined undefined undefined undefined undefined Undefined undefined undefined undefined undefined undefined Cytochrome c550 undefined undefined Photosystem I Photosystem I Photosystem I Photosystem II Photosystem II Photosystem II Photosystem II Photosystem II Photosystem II undefined Protein tyrosine phosphatise 1B undefined undefined Rab11a Rab11Fip2 Rab4 Rab7 Rac Rad50 Raf Raf1 Ran Ran Ran RanBP1 Ras Rb21 Rccyt c' RET RhoA RT RTK RTK RXR RyRs SAA SCAP Scap SDH SdiA Sema6A-1 Sema6A-1 SERCA SFR1 SFR1 SFR1 sGC Shank SHP-2 SHSPs SM22 SMF1 Smf1 SmtB SOD SOD SOD SOD1 SOD2 Sos SoxR undefined undefined undefined undefined undefined undefined undefined undefined undefined undefined undefined undefined undefined Undefined Rhodobacter capsulatus cytochrome c' undefined undefined Reverse transcriptase undefined Receptor tyrosine kinase Retinoid X receptor Ryanodine receptors Serum amyloid A undefined undefined Succinate dehydrogenase Undefined semaphorin 6A-1 Semaphorin 6A-1 Undefined undefined undefined undefined Soluble Guanylate Cyclase undefined Src homology 2 domain containing protein tyrosine phosphatise 2 Small heat shock proteins undefined Yeast manganese transporter Undefined undefined Superoxide dismutase Superoxide dismutase Superoxide dismutase undefined undefined undefined undefined SP Spa(AIM) Spreads Spred Spred-3 Sprouty Sprouty2 Sprouty3 SR-A SR-AI SR-B SR-B1 Src Src SREBP-2 SST SSTR SSTR2 SSTR23 STAT3 TASK-1 TCR TCTP/HRF TESK1 Tf TfR TfR TfR1 TfR2 TIM TIP60 TLF1 TLR2 TLR4 TLRs TNF TNF TNF TNFalpha TNF- Toll tPA TraR TRC8 TRH TrkA TRPS TRPV5 Serine protease undefined Sprouty related proteins with an EVH1 domain sprouty-related protein with EVH1 domain undefined undefined undefined undefined Scavenger receptor undefined undefined Scavenger receptor class B type 1 undefined undefined Undefined Somatostatin Somatostatin receptor undefined undefined Signal transducer 
and activator of transcription 3 undefined T-cell receptor undefined Testis-specific protein kinase-1 Transferrin Transferrin receptor Undefined at the first occurrence but TfR1 and Tfr2 Transferrin receptor 1 Transferrin receptor 2 Triosephosphate isomerise Histone acetyltransferase Trypanosome lytic factor-1 undefined undefined Toll like receptors Undefined Tumor necrosis factor Tumor necrosis factor Tumor necrosis factor alpha Tumor necrosis factor- undefined Tissue plasminogen activator Undefined undefined Thyrotropin-releasing hormone undefined Tryptophan synthase undefined TRPV6 TsH Tsk VAP-1 VASP VASP VASP VASP VASP VESL Vesl Vesl VP2 WASP WASP WASP WASP WH1 WIP YAP Yes Yfh1p ZntR ZO-1 -GT -GT -Aga IVA -CTx GVIA -CT undefined Thyroid-stimulating hormone undefined Vascular adhesion protein-1 undefined Vasodilator stimulated phosphoprotein Vasodilator stimulated phosphoprotein Vasodilator-stimulated phosphoprotein vasodilator-stimulated phosphoprotein undefined undefined undefined undefined Wiskott-Aldrich syndrome Wiskott-Aldrich syndrome protein Wiskott-Aldrich syndrome protein Wiskott­Aldrich syndrome protein WASP homology 1 WASP interacting protein Yes associated protein undefined Frataxin homolog undefined undefined -Glutamil transpeptidase -Glutamil transpeptidase -Agatoxin IVA -Conotoxin GVIA -Conotoxin MVIIC APPENDIX B List of bibliographic references of the papers building up the full-text corpus of the manuscript by Toti et al., "Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework." 
Antioxidants & Redox Signaling 7, 2005, 964-972
Archives of Biochemistry and Biophysics 362 (1999) 67-78
Archives of Biochemistry and Biophysics 428 (2004) 22-31
Archives of Biochemistry and Biophysics 444 (2005) 15-26
Archives of Biochemistry and Biophysics 498 (2010) 83-88
Biochemical and Biophysical Research Communications 282, 904-909 (2001)
Biochemical and Biophysical Research Communications 303 (2003) 771-776
Biochemistry 1997, 36, 341-346
Biochemistry 1998, 37, 5394-5406
Biochemistry 2002, 41, 5963-5967
Biochemistry 2003, 42, 3464-3473
Biochemistry 2004, 43, 2829-2839
Biochemistry 2004, 43, 3289-3300
Biochemistry 2004, 43, 3979-3986
Biochemistry 2005, 44, 10914-10925
Biochemistry 2005, 44, 14725-14731
Biochemistry 2007, 46, 6097-6108
Biochimica et Biophysica Acta 1685 (2004) 8-13
Biochimica et Biophysica Acta 1757 (2006) 90-105
Biochimica et Biophysica Acta 1767 (2007) 79-87
Biochimica et Biophysica Acta 1791 (2009) 679-683
Biogerontology 3: 161-173, 2002
Biol. Chem. 383, 1667-1676, 2002
Bioorganic & Medicinal Chemistry 11 (2003) 21-29
Biophysical Chemistry 101-102 (2002) 145-153
Biophysical Journal 86 (2004) 3855-3862
Blood (2006) 108, 2946-2949
Blood (2006) 108, 353-361
BMC Biology 2007, 5:17
Cell 111, 733-745, 2002
Cell 123, 1213-1226, 2005
Cell 97, 471-480, 1999
Cell Metabolism 7, 508-519, 2008
Cell, 121, 1059-1069, 2005
Cell. Mol. Life Sci. 57 (2000) 1970-1977
Cell. Mol. Life Sci. 59 (2002) 1413-1427
Cellular Microbiology (2006) 8, 1059-1069
Current Biology 11 (2001) R399-R401
Current Enzyme Inhibition, 2005, 1, 85-95
Current Opinion in Cell Biology 2002, 14:88-103
Current Opinion in Structural Biology 2004, 14:447-453
Current Opinion in Structural Biology 2004, 14:765-774
EMBO 19 (2000) 5661-5671
EMBO reports 9 (2008) 157-163
Environ. Sci. Technol. 2005, 39, 5378-5384
Eur. J. Biochem. 264, 271-275, 1999
Experimental Cell Research 315 (2009) 119-126
Experimental Gerontology 39 (2004) 1475-1484
Expert Rev. Proteomics 1, (2004), 89-100
FASEB J. 14, 231-241 (2000)
FASEB J. 15, 1303-1305 (2001)
FEBS Journal 272 (2005) 1727-1738
FEBS Letters 499 (2001) 256-261
FEBS Letters 513 (2002) 45-52
FEBS Letters 564 (2004) 225-228
Inorg. Chem. 1998, 37, 4030-4039
Inorganica Chim Acta. 2005, 358, 2933-2942
International Journal of Biochemistry & Cell Biology 33 (2001) 940-959
J. Cell. Mol. Med. 8, 2004, 201-212
J. Med. Chem. 2006, 49, 3800-3808
J. Med. Chem. 2006, 49, 7754-7765
J. Mol. Biol. (2002) 317, 41-72
J. Mol. Biol. (2002) 324, 105-121
J. Mol. Biol. (2003) 328, 505-515
J. Mol. Biol. (2004) 338, 103-114
J. Mol. Biol. (2005) 347, 565-581
J. Mol. Biol. (2005) 350, 987-996
J. Mol. Biol. (2007) 371, 1038-1046
J. Neurochem. 67, 2155-2163 (1996)
J. Peptide Res., 2003, 61, 202-212
J. Phys. Chem. B 2004, 108, 12990-12998
J. Phys. Chem. B 2005, 109, 19929-19935
Journal of Biological Chemistry 271, 18379-18386, 1996
Journal of Biological Chemistry 275, 19906-19912, 2000
Journal of Biological Chemistry 275, 27940-27946, 2000
Journal of Biological Chemistry 277, 17209-17216, 2002
Journal of Biological Chemistry 277, 39937-39943, 2002
Journal of Biological Chemistry 279, 31842-31853, 2004
Journal of Biological Chemistry 279, 31873-31882, 2004
Journal of Biological Chemistry 281, 14241-14249, 2006
Journal of Biological Chemistry 281, 36477-36481, 2006
Journal of Biological Chemistry 282, 1072-1079, 2007
Journal of Biological Chemistry 282, 13592-13600, 2007
Journal of Cell Science 117, 2631-2639, 2004
Journal of Experimental Biology 203, 841-856 (2000)
Journal of Lipid Research 50, 2009, 1653-1662
Journal of Molecular Graphics and Modelling 19, 146-149, 2001
Journal of Neurochemistry, 2003, 85, 610-621
Mitochondrion 10 (2010) 83-93
Mol. Biol. Evol. 18(2):120-131. 2001
Mol. BioSyst., 2005, 1, 79-84
Molecular Biology of the Cell 17, 163-177, 2006
Nature 402 (1999) 656-660
Nature 409 (2001) 198-201
Nature 438 (2005) 1040-1044
Nature 450 (2007) 1201-1206
Nature 454 (2008) 1123-1127
Nature 454 (2008) 1127-1132
Nature Structural and Molecular Biology 12 (2005) 582-588
Nature, 389 (1997) 753-758
Neurobiology of Aging 21 (2000) 455-462
Neurology 2004;63:1912-1917
Nucleic Acids Research, 2004, 32, D129-D133
Photosynthesis Research (2005) 84: 153-159
Photosynthesis Research 77: 35-43, 2003
Physiol Rev 84:41-68, 2004
PNAS, 1999, 96, 2042-2047
PNAS, 2001, 98, 7760-7764
PNAS, 2002, 99, 1264-1269
PNAS, 2002, 99, 3505-3510
PNAS, 2003, 100, 9750-9755
PNAS, 2005, 102, 15459-15464
PNAS, 2005, 102, 8955-8960
PNAS, 2006, 103, 12999-13003
PNAS, 2006, 103, 1810-1815
Proteomics 2003, 3, 1154-1161
Science 298 (2002) 1793-1796
Science 303, 1831-1838 (2004)
Science, 286, 1999, 304-306
Toxicon 42 (2003) 391-398

Bio-Algorithms and Med-Systems de Gruyter

Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework


Publisher
de Gruyter
Copyright
Copyright © 2012 by the
ISSN
1895-9091
eISSN
1896-530X
DOI
10.2478/bams-2012-0002

Abstract

This paper describes a methodology for discovering and resolving protein name abbreviations from the full-text versions of scientific articles, implemented in the PRAISED framework with the ultimate purpose of building up a publicly available abbreviation repository. Three processing steps lie at the core of the framework: i) an abbreviation identification phase, carried out via domain-independent metrics, whose purpose is to identify all possible abbreviations within a scientific text; ii) an abbreviation resolution phase, which takes into account a number of syntactical and semantic criteria in order to match an abbreviation with its potential explanation; and iii) a dictionary-based protein name identification, which is meant to select only those abbreviations belonging to the protein science domain. A local copy of the UniProt database is used as a source repository for all the known proteins. The PRAISED implementation has been tested against several known annotated corpora, such as the Medstract Gold Standard Corpus, the AB3P Corpus, the BioText Corpus and the Ao and Takagi Corpus, obtaining significantly high levels of recall and extremely fast performance, while also keeping promising levels of precision and overall f-measure, in comparison to the most relevant similar methods. This comparison has been carried out up to Phase 2, since those methods stop at expanding abbreviations, without performing any entity recognition. Instead, the entity recognition performed in the last phase provides PRAISED with an effective strategy for protein discovery, thus moving further from existing context-free techniques. Furthermore, this implementation also addresses the complexity of full-text papers, instead of the simpler abstracts more generally used. As such, the whole PRAISED process (Phases 1, 2 and 3) has also been tested against a manually annotated subset of full-text papers retrieved from the PubMed repository, with significant results as well.

KEYWORDS: proteins, abbreviations, data mining, extraction, resolution.

Corresponding author: Prof. Fabio Polticelli, Department of Biology, University of Roma Tre, Viale G. Marconi 446, 00146 Rome, Italy. E-mail: polticel@uniroma3.it. Phone: +39-06-57336362. Fax: +39-06-57336321.

Introduction

Abbreviations, in the form of acronyms, aliases or simply short versions of longer names, are commonly scattered throughout scientific publications. Researchers constantly have to deal with abbreviations both when reading and when writing a paper, either to understand and trace back their meaning or to come up with new ones to be included in their next publication. In the biomedical community, consistent nomenclature of proteins and their corresponding abbreviations is of utmost importance for knowledge dissemination and for gene/protein sequence database searching and retrieval. However, there are no generally accepted rules for naming novel proteins and abbreviating the corresponding names, and writing guidelines or suggested best practices are often ignored. Indeed, there are many ambiguities, with different proteins with similar names sharing the same abbreviation. Therefore, an additional layer of complexity has been placed on top of the already unstructured data of which any plain text is usually made. This is a major problem both in the biomedical literature and in sequence databases, generating confusion and errors. Furthermore, genomic initiatives have led to the discovery of a tremendous number of novel proteins and, as a consequence, to an explosion of protein name abbreviations whose correct resolution requires a clean and effective strategy. Tackling this problem is no easy feat, given the data deluge itself and the inner complexity of the biological domain, which features thousands of often ambiguous short names and acronyms for proteins as well as for small molecules, chemicals and so forth.
Further complexity is represented by the increasing use of protein names made up of numbers and letters which have little or no reference to the actual structural and functional class of the protein itself (e.g. p53BP2, p53 binding protein 2). Generalist resources such as the web portals Abbreviations.com or Acronymfinder.com, put to the test with some common protein abbreviations, prove unable to address the complexity and the chaotic character of protein abbreviations, and thus more complex approaches are needed to solve this problem. In addition, static resources are inadequate for a field such as protein science, in which the number of known proteins increases almost exponentially with time.

In this area, several research groups have proposed methodologies for discovering acronyms within a source text, ranging from general approaches to more specific techniques. These include the use of regular expressions [1], linguistic cues and pattern-based recognition [2,3,4,5,6,7], as well as machine learning algorithms, natural language processing and mixed methods [8,9,10]. Earlier approaches [2,3,4] limited the identification patterns to all-uppercase words or words with at least one uppercase letter, and the resolution patterns to strictly adjacent words of which only the initial letters could participate in the abbreviation matching. Others restricted acronyms to parenthesis-enclosed words [1,6], and placed limits on capital letters and word length [5]. The well-known algorithm proposed in [6] achieved 96% precision and 82% recall on the Medstract Gold Standard Corpus. [5] reached 98% precision and 95% recall on a very small testing set. [9] focused on single words between brackets, using dynamic programming to check for an abbreviation explanation to the left of the bracket-enclosed word, scoring 80% precision and 83% recall on the Medstract corpus.
Some subsequent proposals [7,10] shifted their attention more specifically towards the biomedical world. In detail, [7] used pattern-matching rules based on the correspondence between the initial characters of contiguous words and the abbreviation letters, and obtained an average 95% precision and 70% recall on a small set of biomedical papers. [10], instead, took advantage of the method proposed in [11] to first identify proteins within a text and then, assuming the identified protein names were all correct, to map those names to their corresponding abbreviations. They claimed 98% precision and 96% recall on biomedical abstracts, although under the assumption that the previous protein name identification step was correct.

More recently, [12] tried to build an abbreviation repository by using a machine learning approach to extract and resolve short form-long form (called SF-LF from now on) pairs. The tool they developed (which is still available online, unlike most of the systems just mentioned) nevertheless focused on paper abstracts, and built its abbreviation archive accordingly. Its resolution rules fall short when a complex full-text paper is provided as input, showing less accurate results. Furthermore, they stopped at matching abbreviations with their corresponding explanations, without trying to resolve the matched explanations as entities of the biological domain. A similar tool, based instead on custom patterns, is mentioned in [13] and is currently available as well. The study in [14], although unpublished at the time of this writing, accurately compared the resolution methods of [6], [12] and [13], the major currently available approaches to the abbreviation resolution problem, on the BioText, AB3P and A&T corpora. This study will be used later in the paper as a basis for comparing the results obtained.
Finally, some broader-scoped scientific initiatives in the biomedical domain (like the effort made in [15]) are worth mentioning, embracing a substantial number of information retrieval problems, like text categorization, question-answering and the automatic search for logical/conceptual relationships among biomedical concepts. In comparison, the approach proposed in this paper concentrates on a more specific information extraction task, so that it might bring some order to the chaotic domain of protein abbreviations.

To sum up, most of the existing approaches tackling a similar problem, as pointed out so far, impose strong constraints upon the candidate abbreviations, or employ fixed recognition patterns when trying to discover their origin within the text, thus severely limiting recall. As a consequence, their scope is usually restricted to abstracts only, where a narrower variety of abbreviation forms can be detected. Also, performance in terms of execution time is seldom mentioned, if at all, and the overall results are not easily comparable, since modified corpora are often used to test the resolution algorithms. The major abbreviation resolution approaches, as said earlier, also stop at the abbreviation expansion and do not match their explanations with biomedical entities.

This paper describes a lightweight strategy for identifying and resolving protein abbreviations found in the full text of biological papers, implemented in a framework for the Identification, Disambiguation and StoragE of Protein-Related Abbreviations (PRAISED).
The PRAISED execution consists of a three-phase process where (i) candidate abbreviations are detected within a full text, based on lexical clues and exclusion rules (Abbreviation Identification); (ii) abbreviations (SF) are matched with their potential explanation (LF), using syntactical and semantic criteria combined with fitting optimization techniques (Abbreviation Resolution); and (iii) the resulting SF-LF pairs are sorted out according to the domain of interest (Protein Name Identification). A preliminary outline of the first two phases of this process has been previously introduced in poster paper form in [16]. In this regard, it is important to underline that the PRAISED discovery process has been designed for and built upon the biological domain; even so, as far as the abbreviation identification and resolution steps are concerned, it features strategies which might in principle be applied and extended to other domains of interest, such as medical abbreviations. In fact, only minimal domain-specific tuning is performed during the first two phases, and most of the criteria used are not strictly tied to the considered domain.

The PRAISED framework (up to Phase 2) has been tested against four annotated corpora: the Medstract Gold Standard [17], the AB3P [18], the BioText [19] and the A&T Corpus [20], obtaining results comparable to other available abbreviation resolution methods, despite the computationally lightweight character of the PRAISED implementation. Specifically, very high levels of recall have been observed, often higher than those of similar approaches, along with promising levels of precision and consequently of f-measure, the latter being a fundamental parameter for assessing the quality of information extraction systems. The whole PRAISED execution has been tested against a manually annotated set of randomly chosen full-text articles, where it clearly showed all of its resolution potential.
From this viewpoint, the uniqueness of PRAISED lies in its ability to approach the problem of abbreviation identification and resolution on full-text articles. In fact, PRAISED is able to achieve results on full-text papers comparable to those obtained with other tools on the much simpler problem of paper abstracts. This is not a matter of a mere difference in length: full-text papers display a far higher level of complexity than their respective abstracts, in terms of the disparate forms of the protein names and abbreviations featured in them.

Methods

As already reported in the Introduction section, the PRAISED execution consists of a three-phase process. These phases, along with an overview of the system architecture, are depicted in Figure 1. In the first phase (Abbreviation Identification) candidate abbreviations are detected within a full text; in the second phase (Abbreviation Resolution) abbreviations (SF) are matched with their potential explanation (LF); in the third phase (Protein Name Identification) the resulting SF-LF pairs are sorted out according to the domain of interest. The methodological details of each phase are described below.

Figure 1. System architecture of the PRAISED framework and high-level overview of the whole abbreviation discovery process.

Phase 1: Abbreviation Identification

The first phase of the PRAISED process consists of identifying abbreviations within a text and is mainly based on a series of syntactical and lexical checks. Overall, this phase involves a tokenization of the input text, a ranking assignment for each of the considered words and the selection of candidate abbreviations based on a score threshold.

Exclusion rules

A preliminary cleaning of the text is performed so that a certain set of skippable words is not passed to the actual ranking process.
This is done first by applying general-purpose rules, which are meant to remove stop-words (and, of, or etc.), a list of known, recurring non-acronym words (Fig., Table etc.), and words matching some known patterns to be excluded (like words containing no letters, or one-character-long terms). In addition, a collection of domain-oriented exclusion rules is also applied, so that the subset of words later passed to the resolution phase is even more accurate. Thus, amino acids (Tyr382, Asp383 etc.), ions (Fe2+ etc.) and nucleotides (GTX, A(G/A)(A/G/T) etc.) are filtered out, since they might otherwise be wrongly captured by the subsequent lexical checks because of their abbreviation-like form. These rules, specifically tuned for biological papers, can be swapped in and out, and other rules for different domains can be defined. After the application of the exclusion rules, the ranking process takes place, via the consecutive checks listed below.

Length check and decomposition of compound words

The first check is related to the word's length, in order to establish whether the considered word is reasonably short. If that is the case, the rank is increased accordingly, since an abbreviation is very likely to be short. Here, the presence of special suffixes and compound words, especially those delimited by linguistic elements like slashes, is checked as well. If suffixes are found (the most notable being "like"), they are removed and the remaining word is ranked according to the subsequent metrics. If the system is dealing with a composite abbreviation (e.g. LysoPAFAT/LPCAT2), the latter is split accordingly and the single words composing it are individually ranked; the final rank is the average of the individual ones.

Plain bracket check

The second check tests a word for the presence of plain brackets, either left, right or both.
The rank is increased upon successful discovery of parentheses, proportionally to their number (a word perfectly enclosed within brackets is very likely to be an abbreviation), and the fully or partially bracket-enclosed words are cleaned of brackets and passed to the final ranking stage.

Multiple lexical checks

The final ranking step consists of a composition of syntactical metrics based on linguistic elements, and is made up of the intertwined checks (where a single word can pass several checks) listed in Table 1. Those words whose rank passes the threshold of 6 after all the aforementioned checks are selected and passed to Phase 2. Along with each candidate abbreviation, a set of contiguous words is stored, to be subsequently used in the resolution phase as the search space for the potential abbreviation origin. The size s of each set of words (either left or right with respect to the candidate abbreviation) depends on the candidate abbreviation's length, resulting in s = n + k, where n is the abbreviation's length and k is a configurable factor. Manual testing and tuning have resulted in the current value of 2 for this parameter.

Phase 2: Abbreviation Resolution

The second phase of the process is responsible for matching a candidate abbreviation with its potential explanation among the contiguous words previously stored.

Pre-processing: detection of the abbreviation's building elements

Before the abbreviation resolution passes, a pre-processing step is performed, where the considered abbreviation is split into its basic sub-elements, roughly corresponding to each of its characters. In fact, letters are individually split but contiguous digits are treated as a single unit. For instance, the elements resulting from Cyp33 will be C, y, p and 33.

Table 1. Syntactical metrics checks used for the final ranking step of the abbreviation identification phase (Phase 1). UL, uppercase letters; LL, lowercase letters.
Syntactical check | Rank increase
all UL | +4
all UL + at most a trailing lowercase s | +3
all UL + numbers with at least a letter | +4
all UL + dashes or underscores with at least a letter | +1
more UL than dashes or underscores (if the previous check succeeds) | +2
all UL or numbers or dashes or underscores with at least a letter | +2
more UL than numbers or dashes or underscores (if the previous check succeeds) | +2
some LL | +0
more UL than LL (if the previous check succeeds) | +3
initial UL (if "some LL" check succeeds) | +1
some numbers (if the previous check succeeds) | +3

Examples, in table order: ABCD; ABCDs; 1ABCD, A123, 1A32; A{ -ABC; 1A{ -ABC; 1-2; abcdA; aBCDEf); Abcde; Ab12cd3

The purpose of the subsequent resolution phase is to match each sub-element with a term among the contiguous words the system has stored: the resulting match ratio mr is therefore computed as mr = (Em / E) * 100, where Em is the number of matched elements and E is the total number of elements. The search space within which these matches are looked for is first the set of words preceding the abbreviation, properly tokenized by dashes or other relevant connectors. Empirical tests have shown that the likelihood of an abbreviation being explained is greater among the words that immediately precede it, while it is lower among the words that immediately follow it. If mr does not pass the threshold of 50 at the end of all the passes, the whole resolution process is repeated by taking the words that follow the abbreviation as the new search space. In this latter case, the threshold for a successful resolution is set at 50 as well.

First pass: matching initial characters

The actual resolution process begins by scanning the various elements of the candidate abbreviation to check whether there are terms starting with those elements within the search space.
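The pre-processing step, the match ratio and the first pass described so far can be illustrated with a minimal Python sketch. It is not the PRAISED implementation: the function names are hypothetical, punctuation inside the abbreviation is simply dropped, and elements are assumed to be matched greedily from left to right, with each word used at most once in this pass.

```python
import re


def split_elements(abbreviation):
    """Split an abbreviation into single letters and runs of digits.

    Assumption: punctuation such as dashes is dropped, not kept as an element.
    """
    return re.findall(r"\d+|[^\W\d_]", abbreviation)


def tokenize(search_space):
    """Split the stored neighbouring text into words on whitespace and dashes."""
    return [word for word in re.split(r"[\s\-]+", search_space) if word]


def first_pass(elements, words):
    """Map element index -> word index when the word starts with the element.

    Assumption: greedy left-to-right matching, one word per element.
    """
    matches = {}
    next_word = 0
    for i, element in enumerate(elements):
        for j in range(next_word, len(words)):
            if words[j].lower().startswith(element.lower()):
                matches[i] = j
                next_word = j + 1
                break
    return matches


def match_ratio(matches, elements):
    """mr = (Em / E) * 100, with Em matched elements out of E in total."""
    if not elements:
        return 0.0
    return len(matches) / len(elements) * 100


elements = split_elements("VASP")
words = tokenize("vasodilator-stimulated phosphoprotein")
matches = first_pass(elements, words)
print(elements, matches, match_ratio(matches, elements))
```

Under these assumptions, VASP leaves its element A unmatched after the first pass (three of four elements resolved), which is the situation the later passes are meant to handle.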
The seeming simplicity of this step may hide its effectiveness, as most "standard" abbreviations, from a wide range of different domains, fall into the category "A Beautiful Concept (ABC)," and are therefore correctly resolved right after this check.

Second pass: checking for trailing "s"

Many abbreviations are cited in scientific papers as plural nouns and are consequently explained as such. Dendritic epidermal T cells (DETCs), toll-like receptors (TLRs) and yeast artificial chromosomes (YACs) are all examples in this regard. The missing trailing lowercase "s", which the first pass obviously could not match with any initial character within the search space, is checked for by the second pass; if the last matched word is actually a plural noun (as cells, receptors and chromosomes are in the previous examples), the match ratio is updated accordingly.

Third pass: checking for spelled-out numbers

The explanation of an abbreviation can also contain spelled-out cardinal or ordinal numbers that might correspond to actual digits featured as elements of the abbreviation itself, in any number of positions (usually at the beginning or at the end of the explanation). Third plant homeodomain (PHD3) is a significant example of these particular cases. The third resolution pass is meant to check for the correspondence between digits as abbreviation sub-elements and their spelled-out version in the search space.

Fourth pass: combining lowercase letters

After the first three passes, there still might be syntactically unmatched elements of the considered abbreviation. An interesting subset of abbreviations is structured in such a way that a multi-letter prefix is used instead of a single initial character. This is the case of abbreviations like glutamate receptor (GluR), lidocaine (Lid) or murine leukemia virus (MuLV).
The fourth pass tries to find unresolved elements of the abbreviation that are actually lowercase letters, and checks whether they can be combined with previously resolved elements (usually uppercase letters) to form the prefix of some word within the search space, generally already matched with another abbreviation element. As in one of the examples shown above, the unresolved elements l and u of (GluR) will be combined with the already resolved G, and as a compound will be associated with glutamate, to which the element G had already been matched. Elements resolved in this fashion will be deemed matched as well, increasing the overall match ratio as a consequence.

Fifth pass: combining uppercase letters

This pass is similar to the previous one, for it takes into consideration unmatched elements of the considered abbreviation which might be combined with other, previously resolved elements in order to form the prefix of a matched word. The difference lies in its focus on unresolved uppercase letters. For instance, in vasodilator-stimulated phosphoprotein (VASP), the unmatched element A will be combined with its previous element V and matched with vasodilator, with which the element V had already been associated. The match ratio will be increased accordingly.

Sixth pass: scattered lowercase letters

This pass checks whether there are some unresolved lowercase letters of the considered abbreviation which might be found within the word matched with their previously resolved element. Relevant examples are transferrin receptor (TfR) and allatostatin receptor (AlstR), whose unmatched lowercase letters in the middle (f for the former and s, t for the latter) can actually be found within the matched word of the resolved element placed immediately before them. As in the passes before, the match ratio will be increased accordingly if such correspondences are detected.
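The second, fourth, fifth and sixth passes can be sketched in Python as follows. This is an illustrative reading of the descriptions above rather than the PRAISED code: the function names are hypothetical, the result of the first pass is assumed to be given as a dictionary mapping element indices to word indices, a word is treated as plural simply when it ends in "s", and the fourth and fifth passes are folded into one function selected by a flag.

```python
def plural_pass(elements, words, matches):
    """Second pass: resolve a trailing lowercase "s" against a plural last word.

    Assumption: a word counts as plural simply when it ends in "s".
    """
    matches = dict(matches)
    last = len(elements) - 1
    if last < 1 or last in matches or elements[last] != "s" or not matches:
        return matches
    last_word = max(matches.values())
    if words[last_word].lower().endswith("s"):
        matches[last] = last_word
    return matches


def prefix_pass(elements, words, matches, want_upper):
    """Fourth (want_upper=False) and fifth (want_upper=True) passes.

    An unmatched letter is joined with the resolved elements just before it;
    if the joined string is a prefix of their matched word, it is resolved.
    """
    matches = dict(matches)
    for i in range(1, len(elements)):
        element = elements[i]
        if i in matches or not element.isalpha() or element.isupper() != want_upper:
            continue
        if i - 1 not in matches:
            continue
        word = matches[i - 1]
        start = i - 1
        # Walk back over the elements already tied to the same word.
        while start - 1 in matches and matches[start - 1] == word:
            start -= 1
        prefix = "".join(elements[start:i + 1]).lower()
        if words[word].lower().startswith(prefix):
            matches[i] = word
    return matches


def scattered_pass(elements, words, matches):
    """Sixth pass: find unmatched lowercase letters inside the previous word."""
    matches = dict(matches)
    cursor = {}  # word index -> position after the last letter found there
    for i, element in enumerate(elements):
        if i in matches or not element.islower() or i - 1 not in matches:
            continue
        word = matches[i - 1]
        position = words[word].lower().find(element, cursor.get(word, 1))
        if position != -1:
            matches[i] = word
            cursor[word] = position + 1
    return matches


# G -> glutamate and R -> receptor are assumed to come from the first pass.
print(prefix_pass(list("GluR"), ["glutamate", "receptor"], {0: 0, 3: 1}, want_upper=False))
```

Keeping each pass as a function that takes and returns the same mapping makes it straightforward to chain the passes and to recompute the match ratio after each one.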
Relevance of bracket-enclosed words for broader matching criteria

The application of some of the aforementioned steps might sometimes not produce an optimal abbreviation resolution, typically due to their coarse-grained assumptions, which could lead to a number of incorrectly matching false positives. This is actually the case for passes 5-6, whose "bold" matching criteria might end up incorrectly matching abbreviation elements with words within the search space, thus wrongly increasing the overall match ratio. That is why, at the present time, the passes with the broader matching criteria (like 5-6) are performed on one condition: the considered abbreviation must be enclosed in plain brackets, thus having a higher chance of being an actual abbreviation whose explanation can be found among its contiguous words. Results indicate that, currently, this is the most successful balance between using those criteria indiscriminately (thus increasing the number of wrong matches) and not using them at all (resulting in a loss of matching precision).

Correlation expressions

There can be cases where an explanation, or part of it, does not match any sub-elements of the corresponding abbreviation, because no syntactical bond can be traced between some or all of the abbreviation elements and the explanation terms. This is especially true when the explanation refers to another abbreviation, or more generally when correlation expressions like "also known as", "also called" etc. are used to link theoretically uncorrelated words. When the main passes listed so far have failed to produce a match ratio that passes the threshold of 50, some semantic rules are applied in order to detect these correlations between abbreviations and their origin. At the present time, these rules are rather simple and basically take into consideration the presence of a list of potential widespread linking expressions within the search space.
If such an expression is found, the match ratio undergoes an increase proportional to the proximity of the correlation expression to the abbreviation itself, with a maximum value of 51 (so that, in the worst case, it might still end up above the 50 threshold).

Domain-specific passes: periodic table elements and other biological pseudo-patterns

Aside from the general cases mentioned in the previous paragraph, there exist a number of noteworthy semantic correlations specific to the biological domain, with which the system can resolve part of otherwise unresolvable abbreviations. One of those is the correspondence between the elements of the periodic table and their respective symbols: for most metals, such a symbol is composed of an uppercase-lowercase letter pair (e.g. Cu = Copper), and is usually employed in biological papers to refer unambiguously to its associated element, with minimal risk of confusion with other terms. Taking advantage of these known correspondences, a specific pass checks for the presence of a periodic table symbol within the abbreviation under consideration: if the corresponding periodic table element is detected in the search space, a semantic match is produced and the match ratio is increased. Similarly, some protein abbreviations contain recurring "patterns" that cannot be syntactically matched with any word in the search space. This is the case of suffixes having the form digit + p, generally referring to a variant of a known protein (e.g. yeast frataxin homolog (Yfh1p)), for which no explanation can be found within the text. In cases like these, forcing the suffix as fully resolved (even though it syntactically matches no word) gives a ranking increase to otherwise correctly resolved abbreviations, which could otherwise be ranked as low as a genuinely incomplete match.
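The two domain-specific checks just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the four-entry symbol table (a tiny subset of the periodic table) and the regular expression for the digit + p suffix are hypothetical choices, not details taken from PRAISED.

```python
import re
from typing import Dict, Optional

# Illustrative subset of periodic table symbols; a real pass would need the full table.
ELEMENT_SYMBOLS: Dict[str, str] = {
    "Cu": "copper",
    "Zn": "zinc",
    "Fe": "iron",
    "Mn": "manganese",
}

# Assumed encoding of the "digit + p" suffix (e.g. the "1p" of Yfh1p).
VARIANT_SUFFIX = re.compile(r"\dp$")


def periodic_symbol_matches(abbreviation: str, search_space: str) -> Dict[str, str]:
    """Return symbols found in the abbreviation whose element name occurs in the search space."""
    words = set(re.findall(r"[a-z]+", search_space.lower()))
    return {
        symbol: name
        for symbol, name in ELEMENT_SYMBOLS.items()
        if symbol in abbreviation and name in words
    }


def variant_suffix(abbreviation: str) -> Optional[str]:
    """Return the digit + p suffix if the abbreviation ends with one, else None."""
    found = VARIANT_SUFFIX.search(abbreviation)
    return found.group(0) if found else None


print(periodic_symbol_matches("CuAO", "copper-containing amine oxidase"))
print(variant_suffix("Yfh1p"))
```

A symbol is only reported when the element name is actually present among the neighbouring words, which mirrors the requirement that the periodic table element be detected in the search space before a semantic match is produced.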
The boldness of the suffix criterion, with its consequent ranking increment, places it in the same category as passes 5-6, and it is therefore subject to the bracket-enclosed check described earlier.

Proximity correction

The correctness of the matched terms can decrease when dealing with a large search space, which usually results from very long abbreviations. In order to adjust the matching precision, a proximity correction is performed after the last syntactical pass. It tries to detect resolved elements that are "distant" from the next element of the abbreviation, in terms of the position of their matched word within the search space. If such a "distant" element is found, the correction looks for another word that lies closer to the next matched word, and tries to match this word with the element under consideration, using the criterion of the first pass. If a match is established, the element's previously matched word is replaced with the newly found, nearer one. An average 30% increase in correctness has been detected for long abbreviations after applying this proximity correction. The match ratio is unaffected by this process.

Compound recurrence

Within a paper, it is common to come across partially unresolvable abbreviations whose unmatched, usually contiguous elements are part of another abbreviation defined earlier in the text. This happens for instance with Hansenula polymorpha amine oxidase (HPAO), Escherichia coli amine oxidase (ECAO) and pea seedling amine oxidase (PSAO), where the recurring element AO may not have its explanation among the contiguous words, but may instead refer to a previously mentioned protein (as in CuAO, copper-containing amine oxidase), fully matched with its corresponding definition. Therefore the system checks whether a series of sub-elements (dubbed a "compound") of a perfectly resolved abbreviation (a "parent" abbreviation) is detected inside other, partially or poorly resolved abbreviations within a single article.
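The compound check described above can be sketched as follows. The sketch is a simplified illustration, not the PRAISED code: the function name, the representation of a resolved parent abbreviation as (element, word) pairs, and the choice of returning the longest shared compound are assumptions. The default minimum length of 2 follows the lower bound that the paper reports for the compound's length.

```python
from typing import List, Optional, Tuple

Resolved = List[Tuple[str, str]]  # (abbreviation element, matched word) pairs


def find_compound(
    parent: Resolved, candidate: str, min_len: int = 2
) -> Optional[Tuple[str, List[str]]]:
    """Return the longest run of parent elements that also occurs in the candidate.

    The result pairs the shared compound with the words the parent had matched
    to it, or is None when no compound of at least min_len characters is shared.
    """
    best: Optional[Tuple[str, List[str]]] = None
    for start in range(len(parent)):
        for end in range(start + 1, len(parent) + 1):
            run = parent[start:end]
            compound = "".join(element for element, _ in run)
            if len(compound) < min_len or compound not in candidate:
                continue
            if best is None or len(compound) > len(best[0]):
                best = (compound, [word for _, word in run])
    return best


# CuAO is assumed to be perfectly resolved earlier in the same paper.
cuao: Resolved = [("Cu", "copper-containing"), ("A", "amine"), ("O", "oxidase")]
print(find_compound(cuao, "HPAO"))
```

Requiring a perfectly resolved parent and a minimum compound length keeps the check from linking abbreviations that merely share a single letter.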
If such compounds are found, they are updated by matching them with the resolved words already associated with the corresponding compound of the parent abbreviation. The match ratio is then updated accordingly. The potential ambiguity for abbreviations containing an exact compound of another, uncorrelated abbreviation within the same paper is kept to a minimum by the assumption of a perfectly resolved parent abbreviation, along with the lower bound (currently set to 2) placed on the compound's length.

Results production

Finally, after all the aforementioned steps are over, results are produced as a list of successfully matched SF-LF pairs featuring a match ratio above a set threshold. It must be underlined that presenting the potential abbreviation origin as the matched words lined up together does not always yield a 100% correctly retrieved explanation, since there might have been words in between that could not be explicitly matched by the steps performed. For instance, PACSINs, resolved as protein casein substrates in neurons, would be incomplete in comparison with its correct explanation, protein kinase C and casein kinase 2 substrates in neurons, as found in the original input text. That is why the final product of the resolution phase is the original sequence of words, literally reconstructed starting from the delimiting matched words of the abbreviation. This ensures greater accuracy for abbreviation-origin pairs that are correctly resolved (even though initially missing some in-between words), while it does not significantly alter the result for those pairs that are incorrectly resolved to begin with.

Phase 3: Protein Name Identification

The third and last phase of the process has the purpose of singling out those resolved abbreviations that actually correspond to known protein names.
In detail, a dictionary-based matching step is carried out, using as input the original sequence of words that make up the explanation of a resolved abbreviation. A local copy of the UniProt database [21] is used as a source repository for all the known proteins, and an indexing step and a subsequent search step are performed in order to match the input sequence with one of the records within the database. The result is a list of candidate protein names, each with a certain score: those scoring higher are more likely to be the actual proteins appearing in the input sequence (a score of 100 means a perfect match). Finally, a refinement is performed by considering the string similarity between the input words and the candidate protein names, so that the score of those closest to the input is increased or maximized. Further details of this phase are described below.

Index building and search step

In order to effectively find a potential protein within an abbreviation explanation produced in Phase 2, an index is built upon the data stored in a local copy of UniProt, to be used as a search basis for relevant matches. The open-source Apache Lucene library [22] is employed for this task. The index is built only once for any database release. For each UniProt record, the protein name "as-is" and its expanded form in terms of its spelled-out elements, if any, are stored within the index. Thus, for example, in the case of p-21 activated kinase I, the expanded form "p-twenty one activated kinase first" is stored as well. The actual search is performed against the created index, in the form of two queries: one looks for a match between the input words and the index fields representing the protein names, and the other does the same between the spelled-out versions of the input words and the spelled-out version of each protein name. Case insensitivity, stop-word elimination, word stemming (i.e. reducing a word to its base form), and other optimizations related to the number and position of the single words within the input text are left to Lucene. This search mechanism assigns a score > 0 to those protein names matching (totally or partially) the input words; here a threshold is applied so that only the most relevant matches are returned as candidate protein names. It must be underlined that the two queries are executed as parallel searches, and as such there might be duplicates between their two result sets: such duplicates are then merged by adding up their respective scores, so that they can climb some positions in the final ranking (if a candidate appears in both result sets, it is likely to be more relevant than others).

Distance-based refinement

At the end of the Lucene-based search step, a certain number of the results obtained are already as expected: for those abbreviation explanations that really refer to a protein, the first candidate protein name in the returned list should be the corresponding protein itself; for those coming from other domains, the candidate list should contain low scores (usually < 60 on a normalized 0-100 scale) or be empty altogether. However, for specific input texts, an unsatisfactory situation can occur. This is due to the fact that Lucene tends to assign the same score to any protein name candidate that includes the same number of input words. For example, given the abbreviation GST, which the Abbreviation Resolution phase has correctly resolved as Glutathione S-transferase, the queries performed by Lucene produce a list of high-scoring protein name candidates, which are all the variants of this protein stored in UniProt (Glutathione S-transferase APIC, Glutathione S-transferase alpha-1 etc.). However, Lucene will assign the same score to all of the returned candidates containing the three input words Glutathione, S and transferase.
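The merging of the two parallel result sets, and the tie it can leave behind, can be sketched as follows. This is an illustration only: the function name and the scores are hypothetical, and plain dictionaries stand in for the result sets of the two Lucene queries rather than calling Lucene itself. The scores are chosen so that two variants of the same protein end up tied after merging.

```python
from typing import Dict, List, Tuple


def merge_candidates(
    plain_hits: Dict[str, float], spelled_hits: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Merge two result sets, summing the scores of candidates present in both."""
    merged = dict(plain_hits)
    for name, score in spelled_hits.items():
        merged[name] = merged.get(name, 0.0) + score
    # Stable sort by decreasing score: tied candidates keep their arrival order.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for the query "Glutathione S-transferase".
plain = {
    "Glutathione S-transferase alpha-1": 50.0,
    "Glutathione S-transferase": 50.0,
    "Glutathione reductase": 20.0,
}
spelled = {
    "Glutathione S-transferase alpha-1": 30.0,
    "Glutathione S-transferase": 30.0,
}

for name, score in merge_candidates(plain, spelled):
    print(f"{score:6.1f}  {name}")
```

With these inputs the exact name and its alpha-1 variant receive the same merged score, so their relative order depends only on the order in which they were returned, not on how closely each one matches the input.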
Therefore, all of the candidates, including the perfect match (Glutathione S-transferase), will be returned with their scores in a tie. As a consequence, a protein name matching the input words exactly may not necessarily appear first in the candidate list, ordered by decreasing score, as provided by Lucene. In order to overcome this problem, Lucene's output is post-processed using LingPipe's implementation of the weighted edit distance and Jaccard distance [23], with which the system checks how similar the input words are to the candidates returned by Lucene. These two distance checks are meant to cover two different families of cases, respectively: terms sharing the highest number of consecutive linguistic units (letters, digits, and so forth) and overall length, and terms having entire tokens in common, regardless of their individual position. Distance ranges have been established so that, for a distance value falling inside a given range, the score of the candidate protein name under consideration is adjusted accordingly. After this technique is applied, the correct protein name fails to appear in first position in less than 3% of the cases.

Results

The PRAISED framework has been tested against four annotated corpora: the Medstract Gold Standard, the AB3P, the BioText and the A&T Corpus. These corpora have been annotated by different research groups and are made up of a number of paper abstracts extracted from PubMed, featuring several SF-LF pairs mostly from the biomedical world. More details are listed in Table 2. The results obtained were compared with three major abbreviation resolution systems, as mentioned earlier in RELATED WORK: specifically, Schwartz and Hearst [6], ALICE [13] and BIOADI [12].
The study in [14] was preliminarily used as a basis for comparing these three systems; nevertheless, the same experiments were subsequently performed by the authors as well, yielding essentially the same results (with slight discrepancies in the number of annotated SF-LF pairs detected in the corpora, which did not alter the final results). Regarding BIOADI and ALICE, their respective online tools were used for the experiments. As far as S&H is concerned, we used the implementation provided in [24].

Table 2. Details of the corpora used to test the PRAISED system

Corpus         MEDSTRACT   AB3P      BioText   A&T
Papers         400         1250      1000      993
SF-LF pairs    409         1221      956       1095
Words          ~78000      ~227000   ~244000   ~205000

In order for this comparison to take place, the PRAISED execution was tested up to Phase 2, therefore ignoring the entity recognition phase. This was done because of the nature of the compared systems, which likewise stop at expanding the detected abbreviations and do not discriminate them according to a determined context. Given the following definitions:

$$\text{Precision} = \frac{\text{n. of SF-LF pairs resolved}}{\text{n. of SF-LF pairs retrieved}}$$

$$\text{Recall} = \frac{\text{n. of SF-LF pairs resolved}}{\text{n. of SF-LF pairs total}}$$

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

significant results in terms of recall have been obtained, outperforming the other systems in most of the cases; the overall f-measure was promising, while the precision scores stayed below those of the competitors. These results are shown in Figures 2, 3 and 4, and although interesting, they are by no means surprising. The high level of recall achieved, in fact, is due to the presence within the scanned corpora of a limited variety of abbreviation forms, as already mentioned in RELATED WORK. Since the purpose of the PRAISED implementation was to identify and resolve many variations of SF-LF pairs within the full texts of scientific articles, this was a somewhat expected outcome.
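The three definitions given above can be computed as in the following Python sketch. The function names are illustrative and the counts in the example are hypothetical; they are not figures from the evaluation.

```python
def precision(resolved: int, retrieved: int) -> float:
    """Fraction of retrieved SF-LF pairs that were correctly resolved."""
    return resolved / retrieved if retrieved else 0.0


def recall(resolved: int, total: int) -> float:
    """Fraction of all annotated SF-LF pairs that were correctly resolved."""
    return resolved / total if total else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 80 pairs resolved, 100 retrieved, 160 annotated in total.
p = precision(80, 100)
r = recall(80, 160)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f_measure(p, r):.3f}")
```

Because the f-measure is a harmonic mean, it stays closer to the lower of the two values, which is why a system with high recall and lower precision can still show only a moderate f-measure.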
Moreover, while processing the Medstract, AB3P, BioText and A&T corpora, an additional number of SF-LF pairs that had not been annotated at all was found and resolved (most of them correctly, see Table 3). Of the pairs that PRAISED could not detect and resolve, it must be stressed that less than 7% were actually protein abbreviations, and thus relevant for its ultimate purpose. PRAISED was then tested against biological full-text articles, using a manually annotated subset of PubMed papers as the input source of the abbreviation discovery process. The subset consists of 120 randomly chosen publications from the past twenty years (1990-2010), ranging in topic from structural biochemistry to molecular and cell biology. This subset is not particularly large, but it is significant in terms of the disparate cases of SF-LF pairs featured in it. In this case, only protein abbreviations were selected for annotation. This was a considerable challenge, since the articles were processed in plain, unadulterated form (keeping footnotes, figure captions and so on). The results obtained were 75.6% recall and 89.1% precision (f-measure: 82.9) for the whole discovery process (Phases 1, 2 and 3). As far as the protein identification phase (Phase 3) was concerned, results showed 91.4% recall and 79.5% precision (f-measure: 85). Recall did not vary between the first and the second phase, and the precision of the first phase was the highest, since no abbreviation was incorrectly identified as such.

Figure 2. Comparison of precision values between PRAISED and the other methods, as resulting from the testing phase against four well-known annotated corpora.

Figure 3. Comparison of recall values between PRAISED and the other methods, as resulting from the testing phase against four well-known annotated corpora.

Figure 4. Comparison of f-measure values between PRAISED and the other methods, as resulting from the testing phase against four well-known annotated corpora.

Table 3. Additional SF-LF pairs resolved by PRAISED not originally annotated within the corpora

Corpus                    MEDSTRACT    AB3P       BioText     A&T
Additional SF-LF pairs    73           26         22          12
Correct                   69 (94.5%)   19 (73%)   22 (100%)   10 (83.3%)

For the sake of completeness, it must also be stated that, taking into account only those papers actually featuring an abbreviation list at the beginning (less than 30% of the subset collected), recall and precision were respectively 84.4% and 81.6% in terms of the protein abbreviations appearing in their lists. The lower precision of Phase 3 can be explained by the fact that the full-text article corpus features a number of biomedical non-protein abbreviations, correctly resolved by Phase 2, which contain several terms appearing in many known protein names from the repository used (e.g. most chemical compounds, like "nitric oxide" appearing in the protein name "nitric oxide synthase"). As a consequence, during Phase 3, several of these explanations generated non-empty candidate lists of protein names, with candidates scoring higher than the threshold set (> 60). Refining the protein name identification in this regard will undoubtedly improve the correctness of the whole process. Regarding recall, instead, the major issues lie in the resolution phase, since the < 9% recall lost in Phase 3 is only caused by the presence of malformed or misspelled/contracted protein names: most of the remaining undetected SF-LF pairs were lost during Phase 2. As discussed earlier in the METHODS section, morphologically uncorrelated SF-LF pairs are encountered while processing many full-text papers. In the majority of these cases, the explanation could only be inferred by human insight, since it bore no apparent syntactical (or easy semantic) bond with its respective abbreviation.
For some others, hundreds of words separate the abbreviation and its explanation, which thus escape the contiguity-based approach implemented in PRAISED (including more sophisticated techniques like the compound recurrence). This underlines the need to enhance the resolution criteria employed, in order to resolve a greater number of the more intricately correlated abbreviations. A semantic approach, like the use of an appropriate ontology (either dynamically constructed by the system or predefined), might help in tracing back and detecting more sophisticated relationships among the terms in scientific papers. Despite that, human intervention will still be needed for very loosely bound SF-LF pairs, which amounted to a substantial subset of those that could not be automatically resolved. An output excerpt of the discovery process can be found in Figure 5.

In terms of performance, since the approach is lightweight, the execution time of the whole process is short. On an i7 quad-core machine with a medium load from other tasks, an average of 150000 words per minute from the scientific texts used can be processed, so that most individual papers are processed almost instantaneously. As an example, a 20-page paper consisting of 8000 words is processed in about 3 seconds. As already mentioned, scalability and customization have also been taken into great consideration: changing the combination of identification metrics or introducing new ones, modifying the resolution process by adding or removing passes and criteria, or fine-tuning the ranking assignment and threshold settings for other potential application domains (e.g. the medical domain), can all in principle be performed. The use of declarative, out-of-code rules, meant to make the customization process even smoother and more productive, has also been planned.

Figure 5. A partial screenshot of an execution output of the whole discovery process. For each resolved abbreviation, the system shows, respectively from top to bottom and from left to right: the abbreviation in its pure form (e.g. DISC), the abbreviation as found in the input text (e.g. (DISC)), the Abbreviation Resolution match ratio (AR), the Abbreviation Identification rank (AI); the abbreviation explanation obtained by the PRAISED process (e.g. death inducing signalling complex), the words contiguous to the abbreviation (e.g. forming the death inducing signalling complex), and the candidate proteins (if any) with their corresponding score in decreasing order, with the best candidate on top (e.g. 63.73 Death-inducing protein).

Some interesting elements of this output are worth highlighting. The lower AI rank for ICE derives from the lack of enclosing brackets in the input text: this abbreviation is nevertheless correctly marked as such, but has a lower rank than those abbreviations enclosed within brackets. DISC, resolved as death inducing signalling complex, could not be perfectly matched with any known protein in the repository used; the closest match (Death-inducing protein) scores slightly higher than the 60 threshold, and therefore appears as a candidate in the system output, even though it might not exactly correspond to the resolved protein abbreviation (here, the system also shows the other, under-the-threshold candidates below the highest-scoring one; if none of the candidates had scored over 60, the system would have just returned an empty list and a protein match failure message).
ICE, on the other hand, resolved as interleukin converting enzyme, is correctly matched with the protein Interleukin-1 beta-converting enzyme, although it does not score a perfect 100 because of the in-between terms missing from the explanation. It must also be underlined that, in this particular case, the explanation for ICE was not found among its contiguous words, but resulted instead from the application of more sophisticated criteria (like the compound recurrence and/or the semantic expressions).

Discussion

In this paper a methodology for the automatic identification and resolution of protein abbreviations extracted from full-text biological papers has been proposed. The methodology is based on a lightweight, three-phase process meant to identify and resolve SF-LF pairs within a scientific article, and match them with their corresponding protein names, if any. Lexical checks and exclusion rules are employed in the identification phase, syntactical and semantic criteria are used for the resolution phase, and a dictionary-based approach is applied to match the abbreviation explanations to their potential protein names. The implementation of this process in the PRAISED framework has achieved significant results when applied to a number of known corpora and compared, up to the abbreviation resolution phase, with some of the most relevant approaches in this area. Recall was in most cases higher than that of the competitors (Figure 3), f-measure was strong (Figure 4) and only a marginal share of the undetected SF-LF pairs were protein abbreviations. Moreover, PRAISED was able to resolve, correctly in most cases (Table 3), additional SF-LF pairs not originally annotated within the corpora. The application of the whole process to the real testing ground of full-text papers predictably yielded lower performance values, but this is an anticipated outcome, given the highly unstructured domain approached and the challenging nature of the task itself.
Nevertheless, a considerable number of protein abbreviations were correctly identified, expanded and matched with their corresponding UniProt records. The ability of PRAISED to approach the problem of abbreviation identification and resolution on full-text articles is probably its most important feature and the one that makes it relevant for the biomedical community. In fact, full-text papers represent the real source of complex information that researchers face in their everyday activity. Based on the empirical evidence obtained while digging into the disordered context of scattered and non-homogeneous abbreviations seen in scientific publications, as well as on the results of the testing stage of the discovery strategy, several areas of improvement can be identified. To name a few, correlation criteria in the resolution phase can be extended and enhanced, in order to resolve more intricately correlated abbreviations. The application of a specific annotated ontology could also result in the detection of complex interrelationships among lexical terms, provided that the system takes proper advantage of it by introducing a deeper semantic approach. Moreover, the accuracy of the protein name matching phase will have to be strengthened, by extending its post-processing step in order to increase its precision when dealing with non-protein abbreviation explanations. Furthermore, the described methodology is meant to be extended from the current paper-by-paper basis to a corpus-based approach, in order to deal with cross-references among scientific articles. Many abbreviations, in fact, are only cited but not explained within a given paper, and thus cannot be treated as resolvable SF-LF pairs, as no definition is explicitly given for them within the context of a single article.
Moreover, there are situations (some of them preliminarily tackled via correlation expressions) where the explanation of an abbreviation includes other abbreviations as well, often not defined within that context. One such example is BRCT, whose expansion is BRCA1 carboxy terminal domain. Extending the PRAISED approach along these guidelines will make it possible to resolve these special cases as well and further improve the resolution potential of PRAISED, thus providing users with an improved human-computer interface for extracting protein information from the literature. However, it must be kept in mind that there is a significant number of cases in which the SF-LF pairs are probably simply irresolvable. For example, one abbreviation encountered in the analysis of the annotated corpus used in this study is ABC7/Atm1p, whose correct expansion, Membrane protein believed to be responsible for iron export from mitochondria, has no semantic relationship with the abbreviation itself. Finally, the PRAISED system will be made available shortly, along with the list of papers that make up the annotated corpus used for its testing phase and the related human-extracted SF-LF pairs. For its release, a provenance layer is currently under development, so that it will be possible to keep track of the whole discovery process and the information flow behind it. This way, the scientific community will be able to assess the correctness of the abbreviation archive with greater ease and reliability: each resolved abbreviation stored in the archive could be traced back to the paper(s) from which it was derived, along with the processing steps employed for its resolution. In the meantime, a beta version is available upon request to the authors.

References

1. Pustejovsky J, Castaño J, Cochran B, Kotecki M, Morrell M, Rumshisky A. 2001. Automatic Extraction of Acronym-meaning Pairs from MEDLINE Databases. Medinfo 10. 371-375.
2. Taghva K, Gilbreth J. 1999. Recognizing acronyms and their definitions. International Journal on Document Analysis and Recognition 1. 191-198.
3. Yeates S. 1999. Automatic extraction of acronyms from text. In Third New Zealand Computer Science Research Students' Conference. 117-124.
4. Larkey L, Ogilvie P, Price A, Tamilio B. 2000. Acrophile: An Automated Acronym Extractor and Server. In Proceedings of the ACM Digital Libraries conference. 205-214.
5. Park Y, Byrd RJ. 2001. Hybrid Text Mining for Finding Abbreviations and Their Definitions. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing.
6. Schwartz A, Hearst M. 2003. A simple algorithm for identifying abbreviation definitions in biomedical texts. In Proceedings of the Pacific Symposium on Biocomputing (PSB).
7. Yu H, Hripcsak G, Friedman C. 2002. Mapping abbreviations to full forms in biomedical articles. J. Am. Med. Inform. Assoc. 9. 262-272.
8. Nadeau D, Turney PD. 2005. A Supervised Learning Approach to Acronym Identification. In 18th Conference of the Canadian Society for Computational Studies of Intelligence. Canadian AI 2005.
9. Chang JT, Schutze H, Altman RB. 2002. Creating an Online Dictionary of Abbreviations from MEDLINE. J. Am. Med. Inform. Assoc. 9. 612-620.
10. Yoshida M, Fukuda K, Takagi T. 2000. PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary. Bioinformatics 16. 169-175.
11. Fukuda K, Tsunoda T, Tamura A, Takagi T. 1998. Toward Information Extraction: Identifying protein names from biological papers. In Proceedings of the Pacific Symposium on Biocomputing. 705-716.
12. Kuo C, Ling MHT, Lin KT, Hsu CN. 2009. BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature. BMC Bioinformatics 10. p7.
13. Ao H, Takagi T. 2005. Alice: an algorithm to extract abbreviations from medline. J. Am. Med. Inform. Assoc. 12. 576-586.
14. Gawlik M, Strable C. Comparison of abbreviation recognition algorithms. http://acm.mscs.mu.edu/wiki-reu/index.php/User:Mgawlik
15. Hersh W, Voorhees E. 2009. TREC genomics special issue overview. Information Retrieval 12. 1-15.
16. Atzeni P, Polticelli F, Toti D. 2011. An Automatic Identification and Resolution System for Protein-related Abbreviations in Scientific Papers. In EvoBio.
17. http://www.medstract.org/index.php?f=gold-standard
18. http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/
19. http://biotext.berkeley.edu/data.html
20. http://3.uvdb.dbcls.jp/ALICE/corpus download.html
21. http://www.uniprot.org/
22. http://lucene.apache.org
23. http://alias-i.com/lingpipe/
24. http://biotext.berkeley.edu/code/abbrev/ExtractAbbrev.java

APPENDIX A

Protein abbreviations list extracted from the full-text corpus of the manuscript by Toti et al., "Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework."

Note that multiple occurrences of the same abbreviation are due to the presence of the abbreviation in more than one paper of the full-text corpus; a single abbreviation can be undefined in one paper and defined in others, and can be defined in different ways in different papers.

Abbreviation    Definition
4-MD-2          undefined
ABC7/Atm1p      Membrane protein believed to be responsible for iron export from mitochondria
ABCA1           ATP-binding cassette transporter
ABCG5           Undefined
ABCG5           undefined
ABCG8           Undefined
ABCG8           undefined
Abl             Abelson tyrosine kinase
Abl             undefined
ACAT2           Undefined
ActA            Undefined
ActA            undefined
ActA            undefined
ActA            undefined
ActA            undefined
AE33            undefined
AFP             alpha-fetoprotein
AGAO            A. Globiformis CuAO
AHNAK           undefined
Akt             Undefined
Akt             undefined
Alb             Plasma albumin
AlstR           Allatostatin receptor
AlstR1          undefined
AP2             Undefined
AP2             undefined
AP2             undefined
AP2             undefined
Apaf-1          Apoptotic protease activating factor
APC             antigen presenting cell
apoA1           Undefined
Apo-AGAO apoE apoL-I APP APPs Arp2/3 ArsR Ash Aso1 ATP7A ATP7B ATPase Axcyt c' BACH1 Bem1 BmrR bRC BRCA1 BRCT BRCT2 Brn-2 BSA BSA BSAO BsHemAt BtuB C/EBP alpha C/EBP C/EBP C/EPB-ER C1R C1S CAII CAM-SLR CB1 Cbp1p c-cbl CCP CCS CD116/CD18 CD14 CD163 CD163 CD163 CD163 CD163 Apo A. Globiformis CuAO Undefined apolipoprotein L-I amyloid precursor protein Acute phase proteins undefined undefined undefined Polyamine oxidase from C. bidinii Menkes disease protein Wilson disease gene product Undefined Alcaligenes xylosodans cytochrome c' undefined undefined undefined Reaction center of photosynthetic purple bacteria undefined BRCA1 carboxy terminal domain undefined undefined undefined undefined Bovine serum amine oxidase Haem based aerotaxis Transducer sensor domain of B. subtilis GCS Vitamin B12 transporter protein undefined member of the C/EBP transcription factor family CCAT enhancer binding protein-epsilon Undefined undefined undefined Human carbonic anhydrase Carasius somatostatin-like receptor Central cannabinoid receptor heterotrimeric GTP-binding protein Corticosteroid binding protein Cellular homologue of Casitas B lineage lymphoma protooncogene product Complement control protein undefined Surface antigens undefined Macrophage cell surface receptor undefined undefined Scavenger receptor CD163 Scavenger receptor CD163 CD22 Surface antigen CD2BP2 CD2 binding protein CD36 Undefined CD44 Raft associated hyaluronate transporter CD6 undefined Cdc25 undefined Cdc42 undefined CED-3-like C. 
elegans protein-like ChEs Cholinesterases C-jun undefined CK Creatine kinase c-kit undefined CoaR undefined COMMD1/MURR1 undefined CooA undefined Cox Cytochrome c oxidase Cp Ceruloplasmin Cp Ceruloplasmin Cp Ceruloplasmin CP43 Undefined CP43 undefined CP43 undefined CP43 undefined CP47 Undefined CP47 undefined CP47 undefined Crk undefined CS Citrate synthase CS Citrate synthase Cu,Zn SOD Copper, zinc superoxide dismutase CuAOs Copper containing amine oxidases CueO Bacterial cupreous oxidase CueR undefined Cyt b559 Cytochrome b559 D1 undefined D1 undefined D1 undefined D2 undefined D2 undefined DAP Drostatin DCT1 Divalent cation transporter Dcytb Duodenal cytochrome b deoxyMb Undefined Diff undefined DISC Death inducing signalling complex Dlar undefined DMT1 undefined DMT1 undefined DMT1/DCT1/Nramp2 Divalent metal transporter DNA Pol V undefined Dorsal undefined DpsA undefined DT40 undefined DtxR Difteria toxin repressor ECAO E. coli CuAO EDH1 undefined EDH2 undefined EDH3 undefined EDH4 undefined EGFP undefined EHBP1 undefined EHD EPS15 homology domain eMAP undefined Ena Drosophila-Enabled protein Ena Enabled adapter protein Ena undefined Ena undefined Ena/VASP Enabled/vasodilator-stimulated phosphoprotein ENA/VASP undefined Endo-H Endoglycosidase Eps8 Epidermal growth factor receptor substrate ER Estrogen receptor (undefined?) 
ERK: Extracellular signal-related kinase
ERK1: undefined
ERK2: undefined
ERP60: undefined
ERP72: undefined
Ess1/Pin1: Peptidyl prolyl cis/trans isomerase
EVH1: Ena/Vasp homology 1
EVH1: Enabled/Vasodilator stimulated phosphoprotein homology 1
EVH1: Enabled/VASP homology 1
EVH1: Enabled/VASP homology domain 1
EVH1/2: Ena/VASP homology 1/2
EVH2: undefined
EVL: Enabled/vasodilator-stimulated phosphoprotein-like protein
EVl: ENA/VASP like protein
EVL: Ena/VASP-like
FAAH: Fatty acid amide hydrolase
FE65: undefined
FEN2: Plasma membrane H+-pantothenate transporter
Fet3: Undefined
Fet3p: Undefined
Fet3p: undefined
Fet4: Undefined
FixL: undefined
FMS1: Polyamine oxidase from S. cerevisiae
Fpn FPR FPRL1 FRE1, 2 FRS2 FSH Ftr1 Ftr1p Fur Fyb/SLAP Fyb/SLAP Fyn Gamma-GCS Gamma-GTP Gap1 GAPDH GAPDH GAPDH GCN5 GCS G-CSF G-CSF GDNF Gel GFP GFP GH GHRH GHS-R GIRK1 GIRK1 Glu-C Glut4 GLUT-4 GM130 GNBP GNBP-1 GNBP-3 Gp340 GPCRs G-protein GPx GR Grb2 Grb2 GST GST GST
Ferroportin Formyl peptide receptor FPR related lipoxin A4 receptor Ferrireductase Fibroblast growth factor receptor substrate 2 Follicle-stimulating hormone Membrane permease undefined Ferric uptake regulator Fyn-binding and SLP-76 associated protein T cell signalling Fyn binding protein/SLP-76-associated protein undefined Gamma-glutamyl cysteine synthetase Gamma-glutamyl transpeptidase GTPase-activating protein Glyceraldehyde-3-phosphate dehydrogenase undefined undefined undefined Undefined Globin coupled sensor granulocyte colony stimulating factor (RESOLVED) granulocyte-colony-stimulating factor Glial cell line-derived neurotrophic factor Gelatinase Green fluorescent protein undefined Growth hormone Growth-hormone-releasing hormone G-protein coupled receptor undefined G protein-gated inwardly rectifying potassium channel undefined Insulin Responsive Glucose transporter Glucose transporter undefined Gram negative binding protein undefined undefined undefined G-protein coupled receptors GTP binding protein Glutathione peroxidase Glutathione 
reductase Growth factor receptor-bound 2 undefined Glutathione transferase Glutathione transferase Gluthatione transferase GTPase undefined GTPase undefined GTPase undefined HasA undefined Hb Hemoglobin Hb Hemoglobin Hb Hemoglobin Hb Hemoglobin Hb Hemoglobin Hb undefined Hck undefined HCS70 undefined HCS73 undefined HDL High density lipoprotein HFE undefined HFE Undefined HIV Tat undefined HIV-1 RT HIV-1 reverse transcriptase HIV-RT HIV reverse transcriptase HLA-H Undefined HMG CoA reductase Undefined HMG-CoA reductase undefined HMGCR HMG-CoA reductase HMGCR undefined HO-1 Heme oxygenase Holo-AGAO holo A. Globiformis CuAO Homer undefined Homer undefined Hp 1 Undefined Hp 1 Variant of the Hp gene Hp 1-1 major phenotype of Hp Hp 1-1 Undefined Hp 2 Undefined Hp 2 Variant of the Hp gene Hp 2-1 major phenotype of Hp Hp 2-1 Undefined Hp 2-2 major phenotype of Hp Hp 2-2 Undefined Hp Haptoglobin Hp Haptoglobin Hp Haptoglobin Hp Haptoglobin Hp Haptoglobin HP Hephaestin Hp Human hephaestin Hp undefined Hp1 undefined Hp1-1 Major phenotypic form of haptoglobin Hp1-1 Hp2 Hp2-1 Hp2-1 Hp2-2 Hp2-2 HpA0 HPAO Hpr Hpr Hs HS7C HSF-1 HSP1 HSP10 Hsp16.3 Hsp26 HSP27 HSP60 HSP60 HSP60 HSP70 HSP70 HSP70 HSP70 HSP70 HSP90 HSP90 HSPA8 hTII hTII HVA ICE-like IdeR IgA1 proteases IgG IL-6 Il-6 iNOS IP3Rs IREF2 IREG1 IREG1 Ireg1 IRP1 IRP1 IRP2 IRP2 IRPs 1 and 2 undefined undefined Major phenotypic form of haptoglobin undefined Major phenotypic form of haptoglobin undefined undefined H. 
polymorpha CuAO Haptoglobin related protein Haptoglobin related protein Haemosiderin undefined Heat shock transcription factor 1 undefined undefined Undefined undefined undefined undefined undefined undefined Heat shock protein undefined undefined undefined undefined undefined undefined undefined Human topoisomerase II Human topoisomerase II High voltage activated Ca++ channels Interleukin-1 converting enzyme-like undefined Undefined Undefined ploinflammatory cytokine NO synthase Inositol-1,4,5-trisphosphate receptors undefined undefined undefined Ferroportin 1 Iron regulatory proteins undefined Iron regulatory proteins undefined Iron regulatory proteins IRSp53 IscA IscS IscU Itk JH/JHs KBD L Lamp 1 Lamp 2 Lamp LasR LasR-LBD LDL LDL LDLR LDLs LEKTI Lf LFA-1 LFA-1 LfN LH LPRs LPS LRP5 LRP6 LSD1 LuxI/LuxR LuxR LVA LXR Lyn Lys M MAE2 MAO A MAO B MAOs MAP2 MaPgb MAPK MAPK MAPKAPK2 MARCO Mb Mb(s) MbCO undefined undefined undefined undefined undefined Juvenile hormone(s) undefined undefined undefined undefined Lysosome associated membrane protein Undefined Undefined Low density lipoprotein Undefined LDL receptor Low density lipoproteins Undefined Lactoferrin lymphocyte function-associated antigen 1 undefined N-terminal half-molecule of human lactoferrin Luteinizing hormone Low-density lipoprotein receptor related proteins undefined undefined undefined Histone lysine specific demetilase Undefined Undefined Low voltage activated Ca++ channels Liver X receptor undefined Lysozime undefined Malonamidase undefined Monoamine oxidase B Monoamine oxidases Microtubule associated protein 2 M. 
Acetivorans protoglobin Mitogen activated protein kinase Mitogen-activated protein kinase undefined undefined Myoglobin Myoglobin(s) Undefined Mb-Xe MDC1 MENA Mena Mena Mena MerR MerR MetAPs MFT MFT mGluR mGluRs MHC MLE MntR MPO MPR MR MRE11 MT MT1 MTP MTP1 Mtp1 MTP1 Myc Myo32 Myo32 Myo32 NafY Nav 1.8 NBS1 Nck Nck NCP Nedd-4 NEX4 NFBD1 NF-E2 NF-kB NF-kB NF-kB NF-kB NF-KB NGAL NifS NifU Undefined undefined Mammalian Ena Mammalian enabled Mammalian enabled adapter protein (Ena) undefined undefined undefined Methionine aminopeptidases Mitochondrial iron importer Undefined Metabotropic glutamate receptor metabotropic glutamate receptors Undefined Muconate lactonizing enzyme undefined myeloperoxidase Mannose 6-phosphate receptor Mandalate racemase undefined metallothionein MT Metallothioneins undefined undefined Metal transporter protein Metal transporter protein Ferroportin 1 undefined undefined undefined undefined undefined Tetrodoxin resistant sodium channel undefined undefined undefined Non collagen protein undefined C. 
elegans annexin undefined undefined Undefined undefined undefined Nuclear Factor kB undefined specific granule protein undefined undefined NikR NiSOD NK N-Mena Nod1 Nod2 NPC1 NPC1 NPC1 NPC1 NPC1l1 NPC1L1 NPC1L1 NPC2 Npw38 NPY-like Nramp2 Nramp2 NS3 NUMB Ovo OxyR P130 cas P34 cdc2 P38 MAPK p38 P53BP2 PAOs PARP1 PbrR PEBP2/CBF Pgb PGLYRP-1 PGLYRP-2 PGLYRP-3 PGLYRPs PGRPI-alpha PGRPI-beta PGRPI-L PGRPI-LB PGRPI-LC PGRPI-LE PGRPI-S PGRPI-SA PGRP-L PGRP-LC PGRP-LE PGRPs Nickel uptake regulator undefined Neurokinin-like receptors undefined undefined undefined Niemann-Pick C1 Undefined undefined undefined Niemann-Pick C1-like 1 Nieman-Pick C1 like 1 intestinal sterol transporter Nieman-Pick C1-like 1 Undefined undefined Undefined undefined undefined undefined undefined ovotransferrin undefined undefined undefined undefined undefined p53 binding protein Polyamine oxidases undefined undefined undefined Protoglobin undefined N-acetylmuramoyl-L-alanine amidase undefined undefined undefined undefined undefined undefined undefined undefined undefined undefined Long PGRPs undefined undefined Peptidoglycan recognition proteins PGRPs PGRP-S PGRP-S1 PGRP-SA PGRP-SD PH PI3K PIKK Pin 1 PKC Plc gamma Plc gamma PmxB PNGase F PNGase-F PPAR PPLO PQBP-1 PRL Prrp PS PsbA PsbB PsbC PsbE PsbF PsbH PsbJ PsbK PsbN PsbO PsbU PsbV PsbV PsbZ PSI PSI PSI PSII PSII PSII PSII PSII PSII PSTPIP PTP1B Rab11 Rab11 Peptidoglycan recognition proteins Short PGRPs Drosophila PGRP Drosophila PGRPs Drosophila PGRP Pleckstrin homology domain Phosphatidyl inositol 3-kinase Phosphoinositide-3-kinase-related protein kinase undefined Protein kinase C undefined undefined Polymyxin B Undefined peptide N-glycosidase F Peroxisome proliferator activated receptor P. 
Pastoris CuAO undefined Prolactin proline-rich RNA-binding protein Photosystem PSAO Pea CuAO undefined undefined undefined undefined undefined Undefined undefined undefined undefined undefined undefined Cytochrome c550 undefined undefined Photosystem I Photosystem I Photosystem I Photosystem II Photosystem II Photosystem II Photosystem II Photosystem II Photosystem II undefined Protein tyrosine phosphatise 1B undefined undefined Rab11a Rab11Fip2 Rab4 Rab7 Rac Rad50 Raf Raf1 Ran Ran Ran RanBP1 Ras Rb21 Rccyt c' RET RhoA RT RTK RTK RXR RyRs SAA SCAP Scap SDH SdiA Sema6A-1 Sema6A-1 SERCA SFR1 SFR1 SFR1 sGC Shank SHP-2 SHSPs SM22 SMF1 Smf1 SmtB SOD SOD SOD SOD1 SOD2 Sos SoxR undefined undefined undefined undefined undefined undefined undefined undefined undefined undefined undefined undefined undefined Undefined Rhodobacter capsulatus cytochrome c' undefined undefined Reverse transcriptase undefined Receptor tyrosine kinase Retinoid X receptor Ryanodine receptors Serum amyloid A undefined undefined Succinate dehydrogenase Undefined semaphorin 6A-1 Semaphorin 6A-1 Undefined undefined undefined undefined Soluble Guanylate Cyclase undefined Src homology 2 domain containing protein tyrosine phosphatise 2 Small heat shock proteins undefined Yeast manganese transporter Undefined undefined Superoxide dismutase Superoxide dismutase Superoxide dismutase undefined undefined undefined undefined SP Spa(AIM) Spreads Spred Spred-3 Sprouty Sprouty2 Sprouty3 SR-A SR-AI SR-B SR-B1 Src Src SREBP-2 SST SSTR SSTR2 SSTR23 STAT3 TASK-1 TCR TCTP/HRF TESK1 Tf TfR TfR TfR1 TfR2 TIM TIP60 TLF1 TLR2 TLR4 TLRs TNF TNF TNF TNFalpha TNF- Toll tPA TraR TRC8 TRH TrkA TRPS TRPV5 Serine protease undefined Sprouty related proteins with an EVH1 domain sprouty-related protein with EVH1 domain undefined undefined undefined undefined Scavenger receptor undefined undefined Scavenger receptor class B type 1 undefined undefined Undefined Somatostatin Somatostatin receptor undefined undefined Signal transducer 
and activator of transcription 3 undefined T-cell receptor undefined Testis-specific protein kinase-1 Transferrin Transferrin receptor Undefined at the first occurrence but TfR1 and Tfr2 Transferrin receptor 1 Transferrin receptor 2 Triosephosphate isomerase Histone acetyltransferase Trypanosome lytic factor-1 undefined undefined Toll like receptors Undefined Tumor necrosis factor Tumor necrosis factor Tumor necrosis factor alpha Tumor necrosis factor- undefined Tissue plasminogen activator Undefined undefined Thyrotropin-releasing hormone undefined Tryptophan synthase undefined TRPV6 TsH Tsk VAP-1 VASP VASP VASP VASP VASP VESL Vesl Vesl VP2 WASP WASP WASP WASP WH1 WIP YAP Yes Yfh1p ZntR ZO-1 γ-GT γ-GT ω-Aga IVA ω-CTx GVIA ω-CT undefined Thyroid-stimulating hormone undefined Vascular adhesion protein-1 undefined Vasodilator stimulated phosphoprotein Vasodilator stimulated phosphoprotein Vasodilator-stimulated phosphoprotein vasodilator-stimulated phosphoprotein undefined undefined undefined undefined Wiskott-Aldrich syndrome Wiskott-Aldrich syndrome protein Wiskott-Aldrich syndrome protein Wiskott-Aldrich syndrome protein WASP homology 1 WASP interacting protein Yes associated protein undefined Frataxin homolog undefined undefined γ-Glutamyl transpeptidase γ-Glutamyl transpeptidase ω-Agatoxin IVA ω-Conotoxin GVIA ω-Conotoxin MVIIC

APPENDIX B

List of bibliographic references of the papers building up the full-text corpus of the manuscript by Toti et al., "Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework."
Antioxidants & Redox Signaling 7, 2005, 964-972
Archives of Biochemistry and Biophysics 362 (1999) 67-78
Archives of Biochemistry and Biophysics 428 (2004) 22-31
Archives of Biochemistry and Biophysics 444 (2005) 15-26
Archives of Biochemistry and Biophysics 498 (2010) 83-88
Biochemical and Biophysical Research Communications 282, 904-909 (2001)
Biochemical and Biophysical Research Communications 303 (2003) 771-776
Biochemistry 1997, 36, 341-346
Biochemistry 1998, 37, 5394-5406
Biochemistry 2002, 41, 5963-5967
Biochemistry 2003, 42, 3464-3473
Biochemistry 2004, 43, 2829-2839
Biochemistry 2004, 43, 3289-3300
Biochemistry 2004, 43, 3979-3986
Biochemistry 2005, 44, 10914-10925
Biochemistry 2005, 44, 14725-14731
Biochemistry 2007, 46, 6097-6108
Biochimica et Biophysica Acta 1685 (2004) 8-13
Biochimica et Biophysica Acta 1757 (2006) 90-105
Biochimica et Biophysica Acta 1767 (2007) 79-87
Biochimica et Biophysica Acta 1791 (2009) 679-683
Biogerontology 3: 161-173, 2002
Biol. Chem. 383, 1667-1676, 2002
Bioorganic & Medicinal Chemistry 11 (2003) 21-29
Biophysical Chemistry 101-102 (2002) 145-153
Biophysical Journal 86 (2004) 3855-3862
Blood (2006) 108, 2946-2949
Blood (2006) 108, 353-361
BMC Biology 2007, 5:17
Cell 111, 733-745, 2002
Cell 123, 1213-1226, 2005
Cell 97, 471-480, 1999
Cell Metabolism 7, 508-519, 2008
Cell, 121, 1059-1069, 2005
Cell. Mol. Life Sci. 57 (2000) 1970-1977
Cell. Mol. Life Sci. 59 (2002) 1413-1427
Cellular Microbiology (2006) 8, 1059-1069
Current Biology 11 (2001) R399-R401
Current Enzyme Inhibition, 2005, 1, 85-95
Current Opinion in Cell Biology 2002, 14:88-103
Current Opinion in Structural Biology 2004, 14:447-453
Current Opinion in Structural Biology 2004, 14:765-774
EMBO 19 (2000) 5661-5671
EMBO reports 9 (2008) 157-163
Environ. Sci. Technol. 2005, 39, 5378-5384
Eur. J. Biochem. 264, 271-275, 1999
Experimental Cell Research 315 (2009) 119-126
Experimental Gerontology 39 (2004) 1475-1484
Expert Rev. Proteomics 1, (2004), 89-100
FASEB J. 14, 231-241 (2000)
FASEB J. 15, 1303-1305 (2001)
FEBS Journal 272 (2005) 1727-1738
FEBS Letters 499 (2001) 256-261
FEBS Letters 513 (2002) 45-52
FEBS Letters 564 (2004) 225-228
Inorg. Chem. 1998, 37, 4030-4039
Inorganica Chim Acta. 2005, 358, 2933-2942
International Journal of Biochemistry & Cell Biology 33 (2001) 940-959
J. Cell. Mol. Med. 8, 2004, 201-212
J. Med. Chem. 2006, 49, 3800-3808
J. Med. Chem. 2006, 49, 7754-7765
J. Mol. Biol. (2002) 317, 41-72
J. Mol. Biol. (2002) 324, 105-121
J. Mol. Biol. (2003) 328, 505-515
J. Mol. Biol. (2004) 338, 103-114
J. Mol. Biol. (2005) 347, 565-581
J. Mol. Biol. (2005) 350, 987-996
J. Mol. Biol. (2007) 371, 1038-1046
J. Neurochem. 67, 2155-2163 (1996)
J. Peptide Res., 2003, 61, 202-212
J. Phys. Chem. B 2004, 108, 12990-12998
J. Phys. Chem. B 2005, 109, 19929-19935
Journal of Biological Chemistry 271, 18379-18386, 1996
Journal of Biological Chemistry 275, 19906-19912, 2000
Journal of Biological Chemistry 275, 27940-27946, 2000
Journal of Biological Chemistry 277, 17209-17216, 2002
Journal of Biological Chemistry 277, 39937-39943, 2002
Journal of Biological Chemistry 279, 31842-31853, 2004
Journal of Biological Chemistry 279, 31873-31882, 2004
Journal of Biological Chemistry 281, 14241-14249, 2006
Journal of Biological Chemistry 281, 36477-36481, 2006
Journal of Biological Chemistry 282, 1072-1079, 2007
Journal of Biological Chemistry 282, 13592-13600, 2007
Journal of Cell Science 117, 2631-2639, 2004
Journal of Experimental Biology 203, 841-856 (2000)
Journal of Lipid Research 50, 2009, 1653-1662
Journal of Molecular Graphics and Modelling 19, 146-149, 2001
Journal of Neurochemistry, 2003, 85, 610-621
Mitochondrion 10 (2010) 83-93
Mol. Biol. Evol. 18(2):120-131. 2001
Mol. BioSyst., 2005, 1, 79-84
Molecular Biology of the Cell 17, 163-177, 2006
Nature 402 (1999) 656-660
Nature 409 (2001) 198-201
Nature 438 (2005) 1040-1044
Nature 450 (2007) 1201-1206
Nature 454 (2008) 1123-1127
Nature 454 (2008) 1127-1132
Nature Structural and Molecular Biology 12 (2005) 582-588
Nature, 389 (1997) 753-758
Neurobiology of Aging 21 (2000) 455-462
Neurology 2004;63:1912-1917
Nucleic Acids Research, 2004, 32, D129-D133
Photosynthesis Research (2005) 84: 153-159
Photosynthesis Research 77: 35-43, 2003
Physiol Rev 84:41-68, 2004
PNAS, 1999, 96, 2042-2047
PNAS, 2001, 98, 7760-7764
PNAS, 2002, 99, 1264-1269
PNAS, 2002, 99, 3505-3510
PNAS, 2003, 100, 9750-9755
PNAS, 2005, 102, 15459-15464
PNAS, 2005, 102, 8955-8960
PNAS, 2006, 103, 12999-13003
PNAS, 2006, 103, 1810-1815
Proteomics 2003, 3, 1154-1161
Science 298 (2002) 1793-1796
Science 303, 1831-1838 (2004)
Science, 286, 1999, 304-306
Toxicon 42 (2003) 391-398

Journal

Bio-Algorithms and Med-Systems, de Gruyter

Published: Jan 1, 2012

References