Deep linguistic processing


Deep linguistic processing is a natural language processing framework which draws on theoretical and descriptive linguistics. It models language predominantly by way of theoretical syntactic/semantic theory (e.g. CCG, HPSG, LFG, TAG, the Prague School). Deep linguistic processing approaches differ from "shallower" methods in that they yield more expressive and structural representations which directly capture long-distance dependencies and underlying predicate-argument structures. [1]
The knowledge-intensive approach of deep linguistic processing requires considerable computational power, and has in the past sometimes been judged intractable. However, research in the early 2000s made considerable advances in the efficiency of deep processing. [2] [3] Today, efficiency is no longer a major obstacle for applications that use deep linguistic processing.


Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Combinatory categorial grammar (CCG) is an efficiently parsable, yet linguistically expressive grammar formalism. It has a transparent interface between surface syntax and underlying semantic representation, including predicate-argument structure, quantification and information structure. The formalism generates constituency-based structures and is therefore a type of phrase structure grammar.
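As a rough illustration of how CCG categories drive both syntactic combination and semantic composition, the following Python sketch applies forward and backward application to a three-word sentence. The category encoding, the toy lexicon, and the example sentence are illustrative assumptions, not part of any CCG implementation.

```python
# Minimal sketch of CCG function application. Categories are either atomic
# strings or triples (result, slash, argument).
NP, S = "NP", "S"
VP = (S, "\\", NP)      # S\NP: looks for an NP to its left
TV = (VP, "/", NP)      # (S\NP)/NP: a transitive verb

def apply_forward(fn_cat, fn_sem, arg_cat, arg_sem):
    """X/Y  Y  =>  X, applying the function's semantics to the argument's."""
    result, slash, arg = fn_cat
    assert slash == "/" and arg == arg_cat
    return result, fn_sem(arg_sem)

def apply_backward(arg_cat, arg_sem, fn_cat, fn_sem):
    """Y  X\\Y  =>  X."""
    result, slash, arg = fn_cat
    assert slash == "\\" and arg == arg_cat
    return result, fn_sem(arg_sem)

# "Kim likes Sandy": the derivation builds the predicate-argument structure directly.
likes_sem = lambda obj: (lambda subj: ("likes", subj, obj))
vp_cat, vp_sem = apply_forward(TV, likes_sem, NP, "sandy")
s_cat, s_sem = apply_backward(NP, "kim", vp_cat, vp_sem)
print(s_cat, s_sem)     # S ('likes', 'kim', 'sandy')
```

The point of the sketch is the "transparent interface" mentioned above: each syntactic combination step performs one semantic application, so the predicate-argument structure falls out of the derivation itself.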

Lexical functional grammar (LFG) is a constraint-based grammar framework in theoretical linguistics. It posits two separate levels of syntactic structure, a phrase structure grammar representation of word order and constituency, and a representation of grammatical functions such as subject and object, similar to dependency grammar. The development of the theory was initiated by Joan Bresnan and Ronald Kaplan in the 1970s, in reaction to the theory of transformational grammar which was current in the late 1970s. It mainly focuses on syntax, including its relation with morphology and semantics. There has been little LFG work on phonology.


Contrast to "shallow linguistic processing"

Traditionally, deep linguistic processing has been concerned with computational grammar development (for use in both parsing and generation). These grammars were developed and maintained manually and were computationally expensive to run. In recent years, machine learning approaches (also known as shallow linguistic processing) have fundamentally altered the field of natural language processing: robust, wide-coverage NLP tools can be created rapidly with substantially less manual labor, and deep linguistic processing methods have consequently received less attention.

Parsing, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.
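For a concrete, minimal picture of parsing against a formal grammar, the sketch below uses NLTK's chart parser over a toy context-free grammar; the grammar and the sentence are invented for illustration, and the example assumes the nltk package is installed.

```python
# Toy constituency parse with NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | 'Kim'
VP -> V NP
Det -> 'the'
N -> 'book'
V -> 'read'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Kim read the book".split()):
    tree.pretty_print()
```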

However, some computational linguists hold that detailed syntactic and semantic representations are necessary if computers are to understand natural language or draw inferences from it. Shallow methods may lack this kind of 'understanding': while humans easily grasp a sentence and its meaning, a shallow system can be misled by surface patterns. For example: [4]

Inferences are steps in reasoning, moving from premises to logical consequences; etymologically, the word infer means to "carry forward". Inference is traditionally divided into deduction and induction, a distinction that in Europe dates at least to Aristotle. Deduction is inference deriving logical conclusions from premises known or assumed to be true, with the laws of valid inference studied in logic. Induction is inference from particular premises to a universal conclusion. A third type, abduction (inference to the best explanation), is sometimes distinguished, notably by Charles Sanders Peirce.

In linguistics and natural language processing, semantic analysis is the process of relating syntactic structures, from phrases and clauses up to the sentence as a whole, to their language-independent meanings.

a) Things would be different if Microsoft were located in Georgia.

In sentence (a), a shallow information extraction system might wrongly infer that Microsoft is located in Georgia, whereas a human reader understands from the counterfactual that Microsoft has never been located there.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video and documents, can also be seen as information extraction.

b) The National Institute for Psychobiology in Israel was established in May 1971 as the Israel Center for Psychobiology by Prof. Joel.

In sentence (b), a shallow system could wrongly infer that Israel was established in May 1971; a human reader knows that it was the National Institute for Psychobiology that was established then.
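To make the contrast concrete, the following Python sketch implements a deliberately naive, pattern-based extractor of the sort a shallow pipeline might use; run on the two example sentences, it produces exactly the faulty inferences described above. The patterns are illustrative assumptions, not taken from any particular system.

```python
# Naive pattern-based extraction over the two example sentences.
import re

sentences = [
    "Things would be different if Microsoft were located in Georgia.",
    "The National Institute for Psychobiology in Israel was established in "
    "May 1971 as the Israel Center for Psychobiology by Prof. Joel.",
]

located_in  = re.compile(r"(\w+) (?:is|was|were) located in (\w+)")
established = re.compile(r"(\w+) was established in (\w+ \d{4})")

for s in sentences:
    for m in located_in.finditer(s):
        print("LOCATED_IN:", m.group(1), "->", m.group(2))
    for m in established.finditer(s):
        print("ESTABLISHED:", m.group(1), "->", m.group(2))

# LOCATED_IN: Microsoft -> Georgia   (the counterfactual "if ... were" is ignored)
# ESTABLISHED: Israel -> May 1971    (the date attaches to the nearest noun)
```

A deep analysis would instead recognize the counterfactual conditional in (a) and the full noun phrase headed by "Institute" in (b), avoiding both errors.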
In summary, deep linguistic processing provides a knowledge-rich analysis of language through manually developed grammars and language resources, whereas shallow linguistic processing provides a knowledge-lean analysis through statistical or machine-learning manipulation of texts and/or annotated linguistic resources.

An annotation is a metadatum attached to text, an image, or other data.

Sub-communities

"Deep" computational linguists are divided in different sub-communities based on the grammatical formalism they adopted for deep linguistic processing. The major sub-communities includes the:

Deep Linguistic Processing with HPSG - INitiative (DELPH-IN) is a collaboration where computational linguists worldwide develop natural language processing tools for deep linguistic processing of human language. The goal of DELPH-IN is to combine linguistic and statistical processing methods in order to computationally understand the meaning of texts and utterances.

Tree-adjoining grammar (TAG) is a grammar formalism defined by Aravind Joshi. Tree-adjoining grammars are somewhat similar to context-free grammars, but the elementary unit of rewriting is the tree rather than the symbol. Whereas context-free grammars have rules for rewriting symbols as strings of other symbols, tree-adjoining grammars have rules for rewriting the nodes of trees as other trees.
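The tree-rewriting idea can be sketched in a few lines of Python: elementary trees are nested tuples, and substitution replaces an open leaf with another elementary tree rooted in the same label. This is only a toy illustration under simplifying assumptions; adjunction, foot nodes, and the feature constraints of full TAG are omitted.

```python
# Toy tree rewriting in the spirit of TAG substitution. Trees are nested tuples
# (label, child1, child2, ...); a childless node with a phrasal label is an
# open substitution site.
def substitute(tree, target_label, elementary):
    """Replace the first open substitution site labelled target_label with
    the elementary tree; returns (new_tree, replaced?)."""
    label, *children = tree
    if not children and label == target_label:
        return elementary, True
    new_children, replaced = [], False
    for child in children:
        if not replaced:
            child, replaced = substitute(child, target_label, elementary)
        new_children.append(child)
    return (label, *new_children), replaced

# Elementary tree for a transitive verb, with open NP substitution sites:
likes = ("S", ("NP",), ("VP", ("V", ("likes",)), ("NP",)))
harry = ("NP", ("Harry",))
peanuts = ("NP", ("peanuts",))

tree, _ = substitute(likes, "NP", harry)
tree, _ = substitute(tree, "NP", peanuts)
print(tree)
# ('S', ('NP', ('Harry',)), ('VP', ('V', ('likes',)), ('NP', ('peanuts',))))
```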

The list above is not exhaustive; other communities also work on deep linguistic processing.


Related Research Articles

In linguistics, syntax is the set of rules, principles, and processes that govern the structure of sentences in a given language, usually including word order. The term syntax is also used to refer to the study of such principles and processes. The goal of many syntacticians is to discover the syntactic rules common to all languages.

In computational linguistics, word-sense disambiguation (WSD) is an open problem concerned with identifying which sense of a word is used in a sentence. The solution to this problem affects other areas, such as discourse analysis, improving the relevance of search engines, anaphora resolution, coherence, and inference.
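A minimal hands-on example of WSD is the simplified Lesk algorithm shipped with NLTK; the snippet below assumes the nltk package and its WordNet data are installed, and the example sentence is an illustrative assumption.

```python
# Simplified Lesk word-sense disambiguation with NLTK
# (requires the WordNet data, e.g. nltk.download('wordnet')).
from nltk.wsd import lesk

context = "I went to the bank to deposit my paycheck".split()
sense = lesk(context, "bank", pos="n")
print(sense)                # the Synset Lesk selects (Lesk is a weak baseline,
print(sense.definition())   # so the chosen sense may be surprising)
```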

Natural-language understanding (NLU) or natural-language interpretation (NLI) is a subtopic of natural-language processing in artificial intelligence that deals with machine reading comprehension. Natural-language understanding is considered an AI-hard problem.

Head-driven phrase structure grammar (HPSG) is a highly lexicalized, constraint-based grammar developed by Carl Pollard and Ivan Sag. It is a type of phrase structure grammar, as opposed to a dependency grammar, and it is the immediate successor to generalized phrase structure grammar. HPSG draws from other fields such as computer science and uses Ferdinand de Saussure's notion of the sign. It uses a uniform formalism and is organized in a modular way which makes it attractive for natural language processing.
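Since HPSG analyses are stated over feature structures combined by unification, the following sketch uses NLTK's (untyped) feature structures to show compatible information merging and conflicting information failing to unify; the feature names are illustrative, not the standard HPSG inventory.

```python
# Unification of feature structures with NLTK, in the spirit of the
# constraint-based machinery HPSG uses.
from nltk.featstruct import FeatStruct

verb_needs = FeatStruct("[SUBJ=[AGR=[NUM=sg, PER=3]]]")
subject    = FeatStruct("[SUBJ=[AGR=[NUM=sg]]]")
print(verb_needs.unify(subject))        # compatible information is merged

plural_subject = FeatStruct("[SUBJ=[AGR=[NUM=pl]]]")
print(verb_needs.unify(plural_subject)) # None: conflicting values fail to unify
```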

Charles J. Fillmore was an American linguist and Professor of Linguistics at the University of California, Berkeley. He received his Ph.D. in Linguistics from the University of Michigan in 1961. Fillmore spent ten years at The Ohio State University and a year as a Fellow at the Center for Advanced Study in the Behavioral Sciences at Stanford University before joining Berkeley's Department of Linguistics in 1971. Fillmore was extremely influential in the areas of syntax and lexical semantics.

Linguistic competence is the system of linguistic knowledge possessed by native speakers of a language. It is distinguished from linguistic performance, which is the way a language system is used in communication. Noam Chomsky introduced this concept in his elaboration of generative grammar, where it has been widely adopted; within that framework, competence is the only level of language that is studied.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.


In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data. The exploitation of treebank data has been important ever since the first large-scale treebank, The Penn Treebank, was published. However, although originating in computational linguistics, the value of treebanks is becoming more widely appreciated in linguistics research as a whole. For example, annotated treebank data has been crucial in syntactic research to test linguistic theories of sentence structure against large quantities of naturally occurring examples.
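As a small illustration of working with treebank-style data, the sketch below reads a Penn Treebank-style bracketing with NLTK; the sentence and its bracketing are an illustrative assumption rather than an actual Penn Treebank entry.

```python
# Reading a Penn Treebank-style bracketed parse with NLTK.
from nltk import Tree

bracketing = ("(S (NP (NNP Pierre) (NNP Vinken)) "
              "(VP (MD will) (VP (VB join) (NP (DT the) (NN board)))))")
tree = Tree.fromstring(bracketing)

tree.pretty_print()
print(tree.leaves())   # the words of the sentence
print(tree.pos())      # (word, part-of-speech) pairs
for np in tree.subtrees(lambda t: t.label() == "NP"):
    print("NP:", " ".join(np.leaves()))
```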

Sentence processing takes place whenever a reader or listener processes a language utterance, either in isolation or in the context of a conversation or a text.

Glue semantics, or simply Glue, is a linguistic theory of semantic composition and the syntax–semantics interface which assumes that meaning composition is constrained by a set of instructions stated within a formal logic. These instructions, called meaning constructors, state how the meanings of the parts of a sentence can be combined to provide the meaning of the sentence.

In computer science, SYNTAX is a system used to generate lexical and syntactic analyzers (parsers) for all kinds of context-free grammars (CFGs) as well as some classes of contextual grammars. It has been developed at INRIA (France) for several decades, mostly by Pierre Boullier, but has been free software only since 2007. SYNTAX is distributed under the CeCILL license.

Rule-based machine translation (RBMT) refers to machine translation systems based on linguistic information about the source and target languages, retrieved mainly from dictionaries and grammars covering the main semantic, morphological, and syntactic regularities of each language. Given input sentences, an RBMT system generates output sentences on the basis of a morphological, syntactic, and semantic analysis of both the source and target languages involved in the translation task.

The outline of natural language processing provides an overview of and topical guide to natural language processing.

Minimal recursion semantics (MRS) is a framework for computational semantics. It can be implemented in typed feature structure formalisms such as head-driven phrase structure grammar and lexical functional grammar. It is suitable for computational language parsing and natural language generation. MRS enables a simple formulation of the grammatical constraints on lexical and phrasal semantics, including the principles of semantic composition. This technique is used in machine translation.
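The flavor of an MRS analysis can be suggested with plain Python data: a bag of elementary predications with handles and underspecified scope constraints. The predicate names and structure below are a rough, simplified sketch, not the exact output of any DELPH-IN grammar.

```python
# A simplified MRS-style structure for "Every dog chased some cat":
# elementary predications with handles, plus qeq scope constraints.
mrs = {
    "top": "h0",
    "relations": [
        {"handle": "h1", "pred": "_every_q", "args": {"ARG0": "x1", "RSTR": "h2", "BODY": "h3"}},
        {"handle": "h4", "pred": "_dog_n",   "args": {"ARG0": "x1"}},
        {"handle": "h5", "pred": "_chase_v", "args": {"ARG0": "e1", "ARG1": "x1", "ARG2": "x2"}},
        {"handle": "h6", "pred": "_some_q",  "args": {"ARG0": "x2", "RSTR": "h7", "BODY": "h8"}},
        {"handle": "h9", "pred": "_cat_n",   "args": {"ARG0": "x2"}},
    ],
    # The qeq constraints leave quantifier scope underspecified: either
    # quantifier may outscope the other in a resolved reading.
    "constraints": [("h2", "qeq", "h4"), ("h7", "qeq", "h9"), ("h0", "qeq", "h5")],
}

for rel in mrs["relations"]:
    print(rel["handle"], rel["pred"], rel["args"])
```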

In computational linguistics, the term mildly context-sensitive grammar formalisms refers to several grammar formalisms that have been developed with the ambition to provide adequate descriptions of the syntactic structure of natural language.

Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing can thus be understood as extracting the precise meaning of an utterance. Applications of semantic parsing include machine translation, question answering and code generation. The phrase was first used in the 1970s by Yorick Wilks as the basis for machine translation programs working with only semantic representations.
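A toy rule-based semantic parser makes the task concrete: a few regular-expression patterns map question forms to logical forms. Both the patterns and the logical-form notation below are illustrative assumptions.

```python
# Toy rule-based semantic parsing: question patterns -> logical forms.
import re

RULES = [
    (re.compile(r"who wrote (.+)\?", re.I),
     lambda m: f"answer(x, wrote(x, '{m.group(1)}'))"),
    (re.compile(r"what is the capital of (.+)\?", re.I),
     lambda m: f"answer(x, capital('{m.group(1)}', x))"),
]

def semantic_parse(utterance):
    for pattern, build in RULES:
        match = pattern.match(utterance)
        if match:
            return build(match)
    return None   # no rule covers this utterance

print(semantic_parse("Who wrote Hamlet?"))              # answer(x, wrote(x, 'Hamlet'))
print(semantic_parse("What is the capital of France?")) # answer(x, capital('France', x))
```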

References

  1. Timothy Baldwin, Mark Dras, Julia Hockenmaier, Tracy Holloway King, and Gertjan van Noord. 2007. The impact of deep linguistic processing on parsing technology. In Proceedings of the 10th International Workshop on Parsing Technologies (IWPT 2007), pages 36–38, Prague, Czech Republic.
  2. Ulrich Callmeier. 2000. PET – a platform for experimentation with efficient HPSG processing techniques. Natural Language Engineering, 6(1):99–108.
  3. Hans Uszkoreit. 2002. New Chances for Deep Linguistic Processing. In Proceedings of COLING 2002, pages xiv–xxvii, Taipei, Taiwan.
  4. Ulrich Schäfer. 2007. Integrating Deep and Shallow Natural Language Processing Components – Representations and Hybrid Architectures. Ph.D. thesis, Faculty of Mathematics and Computer Science, Saarland University, Saarbrücken, Germany.