Materials for Lectures (FRVŠ: Horáček, Burgetová, Zámečníková, 2011)
| Lecture | Topic | Presentation | Printable |
|---------|-------|--------------|-----------|
| | Formal Description of Natural Languages | | |
| 01 | Aims of Linguistic Theory | PDF | PDF |
| 02 | Generative Power of Natural Languages | PDF | PDF |
| | NLP Formalisms | | |
| 03 | Augmented Transition Networks | PDF | PDF |
| 04 | Generalized Phrase Structure Grammar | PDF | PDF |
| 05 | Head-driven Phrase Structure Grammar | PDF | PDF |
| 06 | Lexical Functional Grammar | PDF | PDF |
| 07 | Lexicalized Tree Adjoining Grammar | PDF | PDF |
| 08 | Dependency Grammars | PDF | PDF |
| | Statistical NLP | | |
| 09 | Probabilistic Context-Free Grammar | PDF | PDF |
| | Alternative Approach | | |
| 10 | Scattered Context Grammar | PDF | PDF |
| | Parsing | | |
| 11 | LR Parsing | PDF | PDF |
| | All Lectures in One Archive | ZIP | ZIP |
Introduction
The study of natural languages was one of the key factors in the foundation of Formal Language Theory (FLT). The earliest works in the area of FLT were often motivated by the effort to understand and formally describe the syntax of natural languages. Since then, however, FLT and Natural Language Processing (NLP) have become quite distinct areas.
NLP studies many other aspects of natural languages besides their syntax (such as morphology or semantics). It is focused mainly on practical applications and involves various tasks (for example, spell and grammar checking or machine translation; in a broader sense, we can even include tasks such as speech recognition and synthesis).
FLT focuses on purely theoretical studies of formal models and their properties. Practical applications are often only briefly sketched out.
The materials available here are based on the connections between NLP and FLT. For the NLP-oriented reader, they should provide an overview of the formal background of the tools used in NLP. Several well-known models are presented. For those with primary interest in formal languages, these materials contain information about an important application area of FLT. All discussed topics are illustrated by practical examples.
Topics
Aims of Linguistic Theory PDF
The origins of computational linguistics are mostly connected with artificial intelligence (AI) applications.
The main concern is the syntax of natural languages.
The notion of transformational rules - useful for capturing generalizations within a language (for example, we can use them to transform an active English sentence into its passive counterpart); a schematic example follows.
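As a schematic illustration (the notation is simplified and not taken from the slides), the passive transformation can be written as a mapping from one structural description to another:

```latex
% Schematic passive transformation (illustrative notation only):
% "The dog chased the cat" => "The cat was chased by the dog"
NP_1 \;\; V \;\; NP_2 \;\Longrightarrow\; NP_2 \;\; \mathit{be} \;\; V\text{-}\mathit{en} \;\; (\mathit{by} \; NP_1)
```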
Generative Power of Natural Languages PDF
Where do natural languages fit in the Chomsky Hierarchy?
Natural languages are not regular.
Are natural languages context-free?
Introduction of Transformational Grammars.
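A standard argument in this debate (stated here in simplified form; the slides may use a different example) involves cross-serial dependencies, which pattern like the copy language - a language that is provably not context-free:

```latex
% Cross-serial dependencies (e.g. in Swiss German subordinate clauses)
% pattern like the copy language, which is not context-free:
L_{copy} = \{\, ww \mid w \in \{a, b\}^{*} \,\}
% Nested (centre-embedded) dependencies, by contrast, pattern like the
% context-free mirror language \{\, w w^{R} \mid w \in \{a, b\}^{*} \,\}.
```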
References
James Allen: Natural Language Understanding. The Benjamin/Cummings Publishing Company Inc., 1995
Robert N. Moll, Michael A. Arbib, Assaf J. Kfoury: An Introduction to Formal Language Theory. Springer-Verlag, 1988
NLP Formalisms
In this section, you can find several formalisms that are used in practice in NLP. Rather than trying to provide in-depth coverage of all their aspects, the materials present an overview of the available alternatives. The core principles of each model are described and illustrated by examples.
Augmented Transition Networks (ATN) PDF
One of the earliest models for the description of natural languages.
Recursive transition network with registers.
ATNs are Turing-complete.
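To make the idea concrete, here is a minimal ATN-style recognizer in Python. The networks, lexicon and state numbering are invented for this sketch, and the registers are only carried along to show the mechanism; a real ATN also attaches tests and actions to its arcs.

```python
# A toy ATN-style recognizer.  Arcs either consume a word of a given
# category ('cat'), recursively invoke a subnetwork ('push'), or
# accept ('pop').

LEXICON = {"the": "Det", "a": "Det", "dog": "N", "cat": "N", "chased": "V"}

NETWORKS = {  # state 0 is the start state of every network
    "S":  {0: [("push", "NP", 1)],
           1: [("cat", "V", 2)],
           2: [("push", "NP", 3)],
           3: [("pop", None, None)]},
    "NP": {0: [("cat", "Det", 1)],
           1: [("cat", "N", 2)],
           2: [("pop", None, None)]},
}

def traverse(net, state, words, i, regs):
    """Try to reach a pop arc of `net`; return the end position or None."""
    for kind, label, target in NETWORKS[net][state]:
        if kind == "pop":
            return i
        if kind == "cat" and i < len(words) and LEXICON.get(words[i]) == label:
            end = traverse(net, target, words, i + 1, {**regs, label: words[i]})
            if end is not None:
                return end
        if kind == "push":
            sub_end = traverse(label, 0, words, i, {})
            if sub_end is not None:
                end = traverse(net, target, words, sub_end, regs)
                if end is not None:
                    return end
    return None

def recognize(sentence):
    words = sentence.split()
    return traverse("S", 0, words, 0, {}) == len(words)

print(recognize("the dog chased a cat"))   # True
print(recognize("dog the chased a cat"))   # False
```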
Generalized Phrase Structure Grammar (GPSG) PDF
Created as an attempt to show that it is possible to describe natural languages in a context-free framework, without using transformations [4].
Apart from context-free rules, GPSG includes features and metarules.
GPSG itself is not used in many practical applications, but its basic principles inspired much more successful models (such as HPSG).
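As an example, the best-known GPSG metarule is the passive metarule of Gazdar et al. [4]; in simplified notation, it derives a passive VP rule from every rule that expands VP with an NP object:

```latex
% Simplified passive metarule: W stands for the rest of the
% right-hand side, and the by-phrase is optional.
VP \rightarrow W,\; NP \quad\Longrightarrow\quad VP[PAS] \rightarrow W,\; (PP[by])
```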
Head-driven Phrase Structure Grammar (HPSG) PDF
Non-derivational, unification-based grammar with declarative constraints.
The basic type in HPSG is the sign, formalized as a typed feature structure.
Widely used in practice in NLP.
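Since unification is the central operation of the formalism, a minimal Python sketch may be helpful. Feature structures are modelled here as plain nested dicts; real HPSG signs are typed and support structure sharing (re-entrancy), both of which this toy version omits.

```python
def unify(fs1, fs2):
    """Return the unification of two feature structures, or None on a clash."""
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for attr, val in fs2.items():
            if attr in result:
                sub = unify(result[attr], val)
                if sub is None:
                    return None           # clash below this attribute
                result[attr] = sub
            else:
                result[attr] = val
        return result
    return fs1 if fs1 == fs2 else None    # atomic values must match exactly

# A verb requiring a 3rd-person-singular subject vs. two nouns:
verb_subj = {"AGR": {"PER": "3rd", "NUM": "sg"}}
dog       = {"AGR": {"NUM": "sg"}}
dogs      = {"AGR": {"NUM": "pl"}}

print(unify(verb_subj, dog))    # {'AGR': {'PER': '3rd', 'NUM': 'sg'}}
print(unify(verb_subj, dogs))   # None - agreement clash
```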
Lexical Functional Grammar (LFG) PDF
Main focus is still on syntax, but LFG also covers its relation to morphology, semantics and pragmatics.
Multiple dimensions of structure: the two fundamental levels are c-structure (constituent structure) and f-structure (functional structure).
Used as the theoretical basis of various machine translation tools.
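Purely as an illustration (attribute names follow common LFG practice rather than any particular tool), the two levels for the sentence "John sleeps" can be modelled as follows:

```python
# c-structure: an ordinary phrase-structure tree.
c_structure = ("S", ("NP", ("N", "John")), ("VP", ("V", "sleeps")))

# f-structure: an attribute-value matrix, modelled as a nested dict;
# the PRED value records the predicate and its argument frame.
f_structure = {
    "PRED":  "sleep<SUBJ>",
    "TENSE": "present",
    "SUBJ":  {"PRED": "John", "NUM": "sg", "PERS": "3"},
}
```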
Lexicalized Tree Adjoining Grammar (LTAG) PDF
Elementary objects are trees, not strings (TAG is a “tree-generating system”).
Two basic operations over trees: substitution and adjunction.
Greater generative capacity than CFG - TAGs can generate some context-sensitive languages (such as a^n b^n c^n d^n).
Lexicalization means associating each elementary structure with a lexical item.
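The following Python sketch illustrates the adjunction operation on toy trees (lists of the form [label, child, ...], with a label ending in '*' marking the foot node); it demonstrates the operation itself, not any particular LTAG implementation:

```python
def substitute_foot(aux, subtree):
    """Replace the foot node of an auxiliary tree with `subtree`."""
    if isinstance(aux, str):
        return subtree if aux.endswith("*") else aux
    return [aux[0]] + [substitute_foot(c, subtree) for c in aux[1:]]

def adjoin(tree, label, aux):
    """Adjoin `aux` at the first node of `tree` labelled `label`."""
    if isinstance(tree, str):
        return tree
    if tree[0] == label:                   # splice the auxiliary tree in here;
        return substitute_foot(aux, tree)  # the old subtree goes to the foot
    return [tree[0]] + [adjoin(c, label, aux) for c in tree[1:]]

initial = ["S", ["NP", "John"], ["VP", ["V", "sleeps"]]]
aux     = ["VP", "VP*", ["ADV", "soundly"]]            # VP -> VP* ADV

print(adjoin(initial, "VP", aux))
# ['S', ['NP', 'John'], ['VP', ['VP', ['V', 'sleeps']], ['ADV', 'soundly']]]
```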
Dependency Grammars PDF
The term actually encompasses a whole group of particular formal models, such as:
Theory of Structural Syntax [V]
Word Grammar (WG) [II]
Functional Generative Description (FGD) [IV]
Meaning-Text Theory (MTT) [III]
Extensible Dependency Grammar (XDG) [I]
Dependency grammars are an alternative to phrase structure grammars: they capture relations between words directly, with no phrasal nodes.
Generally simpler, more robust and more portable than phrase structure grammars, but less informative.
Well suited to describing free word order languages (such as Czech).
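For illustration, the Czech sentence "Psa honí kočka" ("The cat chases the dog", with the object in first position) can be analyzed directly as head-dependent arcs; the labels below follow Prague-style analytical functions, chosen here just as an example:

```python
words = ["Psa", "honí", "kočka"]   # dog-ACC, chases, cat-NOM (1-based below)

# (head, dependent, label) arcs; index 0 is the artificial root node.
arcs = [
    (0, 2, "Pred"),   # root -> honí   (main predicate)
    (2, 1, "Obj"),    # honí -> Psa    (object, marked by accusative case)
    (2, 3, "Sb"),     # honí -> kočka  (subject, marked by nominative case)
]
```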
References
James Allen: Natural Language Understanding. The Benjamin/Cummings Publishing Company Inc., 1995
-
Andrew Carnie: Syntax: A Generative Introduction. Blackwell Publishing, Oxford, 2002
Gerald Gazdar, Ewan H. Klein, Geoffery K. Pullum, Ivan A. Sag: Generalized Phrase Structure Grammar. Harvard University Press, 1985
-
Robert N. Moll, Michael A. Arbib, Assaf J. Kfoury: An Introduction to Formal Language Theory. Springer-Verlag, 1988
Carl Pollard, Ivan A. Sag: Head-Driven Phrase Structure Grammar. University of Chicago Press, 1994
Yves Schabes, Aravind K. Joshi: Tree-Adjoining Grammars and Lexicalized Grammars. 1991
Dependency grammar formalisms
Ralph Debusmann, Denys Duchier, Geert-Jan Kruijff: Extensible Dependency Grammar: A New Methodology. In Proceedings of the Workshop on Recent Advances in Dependency Grammar, pp. 78-85, 2004
Richard Hudson: Word Grammar. Blackwell, 1984
Igor Mel'čuk: Dependency Syntax: Theory and Practice. State University of New York Press, 1988
Petr Sgall, Eva Hajičová, Jarmila Panevová, Jacob Mey: The meaning of the sentence in its semantic and pragmatic aspects. Springer, 1986
Lucien Tesnière: Éléments de syntaxe structurale. Editions Klincksieck, 1959
Statistical NLP
With large text corpora (both annotated and unannotated) now available for many languages, statistical approaches have become prominent in modern NLP. Instead of the traditional approach, where all relevant aspects of a language are described manually (hand-crafted rules etc.), we can apply machine learning methods to extract generalized information from a corpus.
Most of the models discussed in the previous section can be (and have been) adapted for use in statistical NLP.
Probabilistic Context-Free Grammar (PCFG) PDF
A context-free grammar in which every rule is assigned a probability.
While PCFG itself may not be used often in practice, it is presented here as a straightforward extension of CFG, a well-known model from classic FLT.
Also useful for demonstrating some principles of statistical NLP and machine learning (e.g., the Inside-Outside Algorithm).
Practical applications include, for example, grammar checking.
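A minimal Python sketch (the toy grammar and probabilities are invented for the example): the probabilities of all rules sharing a left-hand side sum to one, and the probability of a parse tree is the product of the probabilities of the rules it uses.

```python
# Rules are stored as (right-hand side, probability) pairs per nonterminal.
PCFG = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Det", "N"), 1.0)],
    "VP":  [(("V", "NP"), 0.7), (("V",), 0.3)],
    "Det": [(("the",), 1.0)],
    "N":   [(("dog",), 0.5), (("cat",), 0.5)],
    "V":   [(("chased",), 1.0)],
}

def tree_prob(tree, grammar):
    """Probability of a parse tree (nested tuples: (label, child, ...))."""
    if isinstance(tree, str):           # a terminal word
        return 1.0
    lhs, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = dict(grammar[lhs])[rhs]         # probability of the rule lhs -> rhs
    for child in children:
        p *= tree_prob(child, grammar)
    return p

tree = ("S", ("NP", ("Det", "the"), ("N", "dog")),
             ("VP", ("V", "chased"), ("NP", ("Det", "the"), ("N", "cat"))))
print(tree_prob(tree, PCFG))            # 0.5 * 0.7 * 0.5 = 0.175
```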
References
Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing. MIT Press, 1999
Alternative Approach
One of the major trends in FLT is so-called regulated rewriting. The basic idea is to increase the generative power of a given formal model (usually CFG) by adding some mathematically simple mechanism that controls (regulates) the generation of sentences. A well-known example is the Matrix Grammar [1]. Another way to increase the generative capacity is to modify the form of the rules, as is done in the Scattered Context Grammar [2].
Such models could potentially be useful in NLP as well, because they make it easy to capture even non-context-free features of natural languages.
Scattered Context Grammar (SCG) PDF
A modification of context-free grammar in which an n-tuple of nonterminals is rewritten simultaneously in each derivation step.
Using SCG and Transformational SCG in the description of English syntax is proposed and demonstrated in [3].
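For instance, the non-context-free language {a^n b^n c^n | n >= 1} has a very simple scattered context grammar (a standard textbook example, shown here in simplified notation):

```latex
% Each scattered context rule rewrites a tuple of nonterminals
% simultaneously, in the order in which they occur in the string:
(S) \rightarrow (ABC), \qquad
(A, B, C) \rightarrow (aA,\, bB,\, cC), \qquad
(A, B, C) \rightarrow (a,\, b,\, c)
% Sample derivation: S => ABC => aAbBcC => aabbcc.
```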
References
Jürgen Dassow, Gheorghe Păun: Regulated Rewriting in Formal Language Theory. Springer, 1989
Sheila A. Greibach, John E. Hopcroft: Scattered Context Grammars. In Journal of Computer and System Sciences, 3:233-247, 1969
-
Parsing
In practice, having a formal description of the syntax of a language is usually not enough. We need not only to generate well-formed sentences of a language; often we also want to analyze a given sentence - for instance, to decide whether it belongs to a given language, or to obtain further information about its constituents and structure.
Syntactic analysis (parsing) is an important part of FLT. For practical applications (in NLP, but also in other areas), we need efficient parsing methods and algorithms. This is the main reason why many of the formal models used in NLP are in some way based on CFG or context-free rules - CFG is the most powerful class of the Chomsky Hierarchy for which syntactic analysis can still be performed efficiently enough.
There are several well-known parsing algorithms for CFG, such as CKY or Chart Parsing. Some algorithms are only applicable to a given subset of CFGs (for example, LL or LR parsing), while others are general.
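To illustrate the general case, here is a compact CKY recognizer in Python for a grammar in Chomsky normal form (the toy grammar is invented for the example); its triple loop over spans and split points is what keeps context-free recognition polynomial, O(n^3):

```python
def cky(words, lexical, binary, start="S"):
    """lexical: {word: {preterminals}}; binary: {(B, C): {A}} for rules A -> B C."""
    n = len(words)
    # chart[i][j] = set of nonterminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical.get(w, ()))
    for width in range(2, n + 1):            # span length
        for i in range(n - width + 1):       # span start
            j = i + width
            for k in range(i + 1, j):        # split point
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= binary.get((B, C), set())
    return start in chart[0][n]

lexical = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
print(cky("the dog chased the cat".split(), lexical, binary))   # True
```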
Acknowledgement
This work was supported by the MŠMT FRVŠ grant FR97/2011/G1.