.. -*- mode: rst -*-
.. include:: ../definitions.txt

.. _chap-parse:

=======================
7. Grammars and Parsing
=======================

.. TODO: explain that cfg grammars must not mix lexical and phrasal RHS

------------
Introduction
------------

Early experiences with the kind of grammar taught in school are sometimes perplexing.
Your written work might have been graded by a teacher who red-lined all
the grammar errors they wouldn't put up with.
Like the plural pronoun or the dangling preposition in the last sentence,
or sentences like this one which lack a main verb.
If you learnt English as a second language, you might have found it difficult
to discover which of these errors need to be fixed (or *needs* to be fixed?).
Correct punctuation is an obsession for many writers and editors.
It is easy to find cases where changing punctuation changes meaning.
In the following example, the interpretation of a relative clause as
restrictive or non-restrictive depends on the presence of commas alone:

.. ex::
  .. _rest:
  .. ex:: The presidential candidate, who was extremely popular,
          smiled broadly.
  .. _nonrest:
  .. ex:: The presidential candidate who was extremely popular smiled broadly.

In rest_, we assume there is just one presidential candidate, and say
two things about her: that she was popular and that she smiled. In
nonrest_, on the other hand, we use the description `who was extremely
popular`:lx: as a means of identifying for the hearer which of several
candidates we are referring to. 

It is clear that some of these rules are important.
However, others seem to be vestiges of antiquated style.
Consider the injunction that :lx:`however`
|mdash| when used to mean *nevertheless* |mdash|
must not appear at the start of a sentence.
Pullum argues that Strunk and White [StrunkWhite1999]_ were
merely insisting that English usage should conform to "an utterly
unimportant minor statistical detail of style concerning adverb
placement in the literature they knew" [Pullum2005However]_.
This is a case where, a `descriptive`:em: observation about language use became
a `prescriptive`:em: requirement.  In NLP we usually discard such prescriptions,
and use grammar to formalize observations about language as it is used,
particularly as it is used in corpora.

In this chapter we present the fundamentals of syntax, focusing on constituency
and tree representations, before describing the formal notation of context free
grammar.  Next we present parsers as an automatic way to associate syntactic
structures with sentences.  Finally, we give a detailed presentation of simple
top-down and bottom-up parsing algorithms available in NLTK.
Before launching into the theory we present some more naive observations about
grammar, for the benefit of readers who do not have a background in linguistics.

.. _sec-more-observations-about-grammar:

-------------------------------
More Observations about Grammar
-------------------------------

Another function of a grammar is to explain our observations
about ambiguous sentences.  Even when the individual words are
unambiguous, we can put them together to create ambiguous sentences,
as in ambiguous_.

.. _ambiguous:
.. ex::
  .. ex:: Fighting animals could be dangerous.
  .. ex:: Visiting relatives can be tiresome.

A grammar will be able to assign two structures to each sentence,
accounting for the two possible interpretations.

Perhaps another kind of syntactic variation, word order, is easier to
understand.  We know that the two sentences `Kim likes Sandy`:lx: and
`Sandy likes Kim`:lx: have different meanings, and that `likes Sandy
Kim`:lx: is simply ungrammatical.  Similarly, we know that the
following two sentences are equivalent:

.. ex::
  .. ex:: The farmer *loaded* the cart with sand
  .. ex:: The farmer *loaded* sand into the cart

However, consider the semantically similar verbs `filled`:lx: and `dumped`:lx:.
Now the word order cannot be altered (ungrammatical sentences are
prefixed with an asterisk.)

.. ex::
  .. ex:: The farmer *filled* the cart with sand
  .. ex:: \*The farmer *filled* sand into the cart
  .. ex:: \*The farmer *dumped* the cart with sand
  .. ex:: The farmer *dumped* sand into the cart

A further notable fact is that we have no difficulty accessing the meaning of
sentences we have never encountered before.  It is not difficult to concoct an
entirely novel sentence, one that has probably never been used before
in the history of the language, and yet all speakers of the language
will agree about its meaning.  In fact, the set of possible sentences is
infinite, given that there is no upper bound on length.
Consider the following passage from a children's story, containing a
rather impressive sentence:

   You can imagine Piglet's joy when at last the ship came in sight of
   him. In after-years he liked to think that he had been in Very
   Great Danger during the Terrible Flood, but the only danger he had
   really been in was the last half-hour of his imprisonment, when
   Owl, who had just flown up, sat on a branch of his tree to comfort
   him, and told him a very long story about an aunt who had once laid
   a seagull's egg by mistake, and the story went on and on, rather
   like this sentence, until Piglet who was listening out of his
   window without much hope, went to sleep quietly and naturally,
   slipping slowly out of the window towards the water until he was
   only hanging on by his toes, at which moment, luckily, a sudden
   loud squawk from Owl, which was really part of the story, being
   what his aunt said, woke the Piglet up and just gave him time to
   jerk himself back into safety and say, "How interesting, and did
   she?"  when -- well, you can imagine his joy when at last he saw
   the good ship, Brain of Pooh (Captain, C. Robin; 1st Mate, P. Bear)
   coming over the sea to rescue him...  (from A.A. Milne *In which
   Piglet is Entirely Surrounded by Water*)

Our ability to produce and understand entirely new sentences, of
arbitrary length, demonstrates that the set of well-formed sentences in
English is infinite.  The same case can be made for any human language.

This chapter presents grammars and parsing, as the formal and
computational methods for investigating and modelling the linguistic
phenomena we have been touching on (or tripping over).
As we shall see, patterns of well-formedness and ill-formedness in a
sequence of words can be understood with respect to the underlying
:dt:`phrase structure` of the sentences.  We can develop formal
models of these structures using grammars and parsers.
As before, the motivation is natural language *understanding*.  How
much more of the meaning of a text can we access when we can reliably
recognize the linguistic structures it contains?  Having read in a
text, can a program 'understand' it enough to be able to answer simple
questions about "what happened" or "who did what to whom"  Also as
before, we will develop simple programs to process annotated corpora
and perform useful tasks.

-------------------------
What's the Use of Syntax?
-------------------------

Earlier chapters focussed on words: how to identify them, how to
analyse their morphology, and how to assign them to classes via
part-of-speech tags. We have also seen how to identify recurring
sequences of words (i.e. n-grams). Nevertheless, there seem to be
linguistic regularities which cannot be described simply in terms of
n-grams.

In this section we will see why it is useful to have some
kind of syntactic representation of sentences.  In particular, we will
see that there are systematic aspects of meaning which are much easier
to capture once we have established a level of syntactic structure.


Syntactic Ambiguity
-------------------

We have seen that sentences can be ambiguous.  If we overheard someone
say :lx:`I went to the bank`, we wouldn't know whether it was
a river bank or a financial institution.  This ambiguity concerns
the meaning of the word :lx:`bank`, and is a kind of :dt:`lexical
ambiguity`.

However, other kinds of ambiguity cannot be explained in terms of
ambiguity of specific words.  Consider a phrase involving
an adjective with a conjunction:
`old men and women`:lx:.
Does `old`:lx: have wider scope than `and`:lx:, or is it the other way
round? In fact, both interpretations are possible, and we can
represent the different scopes using parentheses:

.. ex::
  .. ex::  old (men and women)
  .. ex::  (old men) and women

One convenient way of representing this scope difference at a structural
level is by means of a `tree diagram`:dt:, as shown in tree-diagram_.

.. _tree-diagram:
.. ex::
  .. ex::
    .. tree:: (NP (Adj old)
                  (NP
                     (N men)
                     (Conj and)
                     (N women)))
  .. ex::
    .. tree:: (NP (NP
                     (Adj old)
                     (N men))
                  (Conj and)
                  (NP
                     (N women)))

Note that linguistic trees grow upside down: the node labeled `s`:gc:
is the `root`:dt: of the tree, while the `leaves`:dt: of the tree are
labeled with the words.

In NLTK, you can easily produce trees like this yourself with the
following commands:
 
    >>> from nltk_lite.parse import bracket_parse
    >>> tree = bracket_parse('(NP (Adj old) (NP (N men) (Conj and) (N women)))')
    >>> tree.draw()                 # doctest: +SKIP

We can construct other examples of syntactic ambiguity
involving the coordinating conjunctions `and`:lx: and `or`:lx:, e.g.
`Kim left or Dana arrived and everyone cheered`:lx:.
We can describe this ambiguity in terms of the relative
semantic `scope`:dt: of `or`:lx: and `and`:lx:.

For our third illustration of ambiguity, we look at
prepositional phrases.
Consider a sentence like: :lx:`I saw the man with a telescope`.  Who
has the telescope?  To clarify what is going on here, consider the
following pair of sentences:

.. ex::
  .. ex:: The policeman saw a burglar *with a gun*.
         (not some other burglar)
  .. ex:: The policeman saw a burglar *with a telescope*.
         (not with his naked eye)

In both cases, there is a prepositional phrase introduced by
:lx:`with`.  In the first case this phrase modifies the noun
:lx:`burglar`, and in the second case it modifies the verb :lx:`saw`.
We could again think of this in terms of scope: does the prepositional
phrase (`pp`:gc:) just have scope over the `np`:gc:
`a burglar`:lx:, or does it have scope over
the whole verb phrase? As before, we can represent the difference in terms
of tree structure:

.. _burglar:
.. ex::
  .. ex::
    .. tree:: (S <NP the policeman>
                 (VP (V saw)
                     (NP <NP the burglar>
                         <PP with a gun>)))
  .. ex::
    .. tree:: (S <NP the policeman>
                 (VP (V saw)
                     <NP the burglar>
                     <PP with a telescope>))

|nopar|
In burglar_\ a, the `pp`:gc: attaches to the `np`:gc:,
while in burglar_\ b, the `pp`:gc: attaches to the `vp`:gc:.

We can generate these trees in Python as follows:

    >>> s1 = '(S (NP the policeman) (VP (V saw) (NP (NP the burglar) (PP with a gun))))'
    >>> s2 = '(S (NP the policeman) (VP (V saw) (NP the burglar) (PP with a telescope)))'
    >>> tree1 = bracket_parse(s1)
    >>> tree2 = bracket_parse(s2)
   
We can discard the structure to get the list of `leaves`:dt:, and
we can confirm that both trees have the same leaves.
We can also see that the trees have different `heights`:dt: (given by the
number of nodes in the longest branch of the tree, starting at `s`:gc:
and descending to the words):

    >>> tree1.leaves()
    ['the', 'policeman', 'saw', 'the', 'burglar', 'with', 'a', 'gun']
    >>> tree1.leaves() == tree2.leaves()
    True
    >>> tree1.height() == tree2.height()
    False

In general, how can we determine whether a prepositional phrase
modifies the preceding noun or verb? This problem is known as
`prepositional phrase attachment ambiguity`:dt:.
The `Prepositional Phrase Attachment Corpus`:dt: makes it
possible for us to study this question systematically.  The corpus is
derived from the IBM-Lancaster Treebank of Computer Manuals and from
the Penn Treebank, and distills out only the essential information
about `pp`:gc: attachment. Consider the sentence from the WSJ
in ppattach-a_.  The corresponding line in the Prepositional Phrase
Attachment Corpus is shown in ppattach-b_.

.. ex::
  .. _ppattach-a:
  .. ex::
     Four of the five surviving workers have asbestos-related diseases,
     including three with recently diagnosed cancer.
  .. _ppattach-b:
  .. ex::
     ::

       16 including three with cancer N

|nopar|
That is, it includes an identifier for the original sentence, the
head of the relevant verb phrase (i.e., `including`:lx:), the head of
the verb's `np`:gc: object (`three`:lx:), the preposition
(`with`:lx:), and the head noun within the prepositional phrase
(`cancer`:lx:). Finally, it contains an "attachment" feature (``N`` or
``V``) to indicate whether the prepositional phrase attaches to
(modifies) the noun phrase or the verb phrase. 
Here are some further examples:

.. _ppattachments:
.. ex::
   :: 

     47830 allow visits between families N
     47830 allow visits on peninsula V
     42457 acquired interest in firm N
     42457 acquired interest in 1986 V

|nopar|
The PP attachments in ppattachments_ can also be made explicit by
using phrase groupings as in phrase-groupings_.

.. _phrase-groupings:
.. ex::
   :: 

     allow (NP visits (PP between families))
     allow (NP visits) (PP on peninsula)
     acquired (NP interest (PP in firm))
     acquired (NP interest) (PP in 1986)

Observe in each case that the argument of the verb is either a single
complex expression ``(visits (between families))`` or a pair of
simpler expressions ``(visits) (on peninsula)``.

We can access the Prepositional Phrase Attachment Corpus from NLTK as follows:

    >>> from nltk_lite.corpora import ppattach, extract
    >>> from pprint import pprint
    >>> item = extract(9, ppattach.dictionary('training'))
    >>> pprint(item)
    {'attachment': 'N',
     'noun1': 'three',
     'noun2': 'cancer',
     'prep': 'with',
     'sent': '16',
     'verb': 'including'}

If we go back to our first examples of `pp`:gc: attachment ambiguity,
it appears as though it is the `pp`:gc: itself (e.g., `with a gun`:lx:
versus `with a telescope`:lx:) that determines the attachment. However,
we can use this corpus to find examples where other factors come into play.
For example, it appears that the verb is the key factor in ppattach-verb_.

.. _ppattach-verb:
.. ex::
   :: 

     8582 received offer from group V
     19131 rejected offer from group N

Constituency
------------

We claimed earlier that one of the motivations for building syntactic
structure was to help make explicit how a sentence says "who did what
to whom". Let's just focus for a while on the "who" part of this
story: in other words, how can syntax tell us what the subject of a
sentence is? At first, you might think this task is rather simple
|mdash| so simple indeed that we don't need to bother with syntax. In
a sentence such as `The fierce dog bit the man`:lx:
we know that it is the dog that is doing the biting. So we could
say that the noun phrase immediately preceding the verb is the
subject of the sentence. And we might try to make this more explicit
in terms of sequences part-of-speech tags.  Let's try to come up with a simple
definition of `noun phrase`:idx:; we might start off with something
like this, based on our knowledge of noun phrase chunking (Chapter chap-chunk_):

.. ex::
    `dt jj* nn`:gc:

|nopar|
We're using regular expression notation here in the form of
`jj*`:gc: to indicate a sequence of zero or more `jj`:gc:\s. So this
is intended to say that a noun phrase can consist of a
determiner, possibly followed by some adjectives, followed by a
noun. Then we can go on to say that if we can find a sequence of
tagged words like this that precedes a word tagged as a verb, then
we've identified the subject. But now think about this sentence:

.. ex:: The child with a fierce dog bit the man.

|nopar|
This time, it's the child that is doing the biting. But the tag
sequence preceding the verb is:

.. ex::
    `dt nn in dt jj nn`:gc:

Our previous attempt at identifying the subject would have
incorrectly come up with `the fierce dog`:lx: as the subject.
So our next hypothesis would have to be a bit more complex. For
example, we might say that the subject can be identified as any string
matching the following pattern before the verb:

.. ex::
     `dt jj* nn (in dt jj* nn)*`:gc:

In other words, we need to find a noun phrase followed by zero or more
sequences consisting of a preposition followed by a noun phrase. Now
there are two unpleasant aspects to this proposed solution. The first
is aesthetic: we are forced into repeating the sequence of tags (`dt
jj* nn`:gc:) that constituted our initial notion of noun phrase, and
our initial notion was in any case a drastic simplification. More
worrying, this approach still doesn't work! For consider the following
example:

.. _seagull:
.. ex:: The seagull that attacked the child with the fierce dog bit the man.

|nopar|
This time the seagull is the culprit, but it won't be detected as subject by our
attempt to match sequences of tags. So it seems that we need a
richer account of how words are *grouped* together into patterns, and
a way of referring to these groupings at different points in the
sentence structure. This idea of grouping is often called
syntactic `constituency`:dt:.

As we have just seen, a well-formed sentence of a language is more
than an arbitrary sequence of words from the language.  Certain kinds
of words usually go together.  For instance, determiners like `the`:lx:
are typically followed by adjectives or nouns, but not by verbs.
Groups of words form intermediate structures called phrases or
:dt:`constituents`.  These constituents can be identified using
standard syntactic tests, such as substitution, movement and
coordination.  For example, if a sequence of words can be replaced
with a pronoun, then that sequence is likely to be a constituent.
According to this test, we can infer that the italicised string in the
following example is a constituent, since it can be replaced by
`they`:lx:\:

.. ex::
  .. ex:: *Ordinary daily multivitamin and mineral supplements* could 
         help adults with diabetes fight off some minor infections.
  .. ex:: *They* could help adults with diabetes fight off some minor
         infections.

In order to identify whether a phrase is the subject of a sentence, we
can use the construction called `Subject-Auxiliary Inversion`:dt: in
English. This construction allows us to form so-called Yes-No
Questions. That is, corresponding to the statement in have1_, we have
the question in have2_:

.. ex::
  .. _have1:
  .. ex:: All the cakes have been eaten.
  .. _have2:
  .. ex:: Have *all the cakes* been eaten?

Roughly speaking, if a sentence already contains an auxiliary verb,
such as `has`:lx: in have1_, then we can turn it into a Yes-No
Question by moving the auxiliary verb 'over' the subject noun phrase
to the front of the sentence. If there is no auxiliary in the
statement, then we insert the appropriate form of `do`:lx: as the
fronted auxiliary and replace the tensed main verb by its base form:

.. ex::
  .. ex:: The fierce dog bit the man.
  .. ex:: Did *the fierce dog* bite the man?

As we would hope, this test also confirms our earlier claim about the
subject constituent of seagull_:

.. ex:: Did *the seagull that attacked the child with the fierce dog* bite
       the man?

To sum up then, we have seen that the notion of constituent brings a
number of benefits. By having a constituent labeled `noun phrase`:gc:,
we can provide a unified statement of the classes of word that
constitute that phrase, and reuse this statement in describing noun
phrases wherever they occur in the sentence. Second, we can use the
notion of a noun phrase in defining the subject of sentence, which in
turn is a crucial ingredient in determining the "who does what to
whom" aspect of meaning.

More on Trees
-------------

A tree is a set of connected nodes, each of which is labeled with a
category.  It common to use a 'family' metaphor to talk about the
relationships of nodes in a tree: for example, `s`:gc: is the
`parent`:dt: of `vp`:gc:; conversely `vp`:gc: is a `daughter`:dt: (or
`child`:dt:) of `s`:gc:.  Also, since `np`:gc: and `vp`:gc: are both
daughters of `s`:gc:, they are also `sisters`:dt:. 
Here is an example of a tree:

.. ex::
  .. tree:: (S (NP Lee) (VP (V saw) (NP the dog)))

Although it is helpful to represent trees in a graphical format, for
computational purposes we usually need a more text-oriented
representation. One standard method (used in the Penn Treebank)
is to use a combination of bracket
and labels to indicate the structure, as shown here:

.. doctest-ignore::
      (S 
         (NP  'Lee')
         (VP 
            (V 'saw')
            (NP 
               (Det 'the')
               (N  'dog'))))

|nopar|
The conventions for displaying trees in NLTK are similar:

.. doctest-ignore::
      (S: (NP: 'Lee') (VP: (V: 'saw') (NP: 'the' 'dog')))

In such trees, the node value is a string containing the tree's
constituent type (e.g., `np`:gc: or `vp`:gc:), while the children encode
the hierarchical contents of the tree.

Although we will focus on syntactic trees, trees can be used to encode
`any`:em: homogeneous hierarchical structure that spans a sequence
of linguistic forms (e.g. morphological structure, discourse structure).
In the general case, leaves and node values do not have to be strings.

In NLTK, trees are created with the ``Tree`` constructor, which takes a
node value and a list of zero or more children.  Here's a couple of
simple trees:

    >>> from nltk_lite.parse import Tree
    >>> tree1 = Tree('NP', ['John'])
    >>> print tree1
    (NP: 'John')
    >>> tree2 = Tree('NP', ['the', 'man'])
    >>> print tree2
    (NP: 'the' 'man')

|nopar|
We can incorporate these into successively larger trees as follows:

    >>> tree3 = Tree('VP', ['saw', tree2])
    >>> tree4 = Tree('S', [tree1, tree3])
    >>> print tree4
    (S: (NP: 'John') (VP: 'saw' (NP: 'the' 'man')))

|nopar|
Here are some of the methods available for tree objects:

    >>> tree4[1]
    (VP: 'saw' (NP: 'the' 'man'))
    >>> tree4[1].node
    'VP'
    >>> tree4.leaves()
    ['John', 'saw', 'the', 'man']
    >>> tree4[1,1,0]
    'saw'

The printed representation for complex trees can be difficult to read.
In these cases, the ``draw`` method can be very useful. 
It opens a new window, containing a graphical representation
of the tree.  The tree display window allows you to zoom in and out;
to collapse and expand subtrees; and to print the graphical
representation to a postscript file (for inclusion in a document).

.. doctest-ignore::
    >>> tree3.draw()


.. image:: ../images/parse_draw.png
   :scale: 70




Treebanks (notes)
-----------------

The ``nltk_lite.corpora`` module defines the ``treebank`` corpus reader,
which contains a 10% sample of the Penn Treebank corpus.

    >>> from nltk_lite.corpora import treebank, extract
    >>> print extract(0, treebank.parsed())
    (S:
      (NP-SBJ:
        (NP: (NNP: 'Pierre') (NNP: 'Vinken'))
        (,: ',')
        (ADJP: (NP: (CD: '61') (NNS: 'years')) (JJ: 'old'))
        (,: ','))
      (VP:
        (MD: 'will')
        (VP:
          (VB: 'join')
          (NP: (DT: 'the') (NN: 'board'))
          (PP-CLR:
            (IN: 'as')
            (NP: (DT: 'a') (JJ: 'nonexecutive') (NN: 'director')))
          (NP-TMP: (NNP: 'Nov.') (CD: '29'))))
      (.: '.'))


.. pylisting:: indent-tree

    def indent_tree(t, level=0, first=False, width=8):
        if not first:
            print ' '*(width+1)*level,
        try:
            print "%-*s" % (width, t.node),
            indent_tree(t[0], level+1, first=True)
            for child in t[1:]:
                indent_tree(child, level+1, first=False)
        except AttributeError:
            print t

    >>> t = extract(0, treebank.parsed())
    >>> indent_tree(t)
     S        NP-SBJ   NP       NNP      Pierre
                                NNP      Vinken
                       ,        ,
                       ADJP     NP       CD       61
                                         NNS      years
                                JJ       old
                       ,        ,
              VP       MD       will
                       VP       VB       join
                                NP       DT       the
                                         NN       board
                                PP-CLR   IN       as
                                         NP       DT       a
                                                  JJ       nonexecutive
                                                  NN       director
                                NP-TMP   NNP      Nov.
                                         CD       29
              .        .


NLTK also includes a sample from the *Sinica Treebank Corpus*,
consisting of 10,000 parsed sentences drawn from the
*Academia Sinica Balanced Corpus of Modern Chinese*.
Here is a code fragment to read and display one of the trees in this corpus.

.. doctest-ignore::
    >>> from nltk_lite.corpora import sinica_treebank, extract
    >>> extract(3450, sinica_treebank.parsed()).draw()

.. _sinica-tree:
.. ex::
    .. image:: ../images/sinica-tree.png
       :scale: 70

Note that we can read tagged text from a Treebank corpus, using the ``tagged()`` method:

    >>> print extract(0, treebank.tagged())
    [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'),
    ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'),
    ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
    ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

Exercises
---------

#. |easy| Can you come up with grammatical sentences which have probably never
   been uttered before?  (Take turns with a partner.)  What does this tell you
   about human language?

#. |easy| Recall Strunk and White's prohibition against sentence-initial
   `however`:lx: used to mean "although".
   Do a web search for `however`:lx: used at the start of the sentence.
   How widely used is this construction?

#. |easy| Consider the sentence `Kim arrived or Dana left and everyone cheered`.
   Write down the parenthesized forms to show the relative scope of `and`:lx:
   and `or`:lx:.  Generate tree structures corresponding to both of these interpretations.

#. |easy| The ``Tree`` class implements a variety of other useful methods.
   See the ``Tree`` help documentation for more details, i.e. import
   the Tree class and then type ``help(Tree)``.

#. |easy| **Building trees:**

   a) Write code to produce two trees, one for each reading of the phrase
      `old men and women`:lx:

   #) Encode any of the trees presented in this chapter as a labeled
      bracketing and use the ``nltk_lite.parse`` module's
      ``bracket_parse()`` method to check that it is well-formed. Now
      use the ``draw()`` to display the tree.

   #) As in (a) above, draw a tree for `The woman saw a man last Thursday`:lx:.

#. |easy| Write a recursive function to traverse a tree and return the
   depth of the tree, such that a tree with a single node would have
   depth zero.  (Hint: the depth of a subtree is the maximum depth
   of its children, plus one.)

#. |easy| Analyze the A.A. Milne sentence about Piglet, by underlining all
   of the sentences it contains then replacing these with `s`:gc:
   (e.g. the first sentence becomes `s`:gc: `when`:lx` `s`:gc:).
   Draw a tree structure for this "compressed" sentence.  What are
   the main syntactic constructions used for building such a long
   sentence?

#. |soso| To compare multiple trees in a single window, we can use the
   ``draw_trees()`` method.  Define some trees and try it out:

    >>> from nltk_lite.draw.tree import draw_trees
    >>> draw_trees(tree1, tree2, tree3)

#. |soso| Using tree positions, list the subjects of the first 100
   sentences in the Penn treebank; to make the results easier to view,
   limit the extracted subjects to subtrees whose height is 2.

#. |soso| Inspect the Prepositional Phrase Attachment Corpus
   and try to suggest some factors that influence `pp`:gc: attachment.

#. |soso| In this section we claimed that there are linguistic regularities
   which cannot be described simply in terms of n-grams.
   Consider the following sentence, particularly the position of the phrase
   `in his turn`:lx:.  Does this illustrate a problem for an approach based
   on n-grams?

     `What was more, the in his turn somewhat youngish Nikolay Parfenovich
     also turned out to be the only person in the entire world to acquire a
     sincere liking to our "discriminated-against" public procurator.`
     (Dostoevsky: The Brothers Karamazov)

#. |soso| Write a recursive function that produces a nested bracketing for
   a tree, leaving out the leaf nodes, and displaying the non-terminal
   labels after their subtrees.  So the above example about Pierre
   Vinken would produce:
   ``[[[NNP NNP]NP , [ADJP [CD NNS]NP JJ]ADJP ,]NP-SBJ MD [VB [DT NN]NP [IN [DT JJ NN]NP]PP-CLR [NNP CD]NP-TMP]VP .]S``
   Consecutive categories should be separated by space.

.. recurse over tree to look for coordinate constructions (cf 4th
   example in chapter 1.1); (possible extension: callback function for Tree.subtrees())

#. |soso| Download several electronic books from Project Gutenberg.
   Write a program to scan these texts for any extremely long sentences.
   What is the longest sentence you can find?  What syntactic construction(s)
   are responsible for such long sentences?

#. |hard| One common way of defining the subject of a sentence `s`:gc: in
   English is as *the noun phrase that is the daughter of* `s`:gc: *and
   the sister of* `vp`:gc:.   Write a function that takes the tree for
   a sentence and returns the subtree corresponding to the subject of the
   sentence.  What should it do if the root node of the tree passed to
   this function is not `s`:gc:, or it lacks a subject?

--------------------
Context Free Grammar
--------------------

As we have seen, languages are infinite |mdash| there is no principled
upper-bound on the length of a sentence.  Nevertheless, we would like
to write (finite) programs that can process well-formed sentences.  It turns
out that we can characterize what we mean by well-formedness using a
grammar.  The way that finite grammars are able to describe an
infinite set uses `recursion`:dt:.  (We already came across this idea
when we looked at regular expressions: the finite expression ``a+`` is
able to describe the infinite set ``{a, aa, aaa, aaaa, ...}``).  Apart
from their compactness, grammars usually capture important structural
and distributional properties of the language, and can be used to map
between sequences of words and abstract representations of meaning.
Even if we were to impose an upper bound on sentence length to ensure
the language was finite, we would probably still want to come up with
a compact representation in the form of a grammar.

A `grammar`:dt: is a formal system which specifies which sequences of
words are well-formed in the language, and which provides one or more
phrase structures for well-formed sequences.  We will be looking at
:dt:`context-free grammar` (CFG), which is a collection of
`productions`:dt: of the form `s`:gc: |rarr| `np vp`:gc:.  This says
that a constituent `s`:gc: can consist of sub-constituents `np`:gc:
and `vp`:gc:. Similarly, the production `v`:gc: |rarr| ``'saw' | ``'walked'``
means that the constituent `v`:gc: can consist of the string
`saw`:lx: or `walked`:lx:.
For a phrase structure tree to be well-formed relative to
a grammar, each non-terminal node and its children must correspond to
a production in the grammar.


A Simple Grammar
----------------

Let's start off by looking at a simple context-free grammar.
By convention, the left-hand-side of the first production is
the `start-symbol`:dt: of the grammar, and all well-formed trees
must have this symbol as their root label.

.. _grammar1:
.. ex::

 | S |rarr| NP VP
 | NP |rarr| Det N | Det N PP
 | VP |rarr| V | V NP | V NP PP
 | PP |rarr| P NP
 |
 | Det |rarr| 'the' | 'a'
 | N |rarr| 'man' | 'park' | 'dog' | 'telescope'
 | V |rarr| 'saw' | 'walked'
 | P |rarr| 'in' | 'with'

This grammar contains productions involving various syntactic categories,
as laid out in Table syncat_.

.. table:: syncat

   ======    ====================    =====================
   Symbol    Meaning                 Example
   ======    ====================    =====================
   S         sentence                `the man walked`:lx:
   NP        noun phrase             `a dog`:lx:
   VP        verb phrase             `saw a park`:lx:
   PP        prepositional phrase    `with a telescope`:lx:
   ...       ...                     ...
   Det       determiner              `the`:lx:
   N         noun                    `dog`:lx:
   V         verb                    `walked`:lx:
   P         preposition             `in`:lx:
   ======    ====================    =====================

   Syntactic Categories

In our following discussion of grammar, we will use the following terminology.
The grammar consists of productions, where each production involves a
single `non-terminal`:dt: (e.g. `s`:gc:, `np`:gc:), an arrow, and one
or more non-terminals and `terminals`:dt: (e.g. `walked`:lx:).
The productions are often divided into two main groups.
The `grammatical productions`:dt: are those without a terminal on
the right-hand side.  The `lexical productions`:dt: are those having
a terminal on the right-hand side.
A special case of non-terminals are the `pre-terminals`:dt:, which
appear on the left-hand side of lexical productions.
We will say that a grammar `licenses`:dt: a tree if each non-terminal
`x`:gc: with children `y`:gc:\ :subscript:`1` ... `y`:gc:\ :subscript:`n`
corresponds to a production in the grammar of the form:
`x`:gc: |rarr| `y`:gc:\ :subscript:`1` ... `y`:gc:\ :subscript:`n`.

In order to get started with developing simple grammars of your own, you
will probably find it convenient to play with the recursive descent
parser demo, ``from nltk_lite.draw.rdparser.demo()``.
The demo opens a window which displays a list of grammar productions in the
lefthand pane and the current parse diagram in the central pane:

.. image:: ../images/parse_rdparsewindow.png
   :scale: 100

The demo comes with the grammar in grammar1_ already loaded. We will
discuss the parsing algorithm in greater detail below, but for the
time being you can get an idea of how it works by using the *autostep* button.
If we parse the string `The dog saw a man in the park` using
the grammar in grammar1_, we end up with two trees:

.. ex::
  .. ex::
    .. tree:: (S (NP (Det the) (N dog))
                 (VP (V saw)
                     (NP (Det a) (N man))
                     (PP (P in) (NP (Det the) (N park)))))
  .. ex::
    .. tree:: (S (NP (Det the) (N dog))
                 (VP (V saw)
                     (NP (Det a)
                         (N man)
                         (PP (P in) (NP (Det the) (N park))))))

Since our grammar licenses two trees for this sentence, the sentence is
said to be :dt:`structurally ambiguous`.  The ambiguity in question is called
a `prepositional phrase attachment ambiguity`:idx:, as we saw earlier in this chapter.
As you may recall, it is an ambiguity about attachment since the
`pp`:gc: `in the park`:lx: needs to be attached to one of two places
in the tree: either as a daughter of `VP`:gc: or else as a daughter of
`np`:gc:.
When the `pp`:gc: is attached to `vp`:gc:, the seeing event happened
in the park.  However, if the `pp`:gc: is attached to `np`:gc:,
then the man was in the park, and the agent of the seeing (the dog)
might have been sitting on the balcony of an apartment overlooking the park.
As we will see, dealing with ambiguity is a key challenge in parsing.

Recursion in syntactic structure
--------------------------------

Observe that sentences can be nested within sentences, with no limit
to the depth:

.. ex::
  .. ex:: Jodie won the 100m freestyle
  .. ex:: "The Age" reported that Jodie won the 100m freestyle
  .. ex:: Sandy said "The Age" reported that Jodie won the 100m freestyle
  .. ex::  I think Sandy said "The Age" reported that Jodie won the 100m freestyle

|nopar|
This nesting is explained in terms of :dt:`recursion`. A grammar is
said to be :dt:`recursive` if a category occurring on the lefthand
side of a production (such as `s`:gc: in this case) also appears on
the righthand side of a production. If this dual occurrence takes
place in *one and the same production*, then we have :dt:`direct
recursion`; otherwise we have :dt:`indirect recursion`. There is no
recursion in grammar1_. However, the grammar in grammar2_ illustrates both kinds of
recursive production:

.. _grammar2:
.. ex::
     .. parsed-literal::

        S  |rarr| NP VP
        NP |rarr| Det Nom | Det Nom PP | PropN
        Nom |rarr| Adj Nom | N
        VP |rarr| V | V NP | V NP PP | V S
        PP |rarr| P NP

        PropN |rarr| 'John' | 'Mary' 
        Det |rarr| 'the' | 'a'
        N |rarr| 'man' | 'woman' | 'park' | 'dog' | 'lead' | 'telescope' | 'butterfly'
        Adj  |rarr| 'fierce' | 'black' |  'big' | 'European'
        V |rarr| 'saw' | 'chased' | 'barked'  | 'disappeared' | 'said' | 'reported' 
        P |rarr| 'in' | 'with' 

Notice that the production `Nom`:gc: |rarr| `Adj Nom`:gc: (where
`Nom`:gc: is the category of nominals) involves direct
recursion on the category `Nom`:gc:, whereas indirect recursion on `s`:gc:
arises from the combination of two productions, namely `s`:gc: |rarr|
`np vp`:gc: and `vp`:gc: |rarr| `v s`:gc:.  

To see how recursion is handled in this grammar, consider the following
trees.  Example nested-nominals_ involves nested nominal phrases,
while nested-sentences_ contains nested sentences.

.. ex::
  .. ex::
    .. _nested-nominals:
    .. tree::
     (S (NP (Det a) (Nom (Adj fierce)(Nom (Adj black) (N dog))))
         (VP (V chased)
            (NP (Det the) (Nom (Adj big)(Nom (Adj European) (N butterfly))))))
  .. ex::
    .. _nested-sentences:
    .. tree::
      (S (NP (Det the) (N man))
         (VP (V said)
             (S (NP (Det the) (N woman))
                (VP (V thought)
                    (S (NP (Det the) (N dog))
                             (VP (V barked)))))))
   

If you did the exercises for the last section, you will have noticed
that the recursive descent parser fails to deal properly with the
following production: `np`:gc: |rarr| `np pp`:gc:.
From a linguistic point of view, this production is perfectly respectable,
and will allow us to derive trees like this:

.. ex::
  .. tree::
    (S (NP 
           (NP 
               (NP (Det the) (N man))
               (PP (P with) (NP  (Det a) (N dog))))
            (PP (P in  (NP  (Det the) (N park)))))
         (VP (V disappeared)))

More schematically, the trees for these compound noun phrases will be
of the following shape:

.. _leftrec:
.. ex::
  .. tree::
    (NP (NP (NP (NP Det N) PP) PP) PP)

The structure in leftrec_ is called a `left recursive`:dt: structure.
These occur frequently in analyses of English, and
the failure of recursive descent parsers to deal adequately with left
recursion means that we will need to find alternative approaches.

Heads, Complements and Modifiers
--------------------------------

Let us take a closer look at verbs.
The grammar grammar2_ correctly generates examples like subcat1_,
corresponding to the four productions with `vp`:gc: on the lefthand side:

.. _subcat1:
.. ex::
   .. ex:: The woman gave the telescope to the dog
   .. ex:: The woman saw a man
   .. ex:: A man said that the woman disappeared
   .. ex:: The dog barked

That is, `gave`:lx: can occur with a following `np`:gc: and `pp`:gc:; 
`saw`:lx: can occur with a following `np`:gc:; 
`said`:lx: can occur with a following `s`:gc:; 
and `barked`:lx: can occur with no following phrase.
In these cases, `np`:gc:, `pp`:gc: and `s`:gc: are called :dt:`complements`
of the respective verbs, and the verbs themselves are called
:dt:`heads` of the verb phrase.

However, there are fairly strong constraints on what verbs can occur
with what complements. Thus, we would like our grammars to mark the
following examples as ungrammatical [#]_:

.. _subcat2:
.. ex:: 
   .. ex:: \*The woman disappeared the telescope to the dog
   .. ex:: \*The dog barked a man
   .. ex:: \*A man gave that the woman disappeared
   .. ex:: \*A man said

.. [#] It should be borne in mind that it is possible to create
   examples which involve 'non-standard' but interpretable
   combinations of verbs and complements. Thus, we can, at a stretch,
   interpret `the man disappeared the dog`:lx: as meaning that the man
   made the dog disappear. We will ignore such examples here.

How can we ensure that our grammar correctly excludes the
ungrammatical examples in subcat2_?  We need some way of constraining
grammar productions which expand `vp`:gc: so that verbs *only* cooccur
with their correct complements. We do this by dividing the class of
verbs into `subcategories`:dt:, each of which is associated with a
different set of complements. For example, `transitive verbs`:dt: such
as `saw`:lx:, `kissed`:lx: and `hit`:lx: require a following `np`:gc:
object complement. Borrowing from the terminology of chemistry, we
sometimes refer to the `valency`:dt: of a verb, that is, its capacity
to combine with a sequence of arguments and thereby compose a verb
phrase.

Let's introduce a new category label for such verbs, namely
`tv`:gc: (for Transitive Verb), and use it in the following productions:

.. ex::
   .. parsed-literal::

     `vp`:gc: |rarr| `tv np`:gc:
     `tv`:gc: |rarr| 'saw' | 'kissed' | 'hit'

Now `*the dog barked the man`:lx: is excluded since we haven't listed
`barked`:lx: as a `V_tr`:gc:, but `the woman saw a man`:lx: is still allowed.
Table verbcat_ provides more examples of labels for verb subcategories.

.. table:: verbcat

   ======    ====================    ========================
   Symbol    Meaning                 Example
   ======    ====================    ========================
   IV        intransitive verb       *barked*
   TV        transitive verb         *saw a man*
   DatV      dative verb             *gave a dog to a man*
   SV        sentential verb         *said that a dog barked*
   ======    ====================    ========================

   Verb Subcategories

The revised grammar for `vp`:gc: will now look like this:

.. _subcat3:
.. ex::
   .. parsed-literal:: 

      `vp`:gc: |rarr| `datv np pp`:gc:
      `vp`:gc: |rarr| `tv np`:gc:
      `vp`:gc: |rarr| `sv s`:gc:
      `vp`:gc: |rarr| `iv`:gc: 

      `datv`:gc: |rarr| 'gave' | 'donated' | 'presented'
      `tv`:gc: |rarr| 'saw' | 'kissed' | 'hit' | 'sang'
      `sv`:gc: |rarr| 'said' | 'knew' | 'alleged'
      `iv`:gc: |rarr| 'barked' | 'disappeared' | 'elapsed' | 'sang'

Notice that according to subcat3_, a given lexical item can belong to more
than one subcategory. For example, `sang`:lx: can occur both with and
without a following `np`:gc: complement.

Dependency Grammar
------------------

Although we concentrate on phrase structure grammars in this chapter,
we should mention an alternative approach, namely `dependency
grammar`:dt:. Rather than taking starting from the grouping of words
into constituents, dependency grammar takes as basic the notion that
one word can be dependent on another (namely, its head). The head of a
sentence is usually taken to be the main verb, and every other word is
either dependent on this head, or connects to it through a path of
dependencies. Figure depgraph0_ illustrates a dependency graph, where
the head of the arrow points to the head of a dependency. 

.. _depgraph0:
.. ex::
    .. image:: ../images/depgraph0.png
       :scale: 30

As you will see, the arcs in Figure depgraph0_ are labeled with the
particular dependency relation that holds between a dependent and its
head. For example, `Esso`:lx: bears the subject relation to `said`:lx:
(which is the head of the whole sentence), and `Tuesday`:lx: bears a
verbal modifier (`vmod`:gc:) relation to `started`:lx:.

An alternative way of representing the dependency relationships is illustrated
in the tree depgraph1_,
where dependents are shown as daughters of their heads.

.. _depgraph1:
.. ex::
    .. tree:: (said Esso (started (field the Whiting) production Tuesday))

One format for encoding dependency information places each word on a
line, followed by its part-of-speech tag, the index of its head, and
the label of the dependency relation (cf. [Nivre2006MP]_). The index
of a word is implicitly given by the ordering of the lines (with 1 as the
first index). This is illustrated in the following code snippet:

    >>> from nltk_lite.contrib.dependency import DepGraph
    >>> dg = DepGraph().read("""Esso	NNP	2	SUB
    ... said	VBD	0	ROOT
    ... the 	DT	5	NMOD
    ... Whiting	NNP	5	NMOD
    ... field	NN	6	SUB
    ... started	VBD	2	VMOD
    ... production	NN	6	OBJ
    ... Tuesday	NNP	6	VMOD""")

As you will see, this format also adopts the convention that the head
of the sentence is dependent on an empty node, indexed as 0. We can
use the ``deptree()`` method of a ``DepGraph()`` object to build an |NLTK|
tree like that illustrated earlier in depgraph1_.

    >>> tree = dg.deptree()
    >>> tree.draw()

Formalizing Context Free Grammars
---------------------------------

We have seen that a CFG contains terminal and nonterminal symbols, and
productions which dictate how constituents are expanded into other
constituents and words. In this section, we provide some formal definitions.

A CFG is a 4-tuple |langle|\ `N`:math:, |Sigma|, `P`:math:, `S`:math:\
|rangle|\ , where:

*  |Sigma| is a set of *terminal* symbols (e.g., lexical items);

*  `N`:math: is a set of *non-terminal* symbols (the category labels); 

*  `P`:math: is a set of *productions* of the form `A`:math: |rarr| |alpha|, where

   * `A`:math: is a non-terminal, and 
   
   * |alpha| is a string of symbols from (`N`:math: |cup| |Sigma|)*
     (i.e., strings of either terminals or non-terminals);

*  `S`:math: is the *start symbol*.

A `derivation`:dt: of a string from a non-terminal `A`:math:
in grammar `G`:math: is the result of successively applying
productions from  `G`:math: to `A`:math:.  For example, deriv1_ is a
derivation of `the dog with a telescope`:lx: for the grammar in grammar1_.

.. _deriv1:
.. ex::
  ::

     NP
     Det N PP
     the N PP
     the dog PP
     the dog P NP
     the dog with NP
     the dog with Det N
     the dog with a N
     the dog with a telescope

Although we have chosen here to expand the leftmost non-terminal
symbol at each stage, this is not obligatory; productions can be
applied in any order. Thus,  derivation deriv1_
could equally have started off in the following manner:

.. _deriv2:
.. ex::
  ::

     NP
     Det N PP
     Det N P NP
     Det N with NP
     ...

We can also write derivation deriv1_ as:

.. _deriv3:
.. ex::
   `np`:gc: |DoubleRightArrow| `det n pp`:gc:
   |DoubleRightArrow| `the`:lx: `n pp`:gc:
   |DoubleRightArrow| `the dog`:lx: `pp`:gc:
   |DoubleRightArrow| `the dog`:lx: `p np`:gc:
   |DoubleRightArrow| `the dog with`:lx: `np`:gc:
   |DoubleRightArrow| `the dog with a`:lx: `n`:gc:
   |DoubleRightArrow| `the dog with a telescope`:lx:

where |DoubleRightArrow| means "derives in one step". 
We use |DoubleRightArrow|\ * to mean "derives in zero or more steps":

* |alpha| |DoubleRightArrow|\ * |alpha| for any string |alpha|, and

* if |alpha| |DoubleRightArrow|\ * |beta| and |beta|
  |DoubleRightArrow| |gamma|, then |alpha| |DoubleRightArrow|\ *
  |gamma|.

We write `A`:math: |DoubleRightArrow|\ * |alpha| to indicate that
|alpha| can be derived from  `A`:math:.

In NLTK, context free grammars are defined in the ``parse.cfg`` module.  The easiest
way to construct a grammar object is from the standard string representation
of grammars.  In Listing cfg_ we define a grammar and use it to parse a simple
sentence.  You will learn more about parsing in the next section.

.. pylisting:: cfg
   :caption: Context Free Grammars in NLTK

    from nltk_lite import parse
    grammar = parse.cfg.parse_cfg("""
      S -> NP VP
      VP -> V NP | V NP PP
      V -> "saw" | "ate"
      NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
      Det -> "a" | "an" | "the" | "my"
      N -> "dog" | "cat" | "cookie" | "park"
      PP -> P NP
      P -> "in" | "on" | "by" | "with"
      """)
    >>> from nltk_lite import tokenize
    >>> sent = list(tokenize.whitespace("Mary saw Bob"))
    >>> rd_parser = parse.RecursiveDescent(grammar)
    >>> for p in rd_parser.get_parse_list(sent):
    ...      print p
    (S: (NP: 'Mary') (VP: (V: 'saw') (NP: 'Bob')))
  

Exercises
---------

1. |easy| In the recursive descent parser demo, experiment with changing the
   sentence to be parsed by selecting *Edit Text* in the *Edit* menu.

#. |easy| Can the grammar in grammar1_ be used to describe sentences which are
   more than 20 words in length?

#. |soso| You can modify the grammar in the recursive descent parser demo
   by selecting *Edit Grammar*  in the *Edit* menu. Change
   the first expansion production, namely ``NP -> Det N PP``, to ``NP -> NP
   PP``. Using the *Step* button, try to build a parse tree. What happens?

#. |soso| Extend the grammar in grammar2_ with productions which expand prepositions as
   intransitive, transitive and requiring a `pp`:gc:
   complement. Based on these productions, use the method of the
   preceding exercise to draw
   a tree for the sentence `Lee ran away home`:lx:\.

#. |soso| Pick some common verbs and complete the following tasks:

   a) Write a program to find those verbs in the PP Attachment Corpus
      included with NLTK.  Find any cases where the same verb
      exhibits two different attachments, but where the first noun,
      or second noun, or preposition, stay unchanged (as we saw in
      the PP Attachment Corpus example data above).

   b) Devise CFG grammar productions to cover some of these cases.

#. |hard| **Lexical Acquisition:**
   As we saw in Chapter chap-chunk_, it is possible 
   to collapse chunks down to their chunk label.  When we do this
   for sentences involving the word `gave`:lx:, we find patterns
   such as the following::

      gave NP
      gave up NP in NP
      gave NP up
      gave NP NP
      gave NP to NP
     
   a) Use this method to study the complementation patterns of a verb
      of interest, and write suitable grammar productions.

   b) Identify some English verbs that are near-synonyms, such as the
      :lx:`dumped/filled/loaded` example from earlier in this chapter.
      Use the chunking method to study the complementation patterns of
      these verbs.  Create a grammar to cover these cases.  Can the verbs
      be freely substituted for each other, or are their constraints?
      Discuss your findings.

-------
Parsing
-------

A :dt:`parser` processes input sentences according to the
productions of a grammar, and builds one or more
constituent structures which conform to the grammar.
A grammar is a declarative specification of well-formedness.
In NLTK, it is just a multi-line string; it is not itself a
program that can be used for anything.
A parser is a procedural interpretation of the grammar.
It searches through the space of trees licensed by a grammar
to find one that has the required sentence along its fringe.

Parsing is important in both linguistics and natural language processing.
A parser permits a grammar to be evaluated against
a potentially large collection of test sentences,
helping linguists to find any problems in their grammatical analysis.
A parser can serve as a model of psycholinguistic processing,
helping to explain the difficulties that humans have with processing
certain syntactic constructions.
Many natural language applications involve parsing at some point;
for example, we would expect the natural language questions
submitted to a question-answering system to undergo parsing as an initial step.

In this section we see two simple parsing algorithms,
a top-down method called recursive descent parsing,
and a bottom-up method called shift-reduce parsing.

.. _sec-recursive-descent:

Recursive Descent Parsing 
-------------------------

The simplest kind of parser interprets a grammar as a specification
of how to break a high-level goal into several lower-level subgoals.
The top-level goal is to find an `s`:gc:.  The `s`:gc: |rarr| `np vp`:gc:
production permits the parser to replace this goal with two subgoals:
find an `np`:gc:, then find a `vp`:gc:.  Each of these subgoals can be
replaced in turn by sub-sub-goals, using productions that have `np`:gc:
and `vp`:gc: on their left-hand side.  Eventually, this expansion
process leads to subgoals such as: find the word `telescope`:lx:.  Such
subgoals can be directly compared against the input string, and
succeed if the next word is matched.  If there is no match the parser
must back up and try a different alternative.

The recursive descent parser builds a parse tree during the above
process.  With the initial goal (find an `s`:gc:), the `s`:gc: root node
is created.  As the above process recursively expands its goals using
the productions of the grammar, the parse tree is extended downwards
(hence the name *recursive descent*).  We can see this in action using
the parser demonstration ``nltk_lite.draw.rdparser.demo()``.
Six stages of the execution of this parser are shown in Table rdparser_.

.. table:: rdparser

   +---------------------------+--------------------------+---------------------------+
   | |rdparser1|               | |rdparser2|              | |rdparser3|               |
   |                           |                          |                           |
   | a. Initial stage          | b. 2nd production        | c. Matching `the`:lx:     |
   +---------------------------+--------------------------+---------------------------+
   | |rdparser4|               | |rdparser5|              | |rdparser6|               |
   |                           |                          |                           |
   | d. Cannot match `man`:lx: | e. Completed parse       | f. Backtracking           |
   +---------------------------+--------------------------+---------------------------+

   Six Stages of a Recursive Descent Parser
   
.. |rdparser1| image:: ../images/rdparser1.png
.. |rdparser2| image:: ../images/rdparser2.png
.. |rdparser3| image:: ../images/rdparser3.png
.. |rdparser4| image:: ../images/rdparser4.png
.. |rdparser5| image:: ../images/rdparser5.png
.. |rdparser6| image:: ../images/rdparser6.png

During this process, the parser is often forced to choose between several
possible productions.  For example, in going from step c to step d, it
tries to find productions with `n`:gc: on the left-hand side.  The
first of these is `n`:gc: |rarr| `man`:lx:.  When this does not work
it `backtracks`:idx:, and tries other `n`:gc: productions in order, under it
gets to `n`:gc: |rarr| `dog`:lx:, which matches the next word in the
input sentence.  Much later, as shown in step e, it finds a complete
parse.  This is a tree which covers the entire sentence, without any
dangling edges.  Once a parse has been found, we can get the parser to
look for additional parses.  Again it will backtrack and explore other
choices of production in case any of them result in a parse.

NLTK provides a recursive descent parser:

    >>> from nltk_lite import parse
    >>> rd_parser = parse.RecursiveDescent(grammar)
    >>> sent = list(tokenize.whitespace('Mary saw a dog'))
    >>> rd_parser.get_parse_list(sent)
    [('S': ('NP': 'Mary') ('VP': ('V': 'saw') ('NP': ('Det': 'a') ('N': 'dog'))))]

.. Note:: ``parse.RecursiveDescent()`` takes an optional parameter ``trace``. 
   If ``trace`` is greater than zero, then the parser will report the steps
   that it takes as it parses a text.

Recursive descent parsing has three key shortcomings.  First,
left-recursive productions like `np`:gc: |rarr| `np pp`:gc: send it
into an infinite loop.  Second, the parser wastes a lot of time
considering words and structures that do not correspond to the input
sentence.  Third, the backtracking process may discard parsed
constituents that will need to be rebuilt again later.  For example,
backtracking over `vp`:gc: |rarr| `v np`:gc: will discard the subtree
created for the `np`:gc:.  If the parser then proceeds with `vp`:gc:
|rarr| `v np pp`:gc:, then the `np`:gc: subtree must be created all
over again.

Recursive descent parsing is a kind of `top-down parsing`:dt:.
Top-down parsers use a grammar to *predict* what the input will be,
before inspecting the input!  However, since the input is available to
the parser all along, it would be more sensible to consider the input
sentence from the very beginning.  This approach is called
`bottom-up parsing`:dt:, and we will see an example in the next section.

.. _sec-shift-reduce:

Shift-Reduce Parsing 
--------------------

A simple kind of bottom-up parser is the `shift-reduce parser`:dt:.
In common with all bottom-up parsers, a shift-reduce
parser tries to find sequences of words and phrases that correspond
to the *right-hand* side of a grammar production, and replace them
with the left-hand side, until the whole sentence is reduced to
an `s`:gc:.

The shift-reduce parser repeatedly pushes the next input word onto a
stack (Section stacks-and-queues_); this is the `shift`:dt: operation.
If the top *n* items on the stack match
the *n* items on the right-hand side of some production,
then they are all popped off the stack, and the item on the left-hand
side of the production is pushed on the stack.  This replacement of
the top *n* items with a single item is the `reduce`:dt: operation.
(This reduce operation may only be applied to the top of the stack;
reducing items lower in the stack must be done before later items are
pushed onto the stack.)  The parser finishes when all the input is
consumed and there is only one item remaining on the stack, a parse
tree with an `s`:gc: node as its root.

The shift-reduce parser builds a parse tree during the above process.
If the top of stack holds the word `dog`:lx:, and if the grammar has a
production `n`:gc: |rarr| `dog`:lx:, then the reduce operation causes the word
to be replaced with the parse tree for this production.  For
convenience we will represent this tree as ``N(dog)``.  At a later
stage, if the top of the stack holds two items ``Det(the) N(dog)`` and
if the grammar has a production `np`:gc: |rarr| `det n`:gc: then the reduce
operation causes these two items to be replaced with ``NP(Det(the),
N(dog))``.  This process continues until a parse tree for the entire
sentence has been constructed.  We can see this in action using the
parser demonstration ``nltk_lite.draw.srparser.demo()``.
Six stages of the execution of this parser are shown in Figure srparser_.

.. table:: srparser

   +------------------------------------+------------------------------------+
   | .. image:: ../images/srparser1.png | .. image:: ../images/srparser2.png |
   |                                    |                                    |
   | a. Initial State                   | b. After one shift                 |
   +------------------------------------+------------------------------------+
   | .. image:: ../images/srparser3.png | .. image:: ../images/srparser4.png |
   |                                    |                                    |
   | c. After reduce shift reduce       | d. After recognizing the second NP |
   +------------------------------------+------------------------------------+
   | .. image:: ../images/srparser5.png | .. image:: ../images/srparser6.png |
   |                                    |                                    |
   | e. Complex NP                      | f. Final Step                      |
   +------------------------------------+------------------------------------+

   Six Stages of a Shift-Reduce Parser

NLTK provides ``parse.ShiftReduce()``, a simple
implementation of a shift-reduce parser.  This parser does not
implement any backtracking, so it is not guaranteed to find a parse
for a text, even if one exists.  Furthermore, it will only find at
most one parse, even if more parses exist.  We can provide an
optional ``trace`` parameter, which controls how verbosely the
parser reports the steps that it takes as it parses a text: 

    >>> sr_parse = parse.ShiftReduce(grammar, trace=2)
    >>> sent = list(tokenize.whitespace('Mary saw a dog'))
    >>> sr_parse.parse(sent)
    Parsing 'Mary saw a dog'
        [ * Mary saw a dog]
      S [ 'Mary' * saw a dog]
      R [ <NP> * saw a dog]
      S [ <NP> 'saw' * a dog]
      R [ <NP> <V> * a dog]
      S [ <NP> <V> 'a' * dog]
      R [ <NP> <V> <Det> * dog]
      S [ <NP> <V> <Det> 'dog' * ]
      R [ <NP> <V> <Det> <N> * ]
      R [ <NP> <V> <NP> * ]
      R [ <NP> <VP> * ]
      R [ <S> * ]
      ('S': ('NP': 'Mary') ('VP': ('V': 'saw') ('NP': ('Det': 'a') ('N': 'dog'))))

Shift-reduce parsers have a number of problems.
A shift-reduce parser may fail to parse the sentence, even though the
sentence is well-formed according to the grammar.  In such cases,
there are no remaining input words to shift, and there is no way to
reduce the remaining items on the stack, as exemplified in Table conflict_\ a.
The parser entered this blind alley at an earlier
stage shown in Table conflict_\ b, when it reduced instead of
shifted.  This situation is called a `shift-reduce conflict`:dt:.  At
another possible stage of processing shown in Table conflict_\ c,
the parser must choose between two possible reductions, both matching
the top items on the stack: `vp`:gc: |rarr| `vp np pp`:gc: or `np`:gc: |rarr|
`np pp`:gc:.  This situation is called a `reduce-reduce conflict`:dt:.

.. table:: conflict

   +------------------------------------------+
   | .. image:: ../images/srparser7.png       |
   |                                          |
   | a. Dead end                              |
   +------------------------------------------+
   | .. image:: ../images/srparser8.png       |
   |                                          |
   | b. Shift-reduce conflict                 |
   +------------------------------------------+
   | .. image:: ../images/srparser9.png       |
   |                                          |
   | c. Reduce-reduce conflict                |
   +------------------------------------------+

   Conflict in Shift-Reduce Parsing

.. To do: diagram showing search tree with success and failure.

Shift-reduce parsers may implement policies for resolving such
conflicts.  For example, they may address shift-reduce conflicts by
shifting only when no reductions are possible, and they may address
reduce-reduce conflicts by favouring the reduction operation that removes
the most items from the stack.  No such policies are failsafe however.

The advantages of shift-reduce parsers over recursive descent parsers
is that they only build structure that corresponds to the words in the
input.  Furthermore, they only build each sub-structure once,
e.g. ``NP(Det(the), N(man))`` is only built and pushed onto the stack
a single time, regardless of whether it will later be used by the `vp`:gc:
|rarr| `v np pp`:gc: reduction or the `np`:gc: |rarr| `np pp`:gc: reduction.

The Left-Corner Parser
----------------------

One of the problems with the recursive descent parser is that it can
get into an infinite loop.  This is because it applies the grammar
productions blindly, without considering the actual input sentence.
A left-corner parser is a hybrid between the bottom-up and top-down
approaches we have seen.

Grammar grammar2_ allows us to produce the following parse of `John saw
Mary`:lx:\ :

.. _jmtree:
.. ex::
  .. tree::
   (S (NP John) 
      (VP (V saw)
         (NP Mary)))

Recall that the grammar in grammar2_ has the following productions for expanding `np`:gc:\ :

.. ex::
   .. _r1:
   .. ex:: `np`:gc: |rarr| `dt nom`:gc:
   .. _r2:
   .. ex:: `np`:gc: |rarr| `dt nom pp`:gc:
   .. _r3:
   .. ex:: `np`:gc: |rarr| `propn`:gc: 

Suppose we ask you to first look at tree jmtree_, and then decide
which of the `np`:gc: productions you'd want a recursive descent parser to
apply first |mdash| obviously, r3_ is the right choice! How do you
know that it would be pointless to apply r1_ or r2_ instead? Because
neither of these productions will derive a string whose first word is
`John`:lx:.  That is, we can easily tell that in a successful
parse of `John saw Mary`:lx:, the parser has to expand `np`:gc: in
such a way that `np`:gc: derives the string `John`:lx: |alpha|. More
generally, we say that a category `B`:math: is a `left-corner`:dt: of
a tree rooted in `A`:math: if  `A`:math: |DoubleRightArrow|\ *
`B`:math: |alpha|.

.. ex::
  .. tree:: <A B a>

A `left-corner parser`:dt: is a top-down parser with bottom-up filtering.
Unlike an ordinary recursive descent parser, it does not get trapped
in left recursive productions. 
Before starting its work, a left-corner parser preprocesses the
context-free grammar to build a table where each row contains two
cells, the first holding a non-terminal, and the second holding the
collection of possible left corners of that non-terminal. Table lc_
illustrates this for the grammar from grammar2_.

.. table:: lc
 
   ========  ============================
   Category  Left-Corners (pre-terminals)
   ========  ============================
   S         NP
   NP        Det, PropN
   VP        V
   PP        P
   ========  ============================

   Left-Corners in grammar2_
 
Each time a production is considered by the parser, it checks that the
next input word is compatible with at least one of the pre-terminal
categories in the left-corner table.

[TODO: *explain how this effects the action of the parser, and why this solves the problem.*]

Exercises
---------

#. |easy| With pen and paper, manually trace the execution of a recursive descent
   parser and a shift-reduce parser, for a CFG you have already seen, or one
   of your own devising.

#. |soso| Compare the performance of the top-down, bottom-up, and left-corner
   parsers using the same grammar and three grammatical test
   sentences. Use ``timeit`` to log the amount of time each
   parser takes on the same sentence (Section sec-timing_).  Write a function which runs all
   three parsers on all three sentences, and prints a 3-by-3 grid of
   times, as well as row and column totals. Discuss your findings.

#. |soso| Read up on "garden path" sentences.  How might the computational
   work of a parser relate to the difficulty humans have with
   processing these sentences?
   ``http://en.wikipedia.org/wiki/Garden_path_sentence``

#. |hard| **Left-corner parser:** Develop a left-corner parser
   based on the recursive descent parser, and inheriting from ``ParseI``.
   (Note, this exercise requires knowledge of Python classes, covered
   in Chapter chap-advanced-programming_.)

#. |hard| Extend NLTK's shift-reduce parser to incorporate backtracking, so
   that it is guaranteed to find all parses that exist (i.e. it is `complete`:dt:).

----------
Conclusion
----------

We began this chapter talking about confusing encounters with grammar
at school.  We just wrote what we wanted to say, and our work was
handed back with red marks showing all our grammar mistakes.  If this
kind of "grammar" seems like secret knowledge, the linguistic approach
we have taken in this chapter is quite the opposite: grammatical
structures are made explicit as we build trees on top of sentences.
We can write down the grammar productions, and parsers can build the
trees automatically.  This thoroughly objective approach
is widely referred to as `generative grammar`:dt:.

Note that we have only considered "toy grammars," small grammars that
illustrate the key aspects of parsing.  But there is an obvious
question as to whether the general approach can be scaled up to cover
large corpora of natural languages. How hard would it be to construct
such a set of productions by hand? In general, the answer is: *very
hard*. Even if we allow ourselves to use various formal devices that
give much more succinct representations of grammar productions (some
of which will be discussed in Chapter chap-advanced-parsing_), it is still extremely
difficult to keep control of the complex interactions between the many
productions required to cover the major constructions of a
language. In other words, it is hard to modularize grammars so that
one portion can be developed independently of the other parts. This in
turn means that it is difficult to distribute the task of grammar
writing across a team of linguists. Another difficulty is that as the
grammar expands to cover a wider and wider range of constructions,
there is a corresponding increase in the number of analyses which are
admitted for any one sentence. In other words, ambiguity increases
with coverage.

Despite these problems, there are a number of large collaborative
projects which have achieved interesting and impressive results in
developing rule-based grammars for several languages. Examples are the
Lexical Functional Grammar (LFG) Pargram project
(http://www2.parc.com/istl/groups/nltt/pargram/), the Head-Driven
Phrase Structure Grammar (HPSG) LinGO Matrix framework
(http://www.delph-in.net/matrix/), and the Lexicalized Tree Adjoining
Grammar XTAG Project (http://www.cis.upenn.edu/~xtag/).

---------------
Summary (notes)
---------------

* Sentences have internal organization, or constituent structure,
  which can be represented using a tree; notable features of constituent
  structure are: recursion, heads, complements, modifiers

* A grammar is a compact characterization of a potentially infinite set of sentences;
  we say that a tree is well-formed according to a grammar, or that a grammar licenses a tree.

* Syntactic ambiguity arises when one sentence has more than one syntactic structure
  (e.g. prepositional phrase attachment ambiguity).

* A parser is a procedure for finding one or more trees corresponding to a grammatically
  well-formed sentence.

* A simple top-down parser is the recursive descent parser (summary, problems)

* A simple bottom-up parser is the shift-reduce parser (summary, problems)

* It is difficult to develop a broad-coverage grammar...

---------------
Further Reading
---------------

There are many introductory books on syntax. [O'Grady1989LI]_ is a
general introduction to linguistics, while [Radford1988TG]_ provides a
gentle introduction to transformational grammar, and can be
recommended for its coverage of transformational approaches to
unbounded dependency constructions. 

[BurtonRoberts1997AS]_ is very practically oriented textbook on how to
analyse constituency in English, with extensive exemplification and
exercises. [Huddleston2002CGE]_ provides an up-to-date and comprehensive analysis of
syntactic phenomena in English.


* LALR(1)

* Marcus parser

* Lexical Functional Grammar (LFG) 

  * `Pargram project <http://www2.parc.com/istl/groups/nltt/pargram/>`_

  * `LFG Portal <http://www.essex.ac.uk/linguistics/lfg/>`_

* Head-Driven Phrase Structure Grammar (HPSG) `LinGO Matrix framework <http://www.delph-in.net/matrix/>`_

* Lexicalized Tree Adjoining Grammar `XTAG Project <http://www.cis.upenn.edu/~xtag/>`_

.. include:: footer.txt


