.. -*- mode: rst -*-
.. include:: ../definitions.txt

.. _engineering:

========================
12. Language Engineering
========================

This chapter will cover evaluation, and the trade-offs between methods
that create large vs small models (e.g. n-gram tagging vs Brill tagging).

.. TODO: professional issues, ethics, reliability
.. TODO: spell checking example	
.. TODO: confusion matrix
.. TODO: lexical chaining for text segmentation, or WSD
.. TODO: introduce terminology
   - knowledge engineering approach (rule-based; hand-crafted)
   - machine learning approach (learn rules from annotated corpora)
   - hybrid: rules <-> features
   - refer back to introduction
.. TODO: architectures: pipeline/cascade vs blackboard

---------------------
Problems with Tagging
---------------------

* what context is relevant?
* how best to combine taggers?
* sensitivity to lexical identity?
* n-gram tagging produces large, uninterpretable models;
  cf. Brill tagging, which produces smaller, linguistically interpretable models

.. Discussion of how statistical methods have been used to
   reach a linguistically interpretable, symbolic result.
   Contrast between n-gram and Brill tagger about whether
   we can learn anything from inspecting the model itself
   (n-gram data vs transformational rules).

Exercises
---------

#. |hard| **Tagger context**:

   N-gram taggers choose a tag for a token based on its text and the
   tags of the *n-1* preceding tokens. This is a common context to use
   for tagging, but certainly not the only possible context.

   a) Construct a new tagger, sub-classed from ``SequentialTagger``, that
      uses a different context. If your tagger's context contains
      multiple elements, then you should combine them in a
      tuple. Some possibilities for elements to include are: (i)
      the current word or a previous word; (ii) the length
      of the current word text or of the previous word; (iii)
      the first letter of the current word or the previous word;
      or (iv) the previous tag.  Try to choose
      context elements that you believe will help the tagger decide which
      tag is appropriate. Keep in mind the trade-off between more
      specific taggers, with accurate results on the contexts they
      cover, and more general taggers, with broader coverage.  Combine
      your tagger with other taggers using the backoff method.

   #) How does the combined tagger's accuracy compare to the basic tagger? 

   #) How does the combined tagger's accuracy compare to the combined taggers you
      created in the previous exercise? 

#. |hard| **Reverse sequential taggers**:
   Since sequential taggers tag tokens in order, one at a time, they
   can only use the predicted tags to the *left* of the current token
   to decide what tag to assign to a token. But in some cases, the
   *right* context may provide more useful information than the left
   context.  A reverse sequential tagger starts with the last word of the
   sentence and, proceeding in right-to-left order, assigns tags to words
   on the basis of the tags it has already predicted to the right.
   By reversing texts at appropriate times, we can use NLTK's existing
   sequential tagging classes to perform reverse sequential
   tagging: reverse the training text before training the tagger;
   and reverse the text being tagged both before and after.

   a) Use this technique to create a bigram reverse sequential tagger.

   #) Measure its accuracy on a tagged section of the Brown corpus. Be
      sure to use a different section of the corpus for testing than you
      used for training.

   #) How does its accuracy compare to a left-to-right bigram
      tagger, using the same training data and test data?
        
#. |hard| **Alternatives to backoff**:
   Create a new kind of tagger that combines several taggers using
   a mechanism other than backoff (e.g. voting).  For robustness
   in the face of unknown words, include a regexp tagger, a unigram
   tagger that removes a small number of prefix or suffix characters
   until it recognizes a word, or an n-gram tagger that does not
   consider the text of the token being tagged.
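
One such combination scheme, majority voting, can be sketched without
NLTK.  In the sketch below each tagger is simply a function from
(tokens, position) to a tag or ``None``; the ``vote_tag`` helper and the
toy lexicons are invented for illustration, and a real solution would
wrap NLTK tagger objects instead.

```python
from collections import Counter

def vote_tag(tokens, taggers):
    """Tag each token with the majority vote of several taggers."""
    tagged = []
    for i, word in enumerate(tokens):
        votes = [t(tokens, i) for t in taggers]
        votes = [v for v in votes if v is not None]   # abstentions don't count
        tag, _ = Counter(votes).most_common(1)[0]
        tagged.append((word, tag))
    return tagged

# Three toy taggers: two tiny unigram lookups and a default tagger.
big_lexicon   = {'the': 'at', 'cat': 'nn', 'sat': 'vbd'}
small_lexicon = {'the': 'at', 'sat': 'vbd'}
taggers = [
    lambda toks, i: big_lexicon.get(toks[i]),
    lambda toks, i: small_lexicon.get(toks[i]),
    lambda toks, i: 'nn',                        # default: everything is nn
]

result = vote_tag(['the', 'cat', 'sat'], taggers)
print(result)   # [('the', 'at'), ('cat', 'nn'), ('sat', 'vbd')]
```

Unlike backoff, every tagger gets a say on every token, so a bad vote
from the default tagger can be outvoted by the more informed ones.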
        


------------------
Evaluating Taggers
------------------

As we experiment with different taggers, it is important to have an
objective performance measure.  Fortunately, we already have manually
verified training data (the original tagged corpus), so we can use
that to score the accuracy of a tagger, and to perform systematic error analysis.

Scoring Accuracy
----------------

Consider the sentence from the Brown Corpus in Table evaluating-taggers_.
The 'Gold Standard' tags from the corpus are given in the second column, while
the tags assigned by a unigram tagger appear in the third column.
Two mistakes made by the unigram tagger are italicized.

.. table:: evaluating-taggers

   ===============  =============  ==============
   Sentence         Gold Standard  Unigram Tagger
   ===============  =============  ==============
   The              at             at
   President        nn-tl          nn-tl
   said             vbd            vbd
   he               pps            pps
   will             md             md
   ask              vb             vb
   Congress         np             np
   to               to             to
   increase         vb             *nn*
   grants           nns            nns
   to               in             *to*
   states           nns            nns
   for              in             in
   vocational       jj             jj
   rehabilitation   nn             nn
   .                .              .
   ===============  =============  ==============

   Evaluating Taggers
 
The tagger correctly tagged 14 out of 16 words, so it gets a score of
14/16, or 87.5%.  Of course, accuracy should be judged on the basis of
a larger sample of data.  NLTK provides a function called
``tag.accuracy`` to automate the task.  In the simplest case, we
can test the tagger using the same data it was trained on:

    >>> from nltk_lite import tag
    >>> from nltk_lite.corpora import brown
    >>> from itertools import islice
    >>> train_sents = list(islice(brown.tagged(), 500))  # sents 0..499
    >>> unigram_tagger = tag.Unigram()
    >>> unigram_tagger.train(train_sents)
    >>> acc = tag.accuracy(unigram_tagger, train_sents)
    >>> print 'Accuracy = %4.1f%%' % (100 * acc)
    Accuracy = 79.6%

However, testing a language processing system over its training data is unwise.
A system which simply memorized the training data would get a perfect
score without doing any linguistic modeling.  Instead, we would like
to reward systems that make good generalizations, so we should test
against *unseen data*, and replace ``train_sents`` above with
``unseen_sents``.  We can then define the two sets of data as follows:

    >>> train_sents  = list(brown.tagged('a'))[:500]
    >>> unseen_sents = list(brown.tagged('a'))[500:600] # sents 500-599

Now we train the tagger using ``train_sents`` and evaluate it using
``unseen_sents``, as follows:

    >>> unigram_tagger = tag.Unigram(backoff=tag.Default('nn'))
    >>> unigram_tagger.train(train_sents)
    >>> acc = tag.accuracy(unigram_tagger, unseen_sents)
    >>> print 'Accuracy = %4.1f%%' % (100 * acc)
    Accuracy = 74.6%

The accuracy scores produced by this evaluation method are lower, but
they give a more realistic picture of the performance of the tagger.
Note that the performance of any statistical tagger is highly
dependent on the quality of its training set. In particular, if the
training set is too small, it will not be able to reliably estimate
the most likely tag for each word. Performance will also suffer if the
training set is significantly different from the texts we wish to tag.

In the process of developing a tagger, we can use the accuracy score
as an objective measure of the improvements made to the system.
Initially, the accuracy score will rise quickly as we fix obvious
shortcomings of the tagger.  After a while, however, improvements
become harder to find, and smaller.

Baseline Performance
--------------------

It is difficult to interpret an accuracy score in isolation.  For
example, is a person who scores 25% in a test likely to know a quarter
of the course material?  If the test is made up of 4-way multiple
choice questions, then this person has not performed any better than
chance.  Thus, it is clear that we should *interpret* an accuracy
score relative to a *baseline*.  The choice of baseline is somewhat
arbitrary, but it usually corresponds to minimal knowledge about the
domain.

In the case of tagging, a possible baseline score can be found by
tagging every word with ``nn``, the most likely tag.

    >>> baseline_tagger = tag.Default('nn')
    >>> acc = tag.accuracy(baseline_tagger, brown.tagged('a'))
    >>> print 'Accuracy = %4.1f%%' % (100 * acc)
    Accuracy = 13.1%

Unfortunately this is not a very good baseline.  There are many
high-frequency words which are not nouns.  Instead we could use the standard
unigram tagger to get a baseline of 75%.  However, this does not seem
fully legitimate: the unigram's model covers all words seen during
training, which hardly seems like 'minimal knowledge'.  Instead, let's
only permit ourselves to store tags for the most frequent words.

The first step is to identify the most frequent words in the corpus,
and for each of these words, identify the most likely tag:

    >>> from nltk_lite.probability import *
    >>> wordcounts = FreqDist()
    >>> wordtags = ConditionalFreqDist()
    >>> for sent in brown.tagged('a'):
    ...     for (w,t) in sent:
    ...         wordcounts.inc(w)    # count the word
    ...         wordtags[w].inc(t)   # count the word's tag
    >>> frequent_words = wordcounts.sorted_samples()[:1000]

Now we can create a lookup table (a dictionary) which maps words to
likely tags, just for these high-frequency words.  We can then define
a new baseline tagger which uses this lookup table:

    >>> table = dict((word, wordtags[word].max()) for word in frequent_words)
    >>> baseline_tagger = tag.Lookup(table, tag.Default('nn'))
    >>> acc = tag.accuracy(baseline_tagger, brown.tagged('a'))
    >>> print 'Accuracy = %4.1f%%' % (100 * acc)
    Accuracy = 72.5%

This, then, would seem to be a reasonable baseline score for a
tagger.  When we build new taggers, we will only credit ourselves for
performance exceeding this baseline.

Error Analysis
--------------

While the accuracy score is certainly useful, it does not tell us how
to improve the tagger.  For this we need to undertake error analysis.
For instance, we could construct a *confusion matrix*, with a row and
a column for every possible tag, and entries that record how often a
word with tag *T*\ :subscript:`i` is incorrectly tagged as *T*\ :subscript:`j`.
Another approach is to analyze the context of the errors, which is what
we do now.
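
The confusion matrix idea can be sketched with an ordinary counter over
(gold tag, assigned tag) pairs.  The two mini tag sequences below are
invented, echoing the two error types from the evaluation table above.

```python
from collections import Counter

def confusion_matrix(gold, test):
    """Count how often a gold-standard tag g was (mis)tagged as t."""
    return Counter(zip(gold, test))

# Invented gold and test tag sequences for one short sentence.
gold = ['at', 'nn', 'vbd', 'to', 'vb', 'nns', 'in']
test = ['at', 'nn', 'vbd', 'to', 'nn', 'nns', 'to']

matrix = confusion_matrix(gold, test)
# Off-diagonal entries are errors, e.g. a vb tagged as nn,
# or an in tagged as to:
print(matrix[('vb', 'nn')], matrix[('in', 'to')])   # 1 1
```

Sorting the off-diagonal entries by count gives a ranked error report,
in the same spirit as the context-based analysis that follows.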

Consider the following program, which catalogs all errors
along with the tag on the left and their frequency of occurrence.

    >>> errors = {}
    >>> for i in range(len(unseen_sents)):
    ...     raw_sent = tag.untag(unseen_sents[i])
    ...     test_sent = list(unigram_tagger.tag(raw_sent))
    ...     unseen_sent = unseen_sents[i]
    ...     for j in range(len(test_sent)):
    ...         if test_sent[j][1] != unseen_sent[j][1]:
    ...             test_context = test_sent[j-1:j+1]
    ...             gold_context = unseen_sent[j-1:j+1]
    ...             if None not in test_context:
    ...                 pair = (tuple(test_context), tuple(gold_context))
    ...                 if pair not in errors:
    ...                     errors[pair] = 0
    ...                 errors[pair] += 1

The ``errors`` dictionary has keys
of the form ``((t1,t2),(g1,g2))``, where ``(t1,t2)`` are the test
tags, and ``(g1,g2)`` are the gold-standard tags.  The values in the
``errors`` dictionary are simple counts of how often each error
occurred.  With some further processing, we construct the list
``counted_errors`` containing tuples consisting of counts and errors,
and then do a reverse sort to get the most significant errors first:

    >>> counted_errors = [(errors[k], k) for k in errors.keys()]
    >>> counted_errors.sort()
    >>> counted_errors.reverse()
    >>> for err in counted_errors[:5]:
    ...     print err
    (32, ((), ()))
    (5, ((('the', 'at'), ('Rev.', 'nn')),
         (('the', 'at'), ('Rev.', 'np'))))
    (5, ((('Assemblies', 'nn'), ('of', 'in')),
         (('Assemblies', 'nns-tl'), ('of', 'in-tl'))))
    (4, ((('of', 'in'), ('God', 'nn')),
         (('of', 'in-tl'), ('God', 'np-tl'))))
    (3, ((('to', 'to'), ('form', 'nn')),
         (('to', 'to'), ('form', 'vb'))))

The fifth entry of this output records the fact that there were 3 cases
where the unigram tagger mistakenly tagged a verb as a noun, following
the word `to`:lx:.  (We encountered the inverse of this mistake for the word
`increase`:lx: in the above evaluation table, where the unigram tagger tagged
`increase`:lx: as a verb instead of a noun since it occurred more often
in the training data as a verb.)  Here, when `form`:lx: appears
after the word `to`:lx:, it is invariably a verb.  Evidently, the performance
of the tagger would improve if it was modified to consider not just
the word being tagged, but also the tag of the word on the left.  Such
taggers are known as bigram taggers, and we consider them in the next section.

Exercises
---------

#. |soso| **Evaluating a Unigram Tagger**:
   Apply our evaluation methodology to the unigram tagger developed in
   the previous section.  Discuss your findings.
          

-------------------------
Sparse Data and Smoothing
-------------------------

[Introduction to NLTK's support for statistical estimation.]


----------------
The Brill Tagger
----------------

A potential issue with n-gram taggers is the size of their n-gram
table (or language model).  If tagging is to be employed in a variety
of language technologies deployed on mobile computing devices, it is
important to strike a balance between model size and tagger
performance.  An n-gram tagger with backoff may store trigram and
bigram tables, large sparse arrays which may have hundreds of millions
of entries.

A second issue concerns context.  The only information an n-gram
tagger considers from prior context is tags, even though words
themselves might be a useful source of information.  It is simply
impractical for n-gram models to be conditioned on the identities of
words in the context.  In this section we examine Brill tagging,
a statistical tagging method which performs very well using models
that are only a tiny fraction of the size of n-gram taggers.

Brill tagging is a kind of *transformation-based learning*.  The
general idea is very simple: guess the tag of each word, then go back
and fix the mistakes.  In this way, a Brill tagger successively
transforms a bad tagging of a text into a better one.  As with n-gram
tagging, this is a *supervised learning* method, since we need
annotated training data.  However, unlike n-gram tagging, it does
not count observations but compiles a list of transformational
correction rules.

The process of Brill tagging is usually explained by analogy with
painting.  Suppose we were painting a tree, with all its details of
boughs, branches, twigs and leaves, against a uniform sky-blue
background.  Instead of painting the tree first then trying to paint
blue in the gaps, it is simpler to paint the whole canvas blue, then
"correct" the tree section by over-painting the blue background.  In
the same fashion we might paint the trunk a uniform brown before going
back to over-paint further details with even finer brushes.  Brill
tagging uses the same idea: begin with broad brush strokes then fix up
the details, with successively finer changes.  Table brill-tagging_
illustrates this process, first tagging with the unigram tagger, then
fixing the errors.

.. table:: brill-tagging

   ==============  =======  ========  =================================  =============================
   Sentence:       Gold:    Unigram:  Replace ``nn`` with ``vb``         Replace ``to`` with ``in``
                                      when the previous word is ``to``   when the next tag is ``nns`` 
   ==============  =======  ========  =================================  =============================
   The             at       at
   President       nn-tl    nn-tl
   said            vbd      vbd
   he              pps      pps
   will            md       md
   ask             vb       vb
   Congress        np       np
   to              to       to
   increase        vb       *nn*      *vb*
   grants          nns      nns
   to              in       *to*      *to*                               *in*
   states          nns      nns
   for             in       in
   vocational      jj       jj
   rehabilitation  nn       nn
   ==============  =======  ========  =================================  =============================

   Steps in Brill Tagging
 
In this table we see two rules.  All such rules are generated from a
template of the following form: "replace *T*\ :subscript:`1` with
*T*\ :subscript:`2` in the context *C*".  Typical contexts are the
identity or the tag of the preceding or following word, or the
appearance of a specific tag within 2-3 words of the current word.  During
its training phase, the tagger guesses values for *T*\ :subscript:`1`,
*T*\ :subscript:`2` and *C*, to create thousands of candidate rules.
Each rule is scored according to its net benefit: the
number of incorrect tags that it corrects, less the number of correct
tags it incorrectly modifies.  This process is best illustrated by a
listing of the output from the NLTK Brill tagger (here run on tagged
Wall Street Journal text from the Penn Treebank).

::

  Loading tagged data...
  Training unigram tagger: [accuracy: 0.820940]
  Training Brill tagger on 37168 tokens...
 
  Iteration 1: 1482 errors; ranking 23989 rules;
    Found: "Replace POS with VBZ if the preceding word is tagged PRP"
    Apply: [changed 39 tags: 39 correct; 0 incorrect]
 
  Iteration 2: 1443 errors; ranking 23662 rules;
    Found: "Replace VBP with VB if one of the 3 preceding words is tagged MD"
    Apply: [changed 36 tags: 36 correct; 0 incorrect]
 
  Iteration 3: 1407 errors; ranking 23308 rules;
    Found: "Replace VBP with VB if the preceding word is tagged TO"
    Apply: [changed 24 tags: 23 correct; 1 incorrect]
 
  Iteration 4: 1384 errors; ranking 23057 rules;
    Found: "Replace NN with VB if the preceding word is to"
    Apply: [changed 67 tags: 22 correct; 45 incorrect]
    ...
  Iteration 20: 1138 errors; ranking 20717 rules;
    Found: "Replace RBR with JJR if one of the 2 following words is tagged NNS"
    Apply: [changed 14 tags: 10 correct; 4 incorrect]
 
  Iteration 21: 1128 errors; ranking 20569 rules;
    Found: "Replace VBD with VBN if the preceding word is tagged VBD"
  [insufficient improvement; stopping]
 
  Brill accuracy: 0.835145

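The net-benefit calculation described above can be sketched in a few
lines.  The rule format (replace *t1* with *t2* when the preceding tag
is *c*) and the toy data are simplified stand-ins for NLTK's Brill
implementation, which considers many more context templates.

```python
def score_rule(current, gold, t1, t2, prev_tag):
    """Net benefit of the rule 'replace t1 with t2 when the
    preceding tag is prev_tag': tags fixed minus tags broken."""
    fixed = broken = 0
    for i in range(1, len(current)):
        if current[i] == t1 and current[i-1] == prev_tag:
            if gold[i] == t2:
                fixed += 1       # the rule corrects a wrong tag
            elif gold[i] == t1:
                broken += 1      # the rule spoils a correct tag
    return fixed - broken

# Toy data mirroring the 'to increase' example: nn after to should be vb.
current = ['pps', 'md', 'vb', 'to', 'nn', 'nns']
gold    = ['pps', 'md', 'vb', 'to', 'vb', 'nns']

print(score_rule(current, gold, 'nn', 'vb', 'to'))   # 1: one tag fixed
print(score_rule(current, gold, 'vb', 'nn', 'md'))   # -1: one tag broken
```

Training repeatedly picks the highest-scoring rule, applies it, and
rescans, exactly as in the iteration log above.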

Brill taggers have another interesting property: the rules are
linguistically interpretable.  Compare this with the n-gram taggers,
which employ a potentially massive table of n-grams.  We cannot learn
much from direct inspection of such a table, in comparison to the
rules learned by the Brill tagger.

.. TODO: saving a Brill tagger to a file, reloading

Exercises
---------

1. |easy| Try the Brill tagger demonstration, as follows::

    from nltk_lite.tag import brill
    brill.demo()

#. |soso| Consult the documentation for the demo function, using ``help(brill.demo)``.
   Experiment with the tagger by setting different values for the parameters.
   Is there any trade-off between training time (corpus size) and performance?

#. |hard| Inspect the diagnostic files created by the tagger, ``rules.out`` and
   ``errors.out``.  Obtain the demonstration code (``nltk_lite/tag/brill.py``)
   and create your own version of the Brill tagger.

   a) Delete some of the rule templates, based on what you learned
      from inspecting ``rules.out``.

   b) Add some new rule templates which employ contexts that might help
      to correct the errors you saw in ``errors.out``.

(We are grateful to Christopher Maloof for developing NLTK's Brill
tagger, and Trevor Cohn for developing NLTK's HMM tagger.)

--------------
The HMM Tagger
--------------

[Overview of NLTK's HMM tagger.]

------------------------
Evaluating Chunk Parsers
------------------------

An easy way to evaluate a chunk parser is to take some already chunked
text, strip off the chunks, rechunk it, and compare the result with
the original chunked text.  The ``ChunkScore.score()`` function takes
the correctly chunked sentence as its first argument, and the newly
chunked version as its second argument, and compares them.  It reports
the fraction of actual chunks that were found (recall), the fraction
of hypothesized chunks that were correct (precision), and a combined
score, the F-measure (the harmonic mean of precision and recall).
    
A number of different metrics can be used to evaluate chunk parsers.
We will concentrate on a class of metrics that can be derived from two
sets:

* **guessed**: The set of chunks returned by the chunk parser. 
* **correct**: The correct set of chunks, as defined in the test corpus.

The evaluation method we will use comes from the field of information retrieval,
and considers the performance of a document retrieval system.  We will set up
an analogy between the correct set of chunks and a user's so-called "information need",
and between the set of returned chunks and a system's returned documents.  Consider
the following diagram.

.. figure:: ../images/precision-recall.png
   :scale: 70

   True and False Positives and Negatives

Together, these sets and their complements define four regions: the
true positives (TP), true negatives (TN), false positives (FP), and
false negatives (FN).  Two standard measures are
*precision*, the fraction of guessed chunks that were correct TP/(TP+FP),
and *recall*, the fraction of correct chunks that were identified
TP/(TP+FN).  A third measure, the *F measure*, is the harmonic mean
of precision and recall, i.e. 1/(0.5/Precision + 0.5/Recall).
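
These measures can be computed directly from the two sets of chunks.
The chunks below, represented as tuples of words, are invented for
illustration; the sketch assumes at least one true positive so the
harmonic mean is defined.

```python
def prf(guessed, correct):
    """Precision, recall and F-measure from two sets of chunks."""
    tp = len(guessed & correct)                  # true positives
    precision = tp / len(guessed)
    recall = tp / len(correct)
    f = 1 / (0.5 / precision + 0.5 / recall)     # harmonic mean
    return precision, recall, f

# Invented chunk sets: one guessed chunk is correct, two are wrong,
# and one correct chunk was missed.
guessed = {('the', 'little', 'cat'), ('the',), ('a', 'mat')}
correct = {('the', 'little', 'cat'), ('the', 'mat')}
p, r, f = prf(guessed, correct)
print('Precision: %.1f%%  Recall: %.1f%%  F-Measure: %.1f%%'
      % (100 * p, 100 * r, 100 * f))
```

Note that the F-measure sits between precision and recall, and is
dragged toward the lower of the two.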

.. Note: Note that these metrics do not assign any credit for chunks
   that are "almost" right (e.g., chunks that extend one word too long).
   It would be possible to design metrics that do assign partial credit
   for such cases, they would be more complex.  We decided to keep our
   metrics simple, so that it is easy to understand what a given result
   means.

During evaluation of a chunk parser, it is useful to flatten a chunk
structure into a tree consisting only of a root node and leaves:

    >>> from nltk_lite import chunk
    >>> correct = chunk.tagstr2tree(
    ...    "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]")
    >>> correct.flatten()
    (S: ('the', 'DT') ('little', 'JJ') ('cat', 'NN') ('sat', 'VBD')
    ('on', 'IN') ('the', 'DT') ('mat', 'NN'))

We run a chunker over this flattened data, and compare the
resulting chunked sentences with the originals, as follows:

    >>> from nltk_lite.chunk import *
    >>> chunkscore = ChunkScore()
    >>> rule = ChunkRule('<PRP|DT|POS|JJ|CD|N.*>+',
    ...              "Chunk items that often occur in NPs")
    >>> chunkparser = RegexpChunk([rule], chunk_node='NP')
    >>> guess = chunkparser.parse(correct.flatten())
    >>> chunkscore.score(correct, guess)
    >>> print chunkscore
    ChunkParse score:
        Precision: 100.0%
        Recall:    100.0%
        F-Measure: 100.0%

``ChunkScore`` is a class for scoring chunk parsers.  It can be used
to evaluate the output of a chunk parser, using precision, recall,
f-measure, missed chunks, and incorrect chunks.  It can also be used
to combine the scores from the parsing of multiple texts.  This is
quite useful if we are parsing a text one sentence at a time.  The
following program listing shows a typical use of the ``ChunkScore``
class.  In this example, ``chunkparser`` is being tested on each
sentence from the Wall Street Journal tagged files.

    >>> from itertools import islice
    >>> from nltk_lite.corpora import treebank
    >>> rule = ChunkRule('<DT|JJ|NN>+', "Chunk sequences of DT, JJ, and NN")
    >>> chunkparser = RegexpChunk([rule], chunk_node='NP')
    >>> chunkscore = ChunkScore()
    >>> for chunk_struct in islice(treebank.chunked(), 10):
    ...     test_sent = chunkparser.parse(chunk_struct.flatten())
    ...     chunkscore.score(chunk_struct, test_sent)
    >>> print chunkscore
    ChunkParse score:
        Precision:  48.6%
        Recall:     34.0%
        F-Measure:  40.0%

The overall results of the evaluation can be viewed by printing the
``ChunkScore``.  Each evaluation metric is also returned by an
accessor method: ``precision()``, ``recall()``, ``f_measure()``,
``missed()``, and ``incorrect()``.  The ``missed()`` and ``incorrect()``
methods can be especially useful when trying to improve the
performance of a chunk parser.  Here are the missed chunks:

.. doctest-ignore::
    >>> from random import shuffle
    >>> missed = chunkscore.missed()
    >>> shuffle(missed)
    >>> print missed[:10]
    [(('A', 'DT'), ('Lorillard', 'NNP'), ('spokeswoman', 'NN')),
     (('even', 'RB'), ('brief', 'JJ'), ('exposures', 'NNS')),
     (('its', 'PRP$'), ('Micronite', 'NN'), ('cigarette', 'NN'), ('filters', 'NNS')),
     (('30', 'CD'), ('years', 'NNS')),
     (('workers', 'NNS'),),
     (('preliminary', 'JJ'), ('findings', 'NNS')),
     (('Medicine', 'NNP'),),
     (('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP')),
     (('its', 'PRP$'), ('Micronite', 'NN'), ('cigarette', 'NN'), ('filters', 'NNS')),
     (('researchers', 'NNS'),)]

Here are the incorrect chunks:

.. doctest-ignore::
    >>> incorrect = chunkscore.incorrect()
    >>> shuffle(incorrect)
    >>> print incorrect[:10]
    [(('New', 'JJ'), ('York-based', 'JJ')),
     (('Micronite', 'NN'), ('cigarette', 'NN')),
     (('a', 'DT'), ('forum', 'NN'), ('likely', 'JJ')),
     (('later', 'JJ'),),
     (('preliminary', 'JJ'),),
     (('New', 'JJ'), ('York-based', 'JJ')),
     (('resilient', 'JJ'),),
     (('group', 'NN'),),
     (('the', 'DT'),),
     (('Micronite', 'NN'), ('cigarette', 'NN'))]

.. Note: By default, only the first 100 missed chunks and the first
   100 incorrect chunks will be remembered by the ``ChunkScore``.  You
   can tell ``ChunkScore`` to record more chunk examples with the
   ``max_fp_examples`` (maximum false positive examples) and the
   ``max_fn_examples`` (maximum false negative examples) keyword
   arguments to the ``ChunkScore`` constructor:

    >>> chunkscore = ChunkScore(max_fp_examples=1000,
    ...                         max_fn_examples=1000)

As we saw with tagging, we need to interpret the performance scores
for a chunker relative to a baseline.  Perhaps the most naive chunking
method is to decide, for each tag in the training data, whether it
occurs more often inside or outside a chunk.  We can do this easily
using a chunked corpus and a conditional frequency distribution as
shown below:

    >>> from nltk_lite.probability import ConditionalFreqDist
    >>> from nltk_lite.parse import Tree
    >>> import re
    >>> cfdist = ConditionalFreqDist()
    >>> chunk_data = list(treebank.chunked())
    >>> split = len(chunk_data)*9/10
    >>> train, test = chunk_data[:split], chunk_data[split:]
    >>> for chunk_struct in train:
    ...     for constituent in chunk_struct:
    ...         if isinstance(constituent, Tree):
    ...             for (word, tag) in constituent.leaves():
    ...                 cfdist[tag].inc(True)
    ...         else:
    ...             (word, tag) = constituent
    ...             cfdist[tag].inc(False)
  
    >>> chunk_tags = [tag for tag in cfdist.conditions() if cfdist[tag].max() == True]
    >>> chunk_tags = [re.sub(r'(\W)', r'\\\1', tag) for tag in chunk_tags]
    >>> tag_pattern = '<' + '|'.join(chunk_tags) + '>+'
    >>> print 'Chunking:', tag_pattern
    Chunking: <PRP\$|VBG\|NN|POS|WDT|JJ|WP|DT|\#|\$|NN|FW|PRP|NNS|NNP|LS|PDT|RBS|CD|EX|WP\$|NNPS|JJS|JJR>+

Now, in the evaluation phase we chunk any sequence of those tags:

    >>> rule = ChunkRule(tag_pattern, 'Chunk any sequence involving commonly chunked tags')
    >>> chunkparser = RegexpChunk([rule], chunk_node='NP')
    >>> chunkscore = ChunkScore()
    >>> for chunk_struct in test:
    ...     test_sent = chunkparser.parse(chunk_struct.flatten())
    ...     chunkscore.score(chunk_struct, test_sent)
    >>> print chunkscore
    ChunkParse score:
        Precision:  90.7%
        Recall:     94.0%
        F-Measure:  92.3%


Exercises
---------

#. |soso| **Chunker Evaluation:**
   Carry out the following evaluation tasks for
   any of the chunkers you have developed earlier.
   (Note that most chunking corpora contain some internal
   inconsistencies, such that any reasonable rule-based approach
   will produce errors.)

   a) Evaluate your chunker on 100 sentences from a chunked corpus,
      and report the precision, recall and F-measure.
   b) Use the ``chunkscore.missed()`` and ``chunkscore.incorrect()``
      methods to identify the errors made by your chunker.  Discuss.
   c) Compare the performance of your chunker to the baseline chunker
      discussed in the evaluation section of this chapter.

#. |hard| **Transformation-Based Chunking:**
   Apply the n-gram and Brill tagging methods to IOB chunk tagging.
   Instead of assigning POS tags to words, here we will assign IOB tags
   to the POS tags.  E.g. if the tag ``DT`` (determiner) often occurs
   at the start of a chunk, it will be tagged ``B`` (begin).  Evaluate
   the performance of these chunking methods relative to the regular
   expression chunking methods covered in this chapter.

.. TODO: amplify following example

#. |hard| **Statistically Improbable Phrases:**
   Design an algorithm to find the statistically improbable
   phrases of a document collection.
   http://www.amazon.com/gp/search-inside/sipshelp.html


----------------------
Information Extraction
----------------------

`Information Extraction`:dt: is the task of converting `unstructured
data`:dt: (e.g., unrestricted text) or `semi-structured data`:dt:
(e.g., web pages marked up with |HTML|) into `structured data`:dt:
(e.g., tables in a relational database). For example, let's suppose we
are given a text containing the fragment ie1_, and let's also suppose we
are trying to find pairs of entities *X* and *Y* that stand in the
relation 'organization *X* is located in location *Y*'.

.. _ie1:
.. ex:: ... said William Gale, an economist at the Brookings Institution,
        the research group in Washington. 

|nopar| As a result of processing this text, we should be able to add
the pair |langle|\ *Brookings Institution*, *Washington*\ |rangle| to
this relation. As we will see shortly, Information Extraction proceeds
on the assumption that we are only looking for specific sorts of
information, and these have been decided in advance. This limitation
has been a necessary concession to allow the robust processing of
unrestricted text.
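
To make the target relation concrete, here is a toy sketch that pulls
the pair out of fragment ie1_ with a single hand-written regular
expression.  The pattern is a made-up illustration of the rule-based
extreme, tuned to this one sentence; it is not a real Information
Extraction system.

```python
import re

text = ("... said William Gale, an economist at the Brookings "
        "Institution, the research group in Washington.")

# A hypothetical surface pattern: 'at the X, the ... group in Y',
# where X is the organization and Y the location.
pattern = re.compile(r'at the (\w+(?: \w+)*), the .*? group in (\w+)')
m = pattern.search(text)
print((m.group(1), m.group(2)))   # ('Brookings Institution', 'Washington')
```

Such brittle patterns are exactly why the robust, statistical approach
described below treats Named Entity Recognition as a classification
task rather than a matter of hand-written rules.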

Potential applications of Information Extraction are
many, and include business intelligence, resume harvesting, media
analysis, sentiment detection, patent search, and email scanning. A
particularly important area of current research involves the attempt
to extract structured data out of electronically-available scientific
literature, most notably in the domain of biology and medicine. 

Information Extraction is usually broken down into at least two major
steps: `Named Entity Recognition`:dt: and `Relation
Extraction`:dt:. Named Entities (|NE|\ s) are usually taken to be noun phrases
that denote specific types of individuals such as organizations,
persons, dates, and so on. Thus, we
might use the following |XML| annotations to mark-up the |NE|\ s in ie1_:

.. _ie2:
.. ex:: ... said <ne type='PERSON'>William Gale</ne>, an economist at
        the <ne type='ORGANIZATION'>Brookings Institution</ne>,
        the research group in <ne type='LOCATION'>Washington</ne>. 
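
Annotations in this inline format are easy to read back out
programmatically. As a rough sketch (assuming well-formed ``<ne>`` tags;
the regular expression and variable names here are illustrative only):

```python
import re

# Pull (type, text) pairs out of inline <ne type='...'>...</ne> markup.
NE_PATTERN = re.compile(r"<ne type='(?P<type>\w+)'>(?P<text>.*?)</ne>")

annotated = ("... said <ne type='PERSON'>William Gale</ne>, an economist at "
             "the <ne type='ORGANIZATION'>Brookings Institution</ne>, "
             "the research group in <ne type='LOCATION'>Washington</ne>.")

entities = [(m.group("type"), m.group("text"))
            for m in NE_PATTERN.finditer(annotated)]
# entities == [('PERSON', 'William Gale'),
#              ('ORGANIZATION', 'Brookings Institution'),
#              ('LOCATION', 'Washington')]
```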

How do we go about identifying |NE|\ s? Our first thought might be
that we could look up candidate expressions in an appropriate list of
names. For example, in the case of locations, we might  try using a
resource such as the `Alexandria Gazetteer
<http://www.alexandria.ucsb.edu/gazetteer/>`_. Depending on the nature
of our input data, this may be adequate |mdash| such a gazetteer is likely
to have good coverage of international cities and many locations in
the U.S.A., but will probably be missing the names of obscure villages
in remote regions. However, a list of names for people or organizations
will probably have poor coverage. New organizations, and new names for
them, are coming into existence every day, so if we are trying to deal
with contemporary newswire or blog entries, say, it is unlikely that
we will be able to recognize many of the |NE|\ s by using gazetteer
lookup. 
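
Gazetteer lookup itself is straightforward; the hard part is coverage. Here
is a naive sketch using a toy location list (a real gazetteer would hold many
thousands of entries, and multi-word names would need extra handling):

```python
# Toy gazetteer; a resource like the Alexandria Gazetteer is far larger.
GAZETTEER = {"Washington", "New York", "Hull"}

def tag_locations(tokens):
    """Mark a token as a LOCATION candidate if it appears in the
    gazetteer; everything else gets the 'outside' tag O."""
    return [(tok, "LOCATION" if tok in GAZETTEER else "O") for tok in tokens]

tagged = tag_locations("the research group in Washington".split())
# [('the', 'O'), ('research', 'O'), ('group', 'O'),
#  ('in', 'O'), ('Washington', 'LOCATION')]
```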

A second consideration is that many |NE| terms are ambiguous. Thus
`May`:lx: and `North`:lx: are likely to be parts of |NE|\ s for DATE
and LOCATION, respectively, but could both be part of a PERSON |NE|;
conversely `Christian Dior`:lx: looks like a PERSON |NE| but is more
likely to be of type ORGANIZATION. A term like `Yankee`:lx: will be an
ordinary modifier in some contexts, but will be marked as an |NE| of
type ORGANIZATION in the phrase `Yankee infielders`:lx:. To summarize,
we cannot reliably detect |NE|\ s by looking them up in a gazetteer,
and it is also hard to develop rules that will correctly recognize
ambiguous |NE|\ s on the basis of their context of
occurrence. Although lookup may contribute to a solution, most
contemporary approaches to Named Entity Recognition treat it as a
statistical classification task that requires training data for good
performance. This task is facilitated by adopting an appropriate data
representation, such as the IOB tags which we saw being deployed in
the |CoNLL| chunk data (Chapter chap-chunk_). For example, here are a
few representative lines from the |CoNLL| 2002 (``conll2002``) Dutch
training data::

    Eddy N B-PER
    Bonte N I-PER
    is V O
    woordvoerder N O
    van Prep O
    diezelfde Pron O
    Hogeschool N B-ORG
    . Punc O

|nopar| As noted before, in this representation, there is one token
per line, each with its part-of-speech tag and its |NE| tag.  When
|NE|\ s have been identified in a text, we then want to extract
relations that hold between them. As indicated earlier, we will
typically be looking for relations between specified types of
|NE|. One way of approaching this task is to initially look for all
triples of the form *X*, |alpha|, *Y*, where *X* and *Y* are |NE|\ s
of the required types, and |alpha| is the string of words that
intervenes between *X* and *Y*. We can then use regular expressions to
pull out just those instances of |alpha| that express the relation
that we are looking for. The following example searches for strings
that contain the word `in`:lx:. The special character expression
``(?!\b.+ing\b)`` is a negative lookahead condition which allows us to
disregard strings such as `success in supervising the transition
of`:lx:, where `in`:lx: is followed by a gerundive verb.

    >>> from nltk_lite.contrib.ieer_rels import *
    >>> from itertools import islice
    >>> IN = re.compile(r'.*\bin\b(?!\b.+ing\b)')
    >>> for r in islice(relextract('ORG', 'LOC', corpus='ieer', pattern=IN), 29, 39):
    ...     print show_tuple(r)
    [ORG: Cooper-Hewitt Museum] in [LOC: New York]
    [ORG: Canadian Museum of Civilization] in [LOC: Hull]
    [ORG: Kramerbooks and Afterwords] , an independent store in [LOC: Washington]
    [ORG: Waterford Foundation] in historic [LOC: Waterford]
    [ORG: U.S. District Court] in [LOC: Manhattan]
    [ORG: House Transportation Committee] , secured the most money in the [LOC: New York]
    [ORG: Angels] in [LOC: Anaheim]
    [ORG: Bayona] , which is in the heart of the [LOC: French Quarter]
    [ORG: CBS Studio Center] in [LOC: Studio City]
    [ORG: Smithsonian Institution] in [LOC: Washington]

|nopar| As you can see, although searching for `in`:lx: does a reasonably
good job, there is some inaccuracy in the output which is hard to
avoid |mdash| there is unlikely to be a simple string-based method of
excluding fillers such as `secured the most money in the`:lx:.
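
The *X*, |alpha|, *Y* strategy can also be sketched in plain Python,
independently of the ``nltk_lite`` helpers used above. The sketch below
first groups IOB-tagged tokens into entity chunks, then pairs up adjacent
entities of the required types whose intervening word string matches the
pattern (the function names and toy data are illustrative only):

```python
import re

# Toy tagged text: (token, NE-tag) pairs in the IOB style shown earlier.
TAGGED = [("Smithsonian", "B-ORG"), ("Institution", "I-ORG"),
          ("in", "O"), ("Washington", "B-LOC"), (".", "O")]

def chunk_entities(tagged):
    """Group B-/I- tagged tokens into (entity-type, text) tuples,
    leaving O-tagged tokens as plain strings."""
    chunks, current = [], None
    for tok, tag in tagged:
        if tag.startswith("B-"):
            current = [tag[2:], tok]
            chunks.append(current)
        elif tag.startswith("I-") and current is not None:
            current[1] += " " + tok
        else:
            current = None
            chunks.append(tok)
    return [tuple(c) if isinstance(c, list) else c for c in chunks]

# Same pattern as above: 'in' not followed by a gerundive verb.
IN = re.compile(r".*\bin\b(?!\b.+ing\b)")

def relextract_sketch(subj_type, obj_type, chunks, pattern):
    """Yield (X, filler, Y) triples where X and Y are adjacent entities
    of the required types and the intervening words match the pattern."""
    for i, chunk in enumerate(chunks):
        if not (isinstance(chunk, tuple) and chunk[0] == subj_type):
            continue
        filler = []
        for later in chunks[i + 1:]:
            if isinstance(later, tuple):
                if later[0] == obj_type and pattern.match(" ".join(filler)):
                    yield (chunk[1], " ".join(filler), later[1])
                break
            filler.append(later)

rels = list(relextract_sketch("ORG", "LOC", chunk_entities(TAGGED), IN))
# [('Smithsonian Institution', 'in', 'Washington')]
```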

As shown above, the ``conll2002`` corpus contains not just |NE|
annotation but also part-of-speech tags. In principle, this allows us
to devise patterns which are sensitive to these tags.

    >>> vnv = """
    ... (
    ... is/V|
    ... was/V|
    ... werd/V|
    ... wordt/V
    ... )
    ... .*
    ... van/Prep
    ... """
    >>> VAN = re.compile(vnv, re.VERBOSE)
    >>> for r in relextract('PER', 'ORG', corpus='conll2002-ned', pattern = VAN): 
    ...     print show_tuple(r)



Exercises
---------


.. TODO: IE evaluations:
   
   TIPSTER (1991-1998) incl MUC
   ACE (since 2000)
   - entity detection and recognition (extract mentions, group references to same entities)
   - relation detection and recognition
   - event detection
   - Arabic, Chinese, English
   GALE (since 2005)
   - Global Autonomous Language Exploitation
   - information extracted from multilingual input
   - from transcribed or translated text
   - identify information relevant to a user's query
   - so no pre-defined ontology (challenge for current IE systems)
   CoNLL (since 1997) -- bottom up initiative


-----------
Conclusions
-----------

[To be written]

---------------
Further Reading
---------------


Sekine -- 140 types of NE.

.. include:: footer.txt

