Deleted stuff with no home at present

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

<!--
  <section><title>Linguistics and Natural Language Processing (draft)</title>

  <para>
    [What is the relationship between linguistics and NLP?
    Goal of (generative) linguistics to account for the grammaticality
    judgements of the ideal monolingual speaker/hearer, vs goal of
    NLP to build systems to map between the (linguistic) systems of
    humans and machines.  Challenge of linguistics is to balance
    descriptive and explanatory adequacy; challenge of NLP to balance
    expressiveness and tractability.]
  </para>

    <para>
    [Grammar as a definition of well-formed
    sentences along with a semantic translation, versus
    an implementation which (say) maps from sentences to
    meanings (parser) or vice versa (generator).
    declarative vs procedural;
    system of rewriting rules vs automaton;
    perspective on NLP: relating the declarative to the procedural;
    distinguish this contrast from competence vs performance.]
    </para>

    <para>
    In the late 1980s and early 1990s there was a promising
    convergence between the fields of linguistics and NLP.  (This had
    been a feature of the 1960s, e.g. with the application of the SPE
    model in speech synthesis systems.)  Computational linguists often
    looked to linguistics as a source of knowledge about language.
    Over the last decade we have seen a new divergence, as
    computational linguists have discovered that linguistic analyses
    often failed to account for the linguistic patterns attested in
    the large corpora used to develop their systems.  However, once
    linguists learn to work with these large datasets, their own
    analytical work will benefit, leading to broader coverage of their
    theories, and earlier refutation of false hypotheses.  The result,
    we expect, will be new opportunities for cross-fertilization
    between linguistics and NLP.
    </para>

    <para>
    [Opportunities for linguists to contribute their insights to the
    future development of NLP and, in the reverse direction, to apply
    the results of NLP research back in linguistics.]
    </para>

  </section>
-->

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

<!--
 <para>
    In the following sections, we will give a more detailed account of
    the linguistic and practical issues that arise in the course of
    part-of-speech tagging, and then survey how tagging is carried out
    in NLTK. Before launching into this, however, we will give the
    reader a flavour of the uses of tagging. That is,
    we consider three kinds of language analysis where tags play
    an important role: parsing, morphological analysis, and
    stylistics.
  </para>

  <para>
    Most natural language parsers depend on <glossterm>part-of-speech
    tags</glossterm>.  Instead of writing rules like ``NP &rarr;
    the dog`` and ``NP &rarr; three red cars``,
    we can write ``NP &rarr; DT JJ* NN``.  In this way,
    the terminal symbols of the grammar can be word categories, instead
    of words, greatly reducing the size of the grammar.

  </para>



  <para>
    <glossterm>Morphological analysis</glossterm> is also assisted by part-of-speech tags.
    For instance, if we encounter the word ``deals``
    in running text, should this be analysed as the plural form of a
    noun, e.g., ``deal<subscript>N</subscript>+PL``
    or the third-person singular form of a verb, e.g.,
    ``deal<subscript>V</subscript>+3PS``?
    A tagger will consider the context in which this word appears,
    and will reliably determine whether it is a noun or a verb.
    Then the morphological analyser can be given either
    ``deals/NNS`` or ``deals/VBZ``
    to process.
  </para>
-->

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Simple approach: "delete one".  First define a generator function that
yields a series of strings, each one having a different character
deleted from the supplied form:

  >>> def delete_one(word):
  ...     for i in range(len(word)):
  ...         yield word[:i]+word[i+1:]

Next construct an index over all these forms:

  >>> idx = {}
  >>> for lex in lexemes:
  ...     for s in delete_one(lex):
  ...         if s not in idx:
  ...             idx[s] = set()
  ...         idx[s].add(lex)

Now we can define a lookup function:

  >>> def lookup(word):
  ...     candidates = set()
  ...     for s in delete_one(word):
  ...         if s in idx:
  ...             candidates.update(idx[s])
  ...     return candidates

Now we can test it out:

  >>> lookup('kokopouto')
  set(['kokopeoto', 'kokopuoto'])
  >>> lookup('kokou')
  set(['kokoa', 'kokeu', 'kokio', 'kooru', 'kokoi', 'kooku', 'kokoo'])

Note that this simple method only returns forms of the same length.
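
The same idea can be sketched in self-contained form for a current
Python.  The toy lexicon below is made up for illustration (the
``lexemes`` list used above is not shown in this section), and
``build_index`` is a helper name introduced here rather than anything
provided by NLTK.

```python
from collections import defaultdict

def delete_one(word):
    """Yield each string obtained by deleting one character from word."""
    for i in range(len(word)):
        yield word[:i] + word[i+1:]

def build_index(lexemes):
    """Map every one-deletion form to the set of lexemes that produce it."""
    idx = defaultdict(set)
    for lex in lexemes:
        for s in delete_one(lex):
            idx[s].add(lex)
    return idx

def lookup(word, idx):
    """Return lexemes that share a one-deletion form with word."""
    candidates = set()
    for s in delete_one(word):
        candidates.update(idx.get(s, ()))
    return candidates

# Hypothetical toy lexicon, for illustration only.
lexemes = ['kokoa', 'kokeu', 'kaku', 'kokopeoto']
idx = build_index(lexemes)
print(sorted(lookup('kokou', idx)))   # ['kokeu', 'kokoa']
```

Because one character is deleted both from the query and from each
lexeme, a match is only possible between forms of equal length, which
is why ``kaku`` is never proposed for ``kokou``.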

#. Write a spelling correction function which, given a word of length
   ``i``, can return candidate corrections of length ``i-1``, ``i``,
   or ``i+1``.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



---------------
NLTK Interfaces
---------------

An *interface* gives a partial specification of the behavior of a
class, including specifications for methods that the class should
implement.  For example, a "comparable" interface might specify that a
class must implement a comparison method.  Interfaces do not give a
complete specification of a class; they only specify a minimum set of
methods and behaviors which should be implemented by the class.  For
example, the ``TaggerI`` interface specifies that a tagger class must
implement a ``tag`` method, which takes a ``string``, and returns a
tuple, consisting of that string and its part-of-speech tag; but it
does not specify what other methods the class should implement (if
any).

.. note:: The notion of "interfaces" can be very useful in ensuring that
   different classes work together correctly.  Although the concept of
   "interfaces" is supported in many languages, such as Java, there is no
   native support for interfaces in Python.

NLTK therefore implements interfaces using classes, all of whose
methods raise the ``NotImplementedError`` exception.  To distinguish
interfaces from other classes, they are always named with a trailing
``I``.  If a class implements an interface, then it should be a
subclass of the interface.  For example, the ``Ngram`` tagger class
implements the ``TaggerI`` interface, and so it is a subclass of
``TaggerI``.
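
The convention just described can be sketched as follows.  This is a
minimal illustration rather than NLTK's actual source: ``TaggerI``
here is a cut-down stand-in, and ``ConstantTagger`` is a hypothetical
class invented for the example.

```python
class TaggerI:
    """Interface: a tagger must provide a tag() method."""

    def tag(self, token):
        # Interface methods carry no behavior of their own.
        raise NotImplementedError()


class ConstantTagger(TaggerI):
    """Hypothetical tagger that assigns the same tag to every string."""

    def __init__(self, tag):
        self._tag = tag

    def tag(self, token):
        # Return a tuple of the string and its part-of-speech tag.
        return (token, self._tag)


tagger = ConstantTagger('nn')
print(tagger.tag('dog'))             # ('dog', 'nn')
print(isinstance(tagger, TaggerI))   # True
```

Calling ``TaggerI().tag('dog')`` directly raises
``NotImplementedError``, which signals that a subclass forgot to
supply the method.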


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



.. 
  A piece to be relocated
  -----------------------

  Many natural language expressions are ambiguous, and we need to draw
  on other sources of information to aid interpretation.  For instance,
  our preferred interpretation of `fruit flies like a
  banana`:lx: depends on the presence of contextual cues that cause
  us to expect `flies`:lx: to be a noun or a verb.  Before
  we can even address such issues, we need to be able to represent the
  required linguistic information.  Here is a possible representation:


  =========  =========  ========  =====  ==========
  ``Fruit``  ``flies``  ``like``  ``a``  ``banana``
  noun       verb       prep      det    noun
  =========  =========  ========  =====  ==========

  =========  =========  ========  =====  ==========
  ``Fruit``  ``flies``  ``like``  ``a``  ``banana``
  noun       noun       verb      det    noun
  =========  =========  ========  =====  ==========

  Most language processing systems must recognize and interpret the
  linguistic structures that exist in a sequence of words.  This task is
  virtually impossible if all we know about each word is its text
  representation.  To determine whether a given string of words has the
  structure of, say, a noun phrase, it is infeasible to check through a
  (possibly infinite) list of all strings which can be classed as noun
  phrases.  Instead we want to be able to generalize over `classes`:dt: of
  words. These word classes are commonly given labels such as
  'determiner', 'adjective' and 'noun'.  Conversely, to interpret words
  we need to be able to discriminate between different usages, such as
  ``deal`` as a noun or a verb.  

  We earlier presented two interpretations of `Fruit flies like a
  banana`:lx: as examples of how a string of word tokens can be augmented
  with information about the word classes that the words belong to. In
  effect, we carried out tagging for the string `fruit flies like a
  banana`:lx:. However, tags are more usually attached inline to the text
  they are associated with. 
	

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


.. ===================== UNUSED =====================

   Some of the following material might belong in other chapters

   Programming?
   ------------
   The ``nltk_lite.corpora`` package provides ready access to several
   corpora included with NLTK, along with built-in tokenizers.  For
   example, ``brown.raw()`` is an iterator over sentences from
   the Brown Corpus.  We use ``extract()`` to extract a sentence of
   interest:
   
     >>> from nltk_lite.corpora import brown, extract
     >>> print extract(0, brown.raw('a'))
     ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
   
   Old intro material
   ------------------
   How do we know that a piece of text is a *word*, and how do we represent
   words and associated information in a machine?  It might seem
   needlessly picky to ask what a word is. Can't we just say that a word
   is a string of characters which has white space before and after it?
   However, it turns out that things are quite a bit more complex. To get
   a flavour of the problems, consider the following text from the Wall
   Street Journal::
   
     Let's start with the string `aren't`:lx:. According to our naive
     definition, it counts as only one word. But consider a situation where
     we wanted to check whether all the words in our text occurred in a
     dictionary, and our dictionary had entries for `are`:lx: and `not`:lx:,
     but not for `aren't`:lx:.  In this case, we would probably be happy to
     say that `aren't`:lx: is a contraction of two distinct words.
   
   
   .. We can make a similar point about `1992's`:lx:. We might want to run
      a small program over our text to extract all words which express
      dates. In this case, we would achieve more generality by first
      stripping except in this case, we would not expect to find
      `1992`:lx: in a dictionary.
   
   If we take our naive definition of word literally (as we should, if we
   are thinking of implementing it in code), then there are some other
   minor but real problems. For example, assuming our file consists of a
   number of separate lines, as in the WSJ text, then all the
   words which come at the beginning of a line will fail to be preceded
   by whitespace (unless we treat the newline character as a
   whitespace). Second, according to our criterion, punctuation symbols
   will form part of words; that is, a string like `investors,`:lx: will
   also count as a word, since there is no whitespace between
   `investors`:lx: and the following comma. Consequently, we run the risk
   of failing to recognise that `investors,`:lx: (with appended comma) is a
   token of the same type as `investors`:lx: (without appended comma). More
   importantly, we would like punctuation to be a "first-class citizen"
   for tokenization and subsequent processing. For example, we might want
   to implement a rule which says that a word followed by a period is
   likely to be an abbreviation if the immediately following word has a
   lowercase initial. However, to formulate such a rule, we must be able
   to identify a period as a token in its own right.
   
   A slightly different challenge is raised by examples such as the
   following (drawn from the MedLine corpus):
   
   #. This is a alpha-galactosyl-1,4-beta-galactosyl-specific adhesin.
   
   #. The corresponding free cortisol fractions in these sera were 4.53
      +/- 0.15% and 8.16 +/- 0.23%, respectively.
   
   In these cases, we encounter terms which are unlikely to be found in
   any general purpose English lexicon. Moreover, we will have no success
   in trying to syntactically analyse these strings using a standard
   grammar of English. Now for some applications, we would like to
   "bundle up" expressions such as
   `alpha-galactosyl-1,4-beta-galactosyl-specific adhesin`:lx: and `4.53
   +/- 0.15%`:lx: so that they are presented as unanalysable atoms to the
   parser. That is, we want to treat them as single "words" for the
   purposes of subsequent processing.  The upshot is that, even if we
   confine our attention to English text, the question of what we treat
   as a word may depend a great deal on what our purposes are.
   
   Representing tokens
   -------------------
   When written language is stored in a computer file it is normally
   represented as a sequence or *string* of characters.  That is, in a
   standard text file, individual words are strings, sentences are
   strings, and indeed the whole text is one long string. The characters
   in a string don't have to be just the ordinary alphanumerics; strings
   can also include special characters which represent space, tab and
   newline.
   
   Most computational processing is performed above the level of
   characters.  In compiling a programming language, for example, the
   compiler expects its input to be a sequence of tokens that it knows
   how to deal with; for example, the classes of identifiers, string
   constants and numerals.  Analogously, a parser will expect its input
   to be a sequence of word tokens rather than a sequence of individual
   characters.  At its simplest, then, tokenization of a text involves
   searching for locations in the string of characters containing
   whitespace (space, tab, or newline) or certain punctuation symbols,
   and breaking the string into word tokens at these points.  For
   example, suppose we have a file containing the following two lines::
   
     The cat climbed
     the tree.
   
   From the parser's point of view, this file is just a string of
   characters::
   
     'The_cat_climbed\nthe_tree.'
   
   Note that we use single quotes to delimit strings, "_" to represent
   space and ``\n`` to represent newline.
   
   As we just pointed out, to tokenize this text for consumption by the
   parser, we need to explicitly indicate which substrings are words. One
   convenient way to do this in Python is to split the string into a
   *list* of words, where each word is a string, such as
   `'dog'`:lx:. [#]_ 
   In Python, lists are printed as a series of objects
   (in this case, strings), surrounded by square brackets and separated
   by commas:
   
     >>> words = ['the', 'cat', 'climbed', 'the', 'tree']
     >>> words
     ['the', 'cat', 'climbed', 'the', 'tree']
   
   .. [#] We say "convenient" because Python makes it easy to iterate
          through a list, processing the items one by one.
   
   Notice that we have introduced a new variable `words`:lx: which is bound
   to the list, and that we entered the variable on a new line to check
   its value.
   
   To summarize, we have just illustrated how, at its simplest,
   tokenization of a text can be carried out by converting the single
   string representing the text into a list of strings, each of which
   corresponds to a word.
   
   Some of this could maybe be discussed in the programming chapter?
   ----------------------------------------------------------------
   Many natural language processing tasks involve analyzing texts of
   varying sizes, ranging from single sentences to very large corpora.
   There are a number of ways to represent texts using NLTK.  The
   simplest is as a single string.  These strings can be loaded directly
   from files:
   
     >>> text_str = open('corpus.txt').read() 
     >>> text_str
     'Hello world.  This is a test file.\n'
   
   However, as noted above, it is usually preferable to represent a text
   as a list of tokens.  These lists are typically created using a
   *tokenizer*, such as `tokenize.whitespace`:lx: which splits strings into
   words at whitespace:
   
     >>> from nltk_lite import tokenize
     >>> text = 'Hello world.  This is a test string.'
     >>> list(tokenize.whitespace(text))
     ['Hello', 'world.', 'This', 'is', 'a', 'test', 'string.']
   
   .. Note:: By "whitespace", we mean not only interword space, but
      also tab and line-end.
   
   Note that tokenization may normalize the text, mapping all words to lowercase,
   expanding contractions, and possibly even stemming the words.  An
   example of stemming is shown below:
   
        >>> text = 'stemming can be fun and exciting'
        >>> tokens = tokenize.whitespace(text)
        >>> porter = tokenize.PorterStemmer()
        >>> for token in tokens:
        ...     print porter.stem(token),
        stem can be fun and excit
   
   Tokenization based on whitespace is too simplistic for most
   applications; for instance, it fails to separate the last word of a
   phrase or sentence from punctuation characters, such as comma, period,
   exclamation mark and question mark.  As its name suggests,
   `tokenize.regexp`:lx: employs a regular expression to determine how text
   should be split up.  This regular expression specifies the characters
   that can be included in a valid word.  To define a tokenizer that
   includes punctuation as separate tokens, we could use:


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

------------
More Grammar
------------

In this final section, we return to the grammar of English.  We
consider more syntactic phenomena that will require us to refine
the productions of our phrase structure grammar.

Lexical heads other than `V`:gc: can be subcategorized for particular
complements:

**Nouns**
   #. The rumour *that Kim was bald* circulated widely.
   #. \*The picture *that Kim was bald* circulated widely.
**Adjectives**
   #. Lee was afraid *to leave*.
   #. \*Lee was tall *to leave*.

It has also been suggested that 'ordinary' prepositions are
transitive, and that many so-called adverbs are in fact intransitive
prepositions. For example, `towards`:lx: requires an `NP`:gc: complement,
while `home`:lx: and `forwards`:lx: forbid one.


.. example:: Lee ran towards the house.
.. example:: \*Lee ran towards.

.. example:: Sammy walked home.
.. example:: \*Sammy walked home the house.

.. example:: Brent stepped one pace forwards.
.. example:: \*Brent stepped one pace forwards the house.


Adopting this approach, we can also analyse certain prepositions as
allowing `PP`:gc: complements:

.. example:: Kim ran away *from the house*.
.. example:: Lee jumped down *into the boat*.
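
One way to make these subcategorization facts concrete is a lookup
table from heads to the complement sequences they allow.  The sketch
below is an illustration only: the ``SUBCAT`` table and the
``licenses`` function are hypothetical names, and the frames are read
off the examples above, with the added assumption that `away`:lx: and
`down`:lx: may also occur with no complement.

```python
# Hypothetical subcategorization frames: each head maps to the set of
# complement sequences it allows.  An empty tuple means "no complement".
SUBCAT = {
    'towards':  {('NP',)},
    'home':     {()},
    'forwards': {()},
    'away':     {(), ('PP',)},   # assumption: the PP complement is optional
    'down':     {(), ('PP',)},   # assumption: the PP complement is optional
}

def licenses(head, complements):
    """Return True if head allows exactly this sequence of complements."""
    return tuple(complements) in SUBCAT.get(head, set())

print(licenses('towards', ['NP']))   # True:  Lee ran towards the house.
print(licenses('towards', []))       # False: *Lee ran towards.
print(licenses('home', ['NP']))      # False: *Sammy walked home the house.
print(licenses('away', ['PP']))      # True:  Kim ran away from the house.
```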

In general, the lexical categories `V`:gc:, `N`:gc:, `A`:gc: and `P`:gc: are
taken to be the heads of the respective phrases `VP`:gc:, `NP`:gc:,
`AP`:gc: and `PP`:gc:. Abstracting over the identity of these phrases, we
can say that a lexical category `X`:gc: is the head of its immediate
`XP`:gc: phrase, and moreover that the complements `C`:subscript:`1`
... `C`:subscript:`n` of
`X`:gc: will occur as sisters of `X`:gc: within that `XP`:gc:. This is
illustrated in the following schema:

.. ex::
  .. tree:: (XP (X) (*C_1*) ... (*C_n*))

We have argued that lexical categories need to be subdivided into
subcategories to account for the fact that different lexical items
select different sequences of following complements. That is, it is a
distinguishing property of complements that they co-occur with some
lexical items but not others. By contrast, :dt:`modifiers` can
occur with almost any instance of the relevant lexical class. For
example, consider the temporal adverbial *last Thursday*:

.. example:: The woman gave the telescope to the dog last Thursday.
.. example:: The woman saw a man last Thursday.
.. example:: The dog barked last Thursday.

Moreover, modifiers are always optional, whereas complements are at
least sometimes obligatory. We can use the phrase structure
geometry to draw a structural distinction between complements, which
occur as sisters of the lexical head, versus modifiers, which occur as
sisters of the phrase which encloses the head:

.. ex::
  .. tree:: (XP (XP (X) (*C_1*) ... (*C_n*)) (*Mod*))
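
The structural distinction in this schema can be read off a tree
mechanically.  The following sketch assumes a very simple tree
encoding (nested tuples whose first element is the node label); the
encoding and the function names are choices made for this
illustration, not an NLTK data structure.

```python
# A tree is a tuple (label, child, ...); a leaf is a plain string.

def label(node):
    """Return the label of a tree node, or the string itself for a leaf."""
    return node[0] if isinstance(node, tuple) else node

def inner_xp(node):
    """Return the XP child of node, or None if node is the innermost XP."""
    for child in node[1:]:
        if label(child) == 'XP':
            return child
    return None

def modifiers(xp):
    """Collect the sisters of each enclosed XP (the adjoined modifiers)."""
    mods = []
    while inner_xp(xp) is not None:
        mods.extend(c for c in xp[1:] if label(c) != 'XP')
        xp = inner_xp(xp)
    return mods

def complements(xp):
    """Collect the sisters of the lexical head X in the innermost XP."""
    while inner_xp(xp) is not None:
        xp = inner_xp(xp)
    return [c for c in xp[1:] if label(c) != 'X']

tree = ('XP',
        ('XP', ('X', 'gave'), ('C', 'the telescope'), ('C', 'to the dog')),
        ('Mod', 'last Thursday'))

print(complements(tree))   # [('C', 'the telescope'), ('C', 'to the dog')]
print(modifiers(tree))     # [('Mod', 'last Thursday')]
```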

Exercises
---------

#. Pick some of the syntactic constructions described in any
   introductory syntax text (e.g. Jurafsky and Martin, Chapter 9) and
   create a set of 15 sentences.  Five sentences should be unambiguous
   (have a unique parse), five should be ambiguous, and a further five
   should be ungrammatical.

  a) Develop a small grammar, consisting of about ten syntactic
     productions, to account for this data.  Refine your set of sentences
     as needed to test and demonstrate the grammar.  Write a function
     to demonstrate your grammar on three sentences: (i) a
     sentence having exactly one parse; (ii) a sentence having more than
     one parse; (iii) a sentence having no parses.  Discuss your
     observations using inline comments.

  b) Create a list ``words`` of all the words in your lexicon, and use
     ``random.choice(words)`` to generate sequences of 5-10 randomly
     selected words.  Does this generate any grammatical sentences which
     your grammar rejects, or any ungrammatical sentences which your
     grammar accepts?  Now use this information to help you improve your
     grammar.  Discuss your findings.
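
As a starting point for part (b), random word sequences can be
produced along the following lines.  The lexicon shown is a
placeholder to be replaced with the words of your own grammar, and
``random_sequence`` is a helper name introduced for this sketch.

```python
import random

# Hypothetical lexicon; substitute the words from your own grammar.
words = ['the', 'a', 'dog', 'man', 'telescope', 'saw', 'barked', 'gave', 'to']

def random_sequence(words, rng, min_len=5, max_len=10):
    """Return a list of between min_len and max_len randomly chosen words."""
    length = rng.randint(min_len, max_len)
    return [rng.choice(words) for _ in range(length)]

rng = random.Random(0)   # fixed seed so that runs are repeatable
for _ in range(3):
    print(' '.join(random_sequence(words, rng)))
```

Each generated sequence can then be passed to your parser to see
whether the grammar accepts it.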


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

NLTK consists of a set of Python *modules*, each of which defines
classes and functions related to a single data structure or task.
Before you can use a module, you must ``import`` its contents.  The
simplest way to import the contents of a module is to use the ``from
module import *`` command.  For example, to import the contents of the
``nltk_lite.utilities`` module, which is discussed in this chapter, type:

  >>> from nltk_lite.utilities import *
  >>>

A disadvantage of this style of import statement is that it does not
specify which objects are imported, so some of the imported names
may unintentionally clash with names already in use.  To avoid this
disadvantage, you can explicitly list the objects you wish to import.
For example, as we saw earlier, we can import the ``re_show`` function
from the ``nltk_lite.utilities`` module as follows:

  >>> from nltk_lite.utilities import re_show
  >>>

Another option is to import the module itself, rather than
its contents.  Its contents can then be accessed
using *fully qualified* dotted names:

  >>> from nltk_lite import utilities
  >>> sent = 'colorless green ideas sleep furiously'
  >>> utilities.re_show('green', sent)
  colorless {green} ideas sleep furiously
  >>>
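
The two explicit import styles can be compared side by side using a
standard library module.  This sketch uses Python's ``re`` module as
a stand-in, since it is always available; ``show_matches`` is a
hypothetical helper that imitates the output shown above and is not
NLTK's ``re_show``.

```python
# Style 1: import selected names from a module.
from re import sub

# Style 2: import the module itself and use fully qualified names.
import re

def show_matches(pattern, text):
    """Wrap each match of pattern in braces (a stand-in, not NLTK's re_show)."""
    return re.sub(pattern, lambda m: '{' + m.group() + '}', text)

sent = 'colorless green ideas sleep furiously'
print(show_matches('green', sent))   # colorless {green} ideas sleep furiously
print(sub is re.sub)                 # True: both names refer to one function
```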

For more information about importing, see any Python textbook.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



We can also access the tagged text using the ``brown.tagged()`` function:

  >>> print extract(0, brown.tagged())
  [('The', 'at'), ('Fulton', 'np-tl'), ('County', 'nn-tl'), ('Grand', 'jj-tl'),
   ('Jury', 'nn-tl'), ('said', 'vbd'), ('Friday', 'nr'), ('an', 'at'),
   ('investigation', 'nn'), ('of', 'in'), ("Atlanta's", 'np$'), ('recent', 'jj'),
   ('primary', 'nn'), ('election', 'nn'), ('produced', 'vbd'), ('``', '``'),
   ('no', 'at'), ('evidence', 'nn'), ("''", "''"), ('that', 'cs'),
   ('any', 'dti'), ('irregularities', 'nns'), ('took', 'vbd'), ('place', 'nn'),
   ('.', '.')]
  >>>
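
Since each tagged token is a ``(word, tag)`` tuple, ordinary list
processing is enough to work with the result.  The sketch below copies
the first few tokens from the output above into a literal list, so it
runs without the corpus; the variable names are chosen for this
illustration.

```python
from collections import Counter

# The first few tagged tokens from the output shown above.
tagged = [('The', 'at'), ('Fulton', 'np-tl'), ('County', 'nn-tl'),
          ('Grand', 'jj-tl'), ('Jury', 'nn-tl'), ('said', 'vbd'),
          ('Friday', 'nr'), ('an', 'at'), ('investigation', 'nn')]

# Strip the tags to recover the plain words.
words = [word for (word, tag) in tagged]

# Pick out the tokens whose tag begins with 'nn'.
nn_words = [word for (word, tag) in tagged if tag.startswith('nn')]

# Count how often each tag occurs.
tag_counts = Counter(tag for (word, tag) in tagged)

print(nn_words)           # ['County', 'Jury', 'investigation']
print(tag_counts['at'])   # 2
```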

NLTK includes a 10% fragment of the Wall Street Journal section
of the Penn Treebank.  This can be accessed using ``treebank.raw()``
for the raw text, ``treebank.tagged()`` for the tagged text, and
``treebank.parsed()`` for the parsed text, as shown here:

  >>> from nltk_lite.corpora import treebank
  >>> print extract(0, treebank.parsed())
  (S:
    (NP-SBJ:
      (NP: (NNP: 'Pierre') (NNP: 'Vinken'))
      (,: ',')
      (ADJP: (NP: (CD: '61') (NNS: 'years')) (JJ: 'old'))
      (,: ','))
    (VP:
      (MD: 'will')
      (VP:
        (VB: 'join')
        (NP: (DT: 'the') (NN: 'board'))
        (PP-CLR:
          (IN: 'as')
          (NP: (DT: 'a') (JJ: 'nonexecutive') (NN: 'director')))
        (NP-TMP: (NNP: 'Nov.') (CD: '29'))))
    (.: '.'))
  >>>

NLTK contains some simple chatbots, which will try to talk
intelligently with you.  You can access the famous Eliza
chatbot using ``from nltk_lite.chat import eliza``, then
run ``eliza.demo()``.  The other chatbots are called
``iesha`` (teen anime talk),
``rude`` (insulting talk), and
``zen`` (gems of Zen wisdom),
and were contributed by other students who have used NLTK.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Predicting the Next Word (Revisited)
------------------------------------

We begin by creating an empty conditional frequency distribution:


    >>> from nltk_lite.corpora import genesis
    >>> from nltk_lite.probability import ConditionalFreqDist
    >>> cfdist = ConditionalFreqDist()

We then examine each token in the corpus, and increment the
appropriate sample's count.  We use the variable ``prev`` to record
the previous word.

    >>> prev = None
    >>> for word in genesis.raw():
    ...     cfdist[prev].inc(word)
    ...     prev = word

.. Note:: Sometimes the context for an experiment is unavailable, or
   does not exist.  For example, the first token in a text does not
   follow any word.  In these cases, we must decide what context to
   use.  For this example, we use ``None`` as the context for the
   first token.  Another option would be to discard the first token.

Once we have constructed a conditional frequency distribution for the
training corpus, we can use it to find the most likely word for any
given context. For example, taking the word `living`:lx: as our context,
we can inspect all the words that occurred in that context.

    >>> word = 'living'
    >>> cfdist[word].samples()
    ['creature,', 'substance', 'soul.', 'thing', 'thing,', 'creature']

We can set up a simple loop to generate text: we set an initial
context, pick the most likely token in that context as our next
word, and then use that word as our new context:

    >>> word = 'living'
    >>> for i in range(20):
    ...     print word,
    ...     word = cfdist[word].max()
    living creature that he said, I will not be a wife of the land
    of the land of the land


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

