.. -*- mode: rst -*-
.. include:: ../definitions.txt

.. _chap-words:

=========================================
3. Words: The Building Blocks of Language
=========================================

.. TODO: use UHDR corpus instead of Genesis for wordlength plot
.. TODO: simplified tagset (N, V, J, A) -- WordNet
.. TODO: cleaning web text
.. TODO: more on regular expressions, including ()

------------
Introduction
------------

|nopar|
Language can be divided up into pieces of varying sizes, ranging from
morphemes to paragraphs.  In this chapter we will focus on words, a
very important level for much work in |NLP|.  Just what are words, and
how should we represent them in a machine?  These may seem like
trivial questions, but it turns out that there are some important
issues involved in defining and representing words.

In the following sections, we will explore the division of text into
words; the distinction between types and tokens; sources of text data
including files, the web, and linguistic corpora; accessing these
sources using Python and NLTK; stemming and normalization;
WordNet; and a variety of useful programming tasks involving words.

-----------------------
Tokens, Types and Texts
-----------------------

In Chapter chap-introduction_, we showed how a string could be split into a
list of words.  Once we have derived a list, the ``len()`` function
will count the number of words for us:

    >>> sentence = "This is the time -- and this is the record of the time."
    >>> words = sentence.split()
    >>> len(words)
    13

|nopar| This process of segmenting a string of characters into words
is known as `tokenization`:dt:. Tokenization is a prelude to pretty
much everything else we might want to do in |NLP|, since it tells our
processing software what our basic units are. We will discuss
tokenization in more detail shortly.


We also pointed out that we could compile a list of the unique
vocabulary items in a string by using ``set()`` to eliminate
duplicates:

    >>> len(set(words))
    10

|nopar| So if we ask how many words there are in ``sentence``, we get two
different answers, depending on whether we count duplicates or not. 
Clearly we are using different senses of 'word' here. To help
distinguish between them, let's introduce two terms:
`token`:dt: and `type`:dt:.  A word token is an individual occurrence
of a word in a concrete context; it exists in time and space.  A word
type is more abstract; it's what we're talking about when we
say that the three occurrences of ``the`` in ``sentence`` are 'the same word'.

Something similar to a type/token distinction is reflected in the following
snippet of Python:

    >>> words[2]
    'the'
    >>> words[2] == words[8]
    True
    >>> words[2] is words[8]
    False
    >>> words[2] is words[2]
    True

|nopar| The operator ``==`` tests whether two expressions are equal, and in
this case, it is testing for string-identity. This is the notion of
identity that was assumed by our use of ``set()`` above. By contrast, the
``is`` operator tests whether two objects are stored in the same
location of memory, and is therefore analogous to
token-identity.  

In effect, when we used ``split()`` above to turn a string into a list
of words, our tokenization method was to say that any strings which
are delimited by whitespace count as a word token. But this simple
approach doesn't always lead to the results we want. Moreover,
string-identity doesn't always give us a useful criterion for
assigning tokens to types. We therefore need to address two questions
in more detail:

*Tokenization*: 
  Which substrings of the original text should be treated as word tokens?
 
*Type definition*: 
  How do we decide whether two tokens have the same type?

To see the problems with our first stab at defining tokens and types
in ``sentence``, let's look more closely at what is contained in ``set(words)``:

    >>> set(words)
    set(['and', 'this', 'record', 'This', 'of', 'is', '--', 'time.',
    'time', 'the'])

|nopar|
The first point to note is that ``'time'`` and ``'time.'`` come out as
distinct tokens, and of necessity, distinct types, since the trailing
period has been bundled up with the rest of the word into a single
token. Second, we might argue that although ``'--'`` is some kind of
token, it isn't really a `word`:em: token. Third, we would probably
want to say that ``'This'`` and ``'this'`` are not distinct types,
since capitalization should be ignored.
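All three problems can be patched up with a little normalization before
we form the set of types. The following is a minimal sketch in plain
Python (not an NLTK function): it lowercases each token and strips any
non-word characters, discarding tokens such as ``'--'`` that have no
word characters left.

```python
import re

sentence = "This is the time -- and this is the record of the time."

# Lowercase each token and strip non-word characters; drop tokens
# (like '--') with no word characters left after stripping.
types = set()
for token in sentence.split():
    normalized = re.sub(r'\W+', '', token.lower())
    if normalized:
        types.add(normalized)

print(sorted(types))
```

Now ``'This'``/``'this'`` and ``'time'``/``'time.'`` are conflated, and
``'--'`` is gone, leaving seven types instead of ten.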

.. 
   So now we have said that
   any string of characters delimited by white space counts as a
   word. Unfortunately, there are a number of problems with this simple
   approach, and we will describe a few of them here. To begin with, some
   of our tokens really shouldn't be counted as words.  For
   example, wsj0_  contains the following sentence:

   .. _wsj1:
   .. ex::
      .. parsed-literal::

	A spokesman for Temple estimated that Sea Containers' plan -- if all
	the asset sales materialize -- would result in shareholders
	receiving only $36 to $45 a share in cash.

   Our simple algorithm incorrectly treats each occurrence of ``'--'`` as a word.

   Conversely, there are times when we want to
   treat a character sequence that contains whitespace as a single
   word. For example, we might want our grammar to recognize strings
   like `New York`:lx:, or `John Smith`:lx: as proper nouns without any
   further decomposition.  Hyphenation adds to the
   complexity; sometimes it makes more sense to treat hyphenated strings
   as two words.  For example, in the expression `pro-New York`:lx:, we
   do not want to say that `pro-New`:lx: is a single word.

   .. I know I've seen better examples of this than "pro-New York, but
      I can't think of one right now..


   As an especially subtle example, consider the numeric expressions in
   the following sentence (drawn from the MedLine corpus):

   .. _tok1:
   .. ex::

       The corresponding free cortisol fractions in these sera were 4.53
       +/- 0.15% and 8.16 +/- 0.23%, respectively.

   Should we say that the numeric expression `4.53 +/- 0.15%`:lx: is three
   words?  Or should we say that it's a single compound word?  Or should
   we say that it is actually *nine* words, since it's read "four point
   five three, plus or minus fifteen percent"?  Or should we say that
   it's not a 'real' word at all, since it wouldn't appear in any
   dictionary?  The answer will depend on what task we're
   trying to solve.


If we turn to languages other than English, segmenting words
can be even more of a challenge. For example, in Chinese
orthography, characters correspond to monosyllabic morphemes. Many
morphemes are words in their own right, but many words contain more
than one morpheme. However, there is no visual representation of word
boundaries in Chinese text. For example, consider the following
three-character string: |ai4| |guo3| |ren2| (in pinyin plus tones:
ai4 'love' (verb), guo3 'country', ren2 'person'). This could
either be segmented as [|ai4| |guo3|] |ren2| |mdash|
'country-loving person' or as |ai4| [|guo3| |ren2|] |mdash| 'love
country-person'.
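To make the ambiguity concrete, here is a small sketch that enumerates
every segmentation of a character sequence licensed by a lexicon. The
lexicon is hypothetical, and pinyin syllables stand in for the
characters, for illustration only:

```python
# Enumerate every way of dividing a character sequence into lexicon words.
def segmentations(chars, lexicon):
    if not chars:
        return [[]]
    results = []
    for i in range(1, len(chars) + 1):
        word = tuple(chars[:i])
        if word in lexicon:
            for rest in segmentations(chars[i:], lexicon):
                results.append([word] + rest)
    return results

# Hypothetical lexicon, with pinyin syllables standing in for characters.
lexicon = {('ai4',), ('ren2',), ('ai4', 'guo3'), ('guo3', 'ren2')}
analyses = segmentations(['ai4', 'guo3', 'ren2'], lexicon)
print(analyses)
```

With this lexicon the sketch finds exactly the two analyses described
above: [ai4 guo3] ren2 and ai4 [guo3 ren2].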

.. 
   Counting Word Types
   -------------------

   As before, we will start with a simple approach: tokenize on any
   sequence of whitespace, and count the number of times each word type
   occurs.  This time we'll take the list of tokens returned by `text.split()`,
   convert it to a set, then convert it back to a list:

       >>> word_types = set(text.split())
       >>> sorted(word_types)
       ['$1.1', '$130', '$36', '$45', ..., 'with', 'would', 'yesterday,']
       >>> print len(word_types)
       255


.. 
   In addition to the issues we saw earlier with counting tokens, our
   algorithm's ignorance about punctuation can cause it to count the same
   type multiple times.  For example, the substrings ``'price.'`` and
   ``'price'`` (i.e., with and without a trailing period) will be treated
   as distinct types. When we were counting tokens, this didn't affect
   our overall answer, since both ``'price.'`` and ``'price'`` counted as
   a single token.  But since the two strings are not identical, our
   algorithm counts them as two separate types.  These tokenization
   issues could be addressed by defining a more sophisticated tokenizer.


   The type definition issues are exemplified by the fact that `the`:lx:
   and `The`:lx: are listed as distinct types.  Instead we would like to
   say that these tokens have the same type, and just happen to be
   written differently.  In other words, we would like to have a way to
   compare strings, ignoring distinctions such as case.  A more subtle
   question is whether the two tokens `asset`:lx: and `assets`:lx: should
   be considered to share the same type.  On the one hand, they would
   both be listed under the same entry in a dictionary; but on the other
   hand, there is a definite semantic difference between them. As with
   some of the questions about tokenization, our decision about whether
   to treat these two tokens as having the same type will depend on the
   problem we're trying to solve. Further along the spectrum, although
   the `likes`:lx: in tok3_ is string-identical to the `likes`:lx: in
   tok2_, the fact that it is a plural noun rather than verb would lead
   us to say that it is a distinct word type:

   .. _tok3:
   .. ex:: Your likes are quite unpredictable.

The terms *token* and *type* can also be applied to other linguistic
entities.  For example, a `sentence token`:dt: is an individual
occurrence of a sentence; but a `sentence type`:dt: is an abstract
sentence, without context. If I say the same sentence twice, I have
uttered two sentence tokens but only used one sentence type.  When the kind
of token or type is obvious from context, we will simply use the terms
token and type.

To summarize, although the type/token distinction is a useful
one, we cannot just say that two word tokens have the same type if they are
the same string of characters |mdash| we need to take into
consideration a number of other factors in determining what counts as
the same word. Moreover, we also need to be more careful in how we
identify tokens in the first place. 

Up till now, we have relied on getting our source texts by defining a string in
a fragment of Python code. However, this is an impractical approach
for all but the simplest of texts, and makes it hard to present
realistic examples. So how do we get larger chunks of
text into our programs?  In the rest of this section, we will see how
to extract text from files, from the web, and from the corpora
distributed with |NLTK|.

.. _reading-files:

Extracting text from files
--------------------------

.. Monkey-patching to fake the file/web examples in this section:

    >>> from StringIO import StringIO
    >>> def fake_open(filename, mode=None):
    ...     return StringIO('Hello World!\nThis is a test file.\n')
    >>> def fake_urlopen(url):
    ...     return StringIO(' BBC NEWS | News Front Page News Sport '
    ...                     'Weather World Service')
    >>> open = fake_open
    >>> import urllib
    >>> urllib.urlopen = fake_urlopen

It is easy to access local files in Python.  As an exercise, create a
file called ``corpus.txt`` using a text editor, and enter the
following text::

   Hello World!
   This is a test file.


|nopar| 
Be sure to save the file as plain text. You also need to make sure
that you have saved the file in the same directory or folder in which
you are running the Python interactive interpreter.

.. Note:: If you are using |IDLE|, you can easily create this file by
   selecting the *New Window* command in the *File* menu, typing in
   the required text into this window, and then saving the file as
   ``corpus.txt`` in the first directory that |IDLE| offers in the
   pop-up dialogue box.


|nopar| 
The next step is to `open`:dt: a file using the built-in function ``open()``,
which takes two arguments, the name of the file, here ``corpus.txt``, and the
mode to open the file with (``'r'`` means to open the file for reading,
and ``'U'`` stands for "Universal", which lets us ignore the different
conventions used for marking newlines).

.. doctest-ignore::
    >>> f = open('corpus.txt', 'rU')

.. Note:: If the interpreter cannot find your file, it will give an
   error like this:

   .. doctest-ignore::
      >>> f = open('corpus.txt', 'rU')
      Traceback (most recent call last):
	  File "<pyshell#7>", line 1, in -toplevel-
	  f = open('corpus.txt', 'rU')
      IOError: [Errno 2] No such file or directory: 'corpus.txt'

   To check that the file that you are trying to open is really in the
   right directory, use |IDLE|\ 's *Open* command in the *File* menu;
   this will display a list of all the files in the directory where
   |IDLE| is running. An alternative is to examine the current
   directory from within Python:

.. doctest-ignore::
      >>> import os
      >>> os.listdir('.')

|nopar| 
There are several ways to read the contents of the file.  The
following uses the method ``read()`` on the file object ``f``; this
reads the entire contents of the file into a string.

    >>> f.read() 
    'Hello World!\nThis is a test file.\n'

|nopar|
You will recall that the strange ``'\n'`` character on the end of
the string is a `newline`:dt: character; this is equivalent to
pressing *Enter* on a keyboard and starting a new line.  (There is
also a ``'\t'`` character, representing a tab.)
Note that we can open and read a file in one step:

    >>> text = open('corpus.txt', 'rU').read()

|nopar|
We can also read a file one line at a time using the ``for`` loop construct:

    >>> f = open('corpus.txt', 'rU')
    >>> for line in f:
    ...     print line[:-1]
    Hello World!
    This is a test file.

|nopar| 
Here we use the slice ``[:-1]`` to remove the newline character at the end of
the input line.  
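One caveat: the slice ``[:-1]`` blindly removes the last character, so
if the final line of a file lacks a trailing newline, its last
character is lost. A safer sketch uses ``rstrip('\n')``, which only
removes a newline if one is present (here ``StringIO`` stands in for a
real file):

```python
from io import StringIO

# StringIO stands in for a real file; note the missing final newline.
f = StringIO('Hello World!\nThis is a test file.')
lines = [line.rstrip('\n') for line in f]
print(lines)
```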

Extracting text from the Web
----------------------------

To read in a web page, we use ``urlopen()``:

    >>> from urllib import urlopen
    >>> page = urlopen("http://news.bbc.co.uk/").read()
    >>> print page[:60]
    <!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN"

|nopar| 
Web pages are usually in |HTML| format.  To extract the plain text, we
can strip out the |HTML| markup, that is remove all material enclosed in
angle brackets. Let's digress briefly to consider how to carry out
this task using regular expressions. Our first attempt might look as
follows:

    >>> line = '<title>BBC NEWS | News Front Page</title>'
    >>> import re
    >>> new = re.sub(r'<.*>', '', line)

|nopar| 
So the regular expression ``'<.*>'`` is intended to match a
pair of left and right angle brackets, with a string of any characters
intervening. However, look at what the result is:

    >>> new
    '' 

|nopar| 
What has happened here? The problem is two-fold. First, as already
noted, the wildcard ``'.'`` matches any character other than
``'\n'``, so in particular it will match ``'>'`` and ``'<'``. Second,
the ``'*'`` operator is 'greedy', in the sense that it matches
as many characters as it can. In the example we just looked at,
therefore, ``'.*'`` will return not the shortest match, namely
``'title'``, but the longest match, ``'title>BBC NEWS | News Front
Page</title'``. 

In order to get the results we want, we need to think about the task
in a slightly different way. Our assumption is that after we have
encountered a ``'<'``, any character can occur within the tag except a
``'>'``; once we find the latter, we know the tag is closed. Now, we
have already seen how to match everything but |alpha|, for some
character |alpha|; we use a negated range expression. In this case,
the expression we need is ``'[^>]'``: match everything except
``'>'``. This range expression is then quantified with the ``'*'``
operator. In our revised example below, we use the improved regular
expression, and we also normalize whitespace, replacing any sequence
of one or more spaces, tabs or newlines (these are all matched by
``'\s+'``) with a single space character.

    >>> import re
    >>> page = re.sub(r'<[^>]*>', '', page)
    >>> page = re.sub(r'\s+', ' ', page)
    >>> print page[:60]
     BBC NEWS | News Front Page News Sport Weather World Service

You will probably find it useful to borrow the structure of this code
snippet for future tasks involving regular expressions: each time
through a series of substitutions, the result of operating on ``page``
gets assigned as the new value of ``page``. This approach allows us to
decompose the transformations we need into a series of simple regular
expression substitutions, each of which can be tested and debugged on
its own.
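That structure can be captured once and for all in a small helper. The
following sketch (plain Python, not part of NLTK) applies a list of
(pattern, replacement) pairs in order, so each substitution can be
tested and debugged on its own:

```python
import re

def clean_html(page):
    # Apply each substitution in turn, feeding the result forward.
    substitutions = [
        (r'<[^>]*>', ''),  # strip markup tags
        (r'\s+', ' '),     # normalize whitespace
    ]
    for pattern, replacement in substitutions:
        page = re.sub(pattern, replacement, page)
    return page

print(clean_html('<title>BBC NEWS | News Front Page</title>'))
```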


Extracting text from NLTK Corpora
---------------------------------


|NLTK| is distributed with several corpora and corpus samples, many of
which are supported by the ``corpora`` package.  Here we import
``gutenberg``, a selection of texts from the `Project Gutenberg
<http://www.gutenberg.org/>`_ electronic text archive, and list the
items it contains:

    >>> from nltk_lite.corpora import gutenberg
    >>> gutenberg.items
    ['austen-emma', 'austen-persuasion', 'austen-sense', 'bible-kjv',
    'blake-poems', 'blake-songs', 'chesterton-ball', 'chesterton-brown',
    'chesterton-thursday', 'milton-paradise', 'shakespeare-caesar',
    'shakespeare-hamlet', 'shakespeare-macbeth', 'whitman-leaves']

|nopar|
Next we iterate over the text content to find the number of word tokens:

    >>> count = 0
    >>> for word in gutenberg.raw('whitman-leaves'):
    ...     count += 1
    >>> print count
    154873

|NLTK| also includes the Brown Corpus, the first million-word,
part-of-speech tagged electronic corpus of English, created in 1961 at
Brown University.  Each of the sections ``a`` through ``r`` represents
a different genre.

    >>> from nltk_lite.corpora import brown
    >>> brown.items
    ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'r']

|nopar|
We can extract individual sentences (as lists of words) from the
corpus using the ``extract()`` function. This is called below with
``0`` as an argument, indicating that we want the first sentence of
the corpus to be returned; ``1`` will return the second sentence, and
so on. ``brown.raw()`` is an iterator which gives us the words without
their part-of-speech tags.

    >>> from nltk_lite.corpora import extract
    >>> print extract(0, brown.raw())
    ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
    'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
    'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
    'took', 'place', '.']

Exercises
---------

#. |easy| Create a small text file, and write a program to read it and print it
   with a line number at the start of each line.

#. |easy| Use the corpus module to read ``austen-persuasion.txt``.
   How many word tokens does this book have?  How many word types?

#. |easy| Use the Brown corpus reader ``brown.raw()`` to access some sample
   text in two different genres.

#. |easy| Read in the texts of the *State of the Union* addresses, using the
   ``state_union`` corpus reader.  Count occurrences of ``men``, ``women``,
   and ``people`` in each document.  What has happened to the usage of these
   words over time?

#. |soso| Write a program to generate a table of token/type ratios.
   Include the full set of Brown Corpus genres.  Use the
   dictionary ``brown.item_name`` to find out the genre of each
   section of the corpus.  Which genre
   has the lowest diversity (greatest number of tokens per type)?
   Is this what you would have expected?

#. |soso| Read in some text from a corpus, tokenize it, and print the list of
   all `wh`:lx:\ -word types that occur.  (`wh`:lx:\ -words in English
   are words used in questions, relative clauses and
   exclamations: `who`:lx:, `which`:lx:, `what`:lx:, and so on.) Print
   them in order.  Are any words duplicated in this list, because of
   the presence of case distinctions or punctuation?

#. |soso| Examine the results of processing the URL
   ``http://news.bbc.co.uk/`` using the regular expressions suggested
   above. You will see that there is still a fair amount of
   non-textual data there, particularly Javascript commands. You may
   also find that sentence breaks have not been properly
   preserved. Define further regular expressions which improve the
   extraction of text from this web page.

#. |soso| Take a copy of the ``http://news.bbc.co.uk/`` front page on three
   different days, say at two-day intervals. This should give you three
   different files, ``bbc1.txt``, ``bbc2.txt`` and ``bbc3.txt``, each
   corresponding to a different snapshot of world events. Collect the
   100 most frequent word tokens for each file. What can you tell
   from the changes in frequency?

#. |soso| Define a function ``ghits()``, which takes a word as its argument, and
   builds a Google query string of the form ``http://www.google.com/search?q=word``.
   Strip the |HTML| markup and normalize whitespace.  Search for a substring
   of the form ``Results 1 - 10 of about``, followed by some number
   `n`:math:,  and extract `n`:math:.
   Convert this to an integer and return it.

#. |soso| Try running the various chatbots.  How
   *intelligent* are these programs?  Take a look at the program code
   and see if you can discover how it works.  You can find the code
   online at: ``http://nltk.sourceforge.net/lite/nltk_lite/chat/``.

------------------------------
Tokenization and Normalization
------------------------------

Tokenization, as we saw, is the task of extracting a sequence of elementary tokens
that constitute a piece of language data. In our first attempt to
carry out this task, we started off with a string of characters, and
used the ``split()`` method to break the string at whitespace
characters. (Recall that 'whitespace' covers not only interword space,
but also tabs and newlines.)
We pointed out that tokenization
based solely on whitespace is too simplistic for most applications.
In this section we will take a more sophisticated approach, using
regular expressions to specify which character sequences should be
treated as words.  We will also consider important ways to normalize
tokens.

Tokenization with Regular Expressions
-------------------------------------

.. EL: I don't approve of defining a 'regexp' function inside a
   'regexp' module and then importing it into the parent package
   so that it's impossible to access the module. :-/

.. EL: I strongly object to having the regexp tokenizer need an
   advanced=True argument to deal with standard regexps.

The function ``tokenize.regexp()`` takes a text string and a
regular expression, and returns the list of substrings that match the
regular expression.  To define a tokenizer that includes punctuation
as separate tokens, we could do the following:

    >>> from nltk_lite import tokenize
    >>> text = '''Hello.  Isn't this fun?'''
    >>> pattern = r'\w+|[^\w\s]+'
    >>> list(tokenize.regexp(text, pattern))
    ['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']

|nopar| 
The regular expression in this example will match a sequence
consisting of one or more word characters ``\w+``.  It will also match
a sequence consisting of one or more punctuation characters (or
non-word, non-space characters ``[^\w\s]+``).  This is another negated range
expression; it matches one or more characters which are not word
characters (i.e., not a match for ``\w``) and not a whitespace
character (i.e., not a match for ``\s``).  We use the disjunction
operator ``|`` to combine these into a single complex expression
``\w+|[^\w\s]+``.
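If you want to experiment without NLTK, ``re.findall()`` gives
essentially the same behaviour for this pattern, returning each
non-overlapping match in turn (we assume here that
``tokenize.regexp()`` works like ``re.findall()``):

```python
import re

# Each token is either a run of word characters or a run of
# punctuation (non-word, non-space) characters.
text = "Hello.  Isn't this fun?"
tokens = re.findall(r'\w+|[^\w\s]+', text)
print(tokens)
```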

There are a number of ways we might want to improve this regular
expression.  For example, it currently breaks `$22.50`:lx: into four
tokens; but we might want it to treat this as a single token.
Similarly, we would want to treat `U.S.A.`:lx: as a single token.  We can
deal with these by adding further clauses to the tokenizer's regular
expression; note that such clauses must come before the general
``\w+`` clause, which would otherwise match their leading characters
on its own.  For readability we break the expression up and insert
comments, and use the ``re.VERBOSE`` flag, so that Python knows to
strip out the embedded whitespace and comments.

    >>> import re
    >>> text = 'That poster costs $22.40.'
    >>> pattern = re.compile(r'''
    ...     \$?\d+(?:\.\d+)?  # currency amounts, e.g. $12.50
    ...   | (?:[A-Z]\.)+      # abbreviations, e.g. U.S.A.
    ...   | \w+               # sequences of 'word' characters
    ...   | [^\w\s]+          # sequences of punctuation
    ... ''', re.VERBOSE)
    >>> list(tokenize.regexp(text, pattern))
    ['That', 'poster', 'costs', '$22.40', '.']
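Here is a standard-library sketch of such a multi-clause tokenizer,
using non-capturing groups ``(?:...)`` so that ``findall()`` returns
whole matches rather than group contents. Clause order matters: the
currency and abbreviation clauses must precede ``\w+``, or ``\w+``
would claim the leading ``U`` of ``U.S.A.`` on its own.

```python
import re

# Clause order matters: currency and abbreviations before the general \w+.
pattern = re.compile(r'''
    \$?\d+(?:\.\d+)?    # currency amounts, e.g. $12.50
  | (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
  | \w+                 # sequences of 'word' characters
  | [^\w\s]+            # sequences of punctuation
''', re.VERBOSE)

print(pattern.findall('I live in the U.S.A. and paid $22.40'))
```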

It is sometimes more convenient to write a regular expression
matching the material that appears *between* tokens, such as whitespace
and punctuation.  The ``tokenize.regexp()`` function permits
an optional boolean parameter ``gaps``; when set to ``True`` the
pattern is matched against the gaps.  For example, here is how
``tokenize.whitespace()`` is defined:

    >>> list(tokenize.regexp(text, pattern=r'\s+', gaps=True))
    ['That', 'poster', 'costs', '$22.40.']

|nopar| Of course, we can also call the ``tokenize.whitespace()`` function directly:

    >>> text = 'That poster costs $22.40.'
    >>> list(tokenize.whitespace(text))
    ['That', 'poster', 'costs', '$22.40.']
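The standard library offers the same gap-based view through
``re.split()``, which breaks the string wherever the pattern matches:

```python
import re

# Split on runs of whitespace, the 'gaps' between tokens.
text = 'That poster costs $22.40.'
print(re.split(r'\s+', text))
```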

Lemmatization and Normalization
-------------------------------

Earlier we talked about counting word tokens, and completely ignored
the rest of the sentence in which these tokens appeared.  Thus, for an
example like `I saw the saw`:lx:, we would have treated both `saw`:lx:
tokens as instances of the same type.  However, one is a form of the
verb `see`:lx:, and the other is the name of a cutting instrument.
How do we know that these two forms of `saw`:lx: are unrelated?  One
answer is that as speakers of English, we know that these would appear
as different entries in a dictionary. Another, more empiricist, answer
is that if we looked at a large enough number of texts, it would
become clear that the two forms have very different distributions. For
example, only the noun `saw`:lx: will occur immediately after
determiners such as `the`:lx:.  Distinct words which have the same
written form are called `homographs`:dt:. We can distinguish
homographs with the help of context; often the previous word suffices.
We will explore this idea of context briefly, before addressing the
main topic of this section.

A `bigram`:dt: is simply a pair of words.
For example, in the sentence `She sells sea shells by the sea shore`:lx:,
the bigrams are
`She sells`:lx:,
`sells sea`:lx:,
`sea shells`:lx:,
`shells by`:lx:,
`by the`:lx:,
`the sea`:lx:,
`sea shore`:lx:.
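Extracting the bigrams of a token list is simple: pair the list with a
copy of itself shifted by one position. A minimal sketch:

```python
# Pair each token with its successor to form the bigrams.
sent = 'She sells sea shells by the sea shore'.split()
bigrams = list(zip(sent, sent[1:]))
print(bigrams)
```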

As a first approximation to discovering the distribution of a word, we
can look at all the bigrams it occurs in.  Let's consider all bigrams
from the Brown Corpus which have the word `often`:lx: as first
element.  Here is a small selection, ordered by their counts::

    often ,		16
    often a		10
    often in		8
    often than		7
    often the		7
    often been		6
    often do		5
    often called  	4
    often appear	3
    often were		3
    often appeared	2
    often are		2
    often did		2
    often is		2
    often appears	1
    often call   	1

|nopar|
In the topmost entry, we see that `often`:lx: is frequently followed
by a comma.  This suggests that `often`:lx: is common at the end of
phrases. We also see that `often`:lx: precedes verbs, presumably as an
adverbial modifier.  We might infer from this that if we come across
`saw`:lx: in the context `often __`:lx:, then `saw`:lx: is being used
as a verb.
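Counts like these can be collected with a dictionary over the bigrams
of a corpus. The following sketch uses a tiny made-up token list rather
than the Brown Corpus:

```python
from collections import defaultdict

tokens = 'it is often said , and often , but often said again'.split()

# Count the word following each occurrence of 'often'.
following = defaultdict(int)
for w1, w2 in zip(tokens, tokens[1:]):
    if w1 == 'often':
        following[w2] += 1

# Order by decreasing count, as in the table above.
print(sorted(following.items(), key=lambda item: -item[1]))
```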

You will also see that this list includes different grammatical forms
of the same verb. We can form separate groups consisting of
`appear`:lx: |tilde| `appears`:lx: |tilde| `appeared`:lx:; `call`:lx:
|tilde| `called`:lx:; `do`:lx: |tilde| `did`:lx:; and `been`:lx:
|tilde| `were`:lx: |tilde| `are`:lx: |tilde| `is`:lx:.  It is common
in linguistics to say that two forms such as `appear`:lx: and
`appeared`:lx: belong to a more abstract notion of a word called a
`lexeme`:dt:; by contrast, `appeared`:lx: and `called`:lx: belong to
different lexemes. You can think of a lexeme as corresponding to an
entry in a dictionary, and a `lemma`:dt: as the headword for that entry.
By convention, small capitals are used when referring to a lexeme or lemma: `appear`:lex:.

Although `appeared`:lx: and `called`:lx: belong to different lexemes,
they do have something in common: they are both past tense forms. This
is signalled by the segment `-ed`:lx:, which we call a morphological
`suffix`:dt:. We also say that such morphologically complex forms are
`inflected`:dt:.  If we strip off the suffix, we get something called
the `stem`:dt:, namely `appear`:lx: and `call`:lx: respectively. While
`appeared`:lx:, `appears`:lx: and `appearing`:lx: are all
morphologically inflected, `appear`:lx: lacks any morphological
inflection and is therefore termed the `base`:dt: form. In English,
the base form is conventionally used as the `lemma`:dt: for a word.

..
   The actual word form that dictionaries use to index
   lexemes often varies by language. 
   While English uses the stem form of verbs, Latin uses the first person

Our notion of context would be more compact if we could group
different forms of the various verbs into their lemmas; then we could
study which verb lexemes are typically modified by a particular
adverb. `Lemmatization`:dt: |mdash| the process of mapping grammatical forms
into their lemmas |mdash| would yield the
following picture of the distribution of `often`:lx:.

  ::

    often ,		16
    often be		13
    often a		10
    often in		8
    often than		7
    often the		7
    often do		7
    often appear	6
    often call		5

Lemmatization is a rather sophisticated process which requires a
mixture of rules for regular inflections and table look-up for
irregular morphological patterns. Within |NLTK|, a simpler approach is
offered by the `Porter Stemmer`:dt: and the `Lancaster Stemmer`:dt:,
which strip inflectional suffixes from words,
collapsing the different forms of `appear`:lex: and `call`:lex:.
Given the simple nature of these stemming algorithms, you may not be
surprised to learn that they do not attempt to identify
`were`:lx: as a form of the lexeme `be`:lex:.

    >>> from nltk_lite import stem
    >>> stemmer = stem.Porter()
    >>> verbs = ['appears', 'appear', 'appeared', 'calling', 'called']
    >>> stems = []
    >>> for verb in verbs:
    ...     stemmed_verb = stemmer.stem(verb)
    ...     if stemmed_verb not in stems:
    ...         stems.append(stemmed_verb)
    >>> stems
    ['appear', 'call']
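To give a feel for what suffix stripping involves, here is a toy
stemmer; it is emphatically *not* the Porter algorithm, which applies
ordered, context-sensitive rewrite rules, but it shows the basic idea
of removing inflectional endings:

```python
def toy_stem(word):
    # Strip the first matching suffix, but insist on a remaining stem
    # of at least three characters so 'is' doesn't become 'i'.
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

verbs = ['appears', 'appear', 'appeared', 'calling', 'called']
print([toy_stem(v) for v in verbs])
```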
    
Lemmatization and stemming can be regarded as special cases of
`normalization`:dt:.  They identify a canonical representative for a
group of related word forms.  By its nature, normalization collapses
distinctions. An example is case normalization, where all variants are
mapped into a single format. What counts as the normalized form will
vary according to context. Often, we convert everything into lower
case, so that words which were capitalized by virtue of being
sentence-initial are treated the same as those which occur elsewhere
in the sentence. The Python string method ``lower()`` will accomplish
this for us:

    >>> text = 'This is THE time'
    >>> text.lower()
    'this is the time'

We need to be careful, however; case normalization will also collapse the `New`:lx:
of `New York`:lx: with the `new`:lx: of `my new car`:lx:.

A final issue for normalization is the presence of contractions, such
as `didn't`:lx:.  If we are analyzing the meaning
of a sentence, it would probably be more useful to normalize this
form to two separate forms: `did`:lx: and `not`:lx:.

..
   Term variation is a particularly challenging factor in biomedical
   texts. For example, the following are just some of the possible
   variants for `nuclear factor kappa B`:lx:, the name for a family of
   proteins::

     nuclear factor kappa B, nuclear factor Kappa-B, nuclear factor kappaB,
     nuclear factor kB, NF-KB, NF-kb, NF-kB, NFKB, NFkB, NF kappa-B

   Although work is ongoing to standardize biomedical nomenclature, the
   rate at which new entities are discovered and described outstrips
   these efforts at present.

.. _sec-list-comprehensions:

Aside: List Comprehensions
--------------------------

Lemmatization and normalization involve applying the same operation to
each word token in a text.  `List comprehensions`:dt: are a convenient Python
construct for doing this.  Here we lowercase each word:

    >>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
    >>> [word.lower() for word in sent]
    ['the', 'dog', 'gave', 'john', 'the', 'newspaper']
    
Here we rewrite the loop for identifying verb stems, from
the previous section:

    >>> [stemmer.stem(verb) for verb in verbs]
    ['appear', 'appear', 'appear', 'call', 'call']
    >>> set(stemmer.stem(verb) for verb in verbs)
    set(['call', 'appear'])

|nopar|
This syntax might be reminiscent of the notation used for building
sets, e.g. {(x,y) | x\ :superscript:`2` + y\ :superscript:`2` = 1}.
Just as this set definition incorporates a constraint, list
comprehensions can constrain the items they include.  In the next
example we remove all determiners from a list of words:

    >>> def is_lexical(word):
    ...     return word.lower() not in ('a', 'an', 'the', 'that', 'to')
    >>> [word for word in sent if is_lexical(word)]
    ['dog', 'gave', 'John', 'newspaper']

|nopar|
Now we can combine the two ideas, to pull out the content words
and normalize them.

    >>> [word.lower() for word in sent if is_lexical(word)]
    ['dog', 'gave', 'john', 'newspaper']

List comprehensions can build nested structures too.  For example,
the following code builds a list of tuples, where each tuple consists
of a word and its lowercased stem.

    >>> from nltk_lite.corpora import brown, extract
    >>> sent = extract(0, brown.raw())
    >>> [(x, stemmer.stem(x).lower()) for x in sent]
    [('The', 'the'), ('Fulton', 'fulton'), ('County', 'counti'),
    ('Grand', 'grand'), ('Jury', 'juri'), ('said', 'said'), ('Friday', 'friday'),
    ('an', 'an'), ('investigation', 'investig'), ('of', 'of'),
    ("Atlanta's", "atlanta'"), ('recent', 'recent'), ('primary', 'primari'),
    ('election', 'elect'), ('produced', 'produc'), ('``', '``'), ('no', 'no'),
    ('evidence', 'evid'), ("''", "''"), ('that', 'that'), ('any', 'ani'),
    ('irregularities', 'irregular'), ('took', 'took'), ('place', 'place'), ('.', '.')]

Exercises
---------

1. |easy| **Regular expression tokenizers:**
   Save some text into a file ``corpus.txt``.  Define a function ``load(f)``
   that reads from the file named in its sole argument, and returns a string
   containing the text of the file.

   a) Use ``tokenize.regexp()`` to create a tokenizer which tokenizes
      the various kinds of punctuation in this text.  Use a single
      regular expression, with inline comments using the
      ``re.VERBOSE`` flag.
   
   b) Use ``tokenize.regexp()`` to create a tokenizer which tokenizes
      the following kinds of expression: monetary amounts; dates; names
      of people and companies.

#. |easy| Rewrite the following loop as a list comprehension:

    >>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
    >>> result = []
    >>> for word in sent:
    ...     word_len = (word, len(word))
    ...     result.append(word_len)
    >>> result
    [('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]

#. |soso| Use the Porter Stemmer to normalize some tokenized text, calling
   the stemmer on each word.

#. |soso| Readability measures are used to score the reading difficulty of a
   text, for the purposes of selecting texts of appropriate difficulty
   for language learners.  For example, the Automated Readability Index (ARI)
   of a text is defined to be:
   ``4.71 *`` |mu|\ :subscript:`w` ``+ 0.5 *`` |mu|\ :subscript:`s` ``- 21.43``,
   where |mu|\ :subscript:`w` is the mean word length (in letters), and
   where |mu|\ :subscript:`s` is the mean sentence length (in words).
   With the help of your word and sentence
   tokenizers, compute the ARI scores for a collection of texts.

#. |hard| Rewrite the following nested loop as a nested list comprehension:

    >>> words = ['attribution', 'confabulation', 'elocution',
    ...          'sequoia', 'tenacious', 'unidirectional']
    >>> vsequences = set()
    >>> for word in words:
    ...     vowels = []
    ...     for char in word:
    ...         if char in 'aeiou':
    ...             vowels.append(char)
    ...     vsequences.add(''.join(vowels))
    >>> sorted(vsequences)
    ['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']    

.. sorted(set(''.join(c for c in word if c in 'aeiou') for word in words))

#. |hard| **Sentence tokenizers:**
   Develop a sentence tokenizer.  Test it on the Brown Corpus, which
   has been grouped into sentences.

------------------------------
Lexical Resources (INCOMPLETE)
------------------------------

[This section will contain a discussion of lexical resources,
focusing on Wordnet, but also including the ``cmudict`` and ``timit``
corpus readers.]

Pronunciation Dictionary
------------------------

Here we access the pronunciation of words...

.. doctest-ignore::
    >>> from nltk_lite.corpora import cmudict
    >>> from string import join
    >>> for word, num, pron in cmudict.raw():
    ...     if pron[-4:] == ('N', 'IH0', 'K', 'S'):
    ...         print word.lower(),
    atlantic's audiotronics avionics beatniks calisthenics centronics
    chetniks clinic's clinics conics cynics diasonics dominic's
    ebonics electronics electronics' endotronics endotronics' enix
    environics ethnics eugenics fibronics flextronics harmonics
    hispanics histrionics identics ionics kibbutzniks lasersonics
    lumonics mannix mechanics mechanics' microelectronics minix minnix
    mnemonics mnemonics molonicks mullenix mullenix mullinix mulnix
    munich's nucleonics onyx panic's panics penix pennix personics
    phenix philharmonic's phoenix phonics photronics pinnix
    plantronics pyrotechnics refuseniks resnick's respironics sconnix
    siliconix skolniks sonics sputniks technics tectonics tektronix
    telectronics telephonics tonics unix vinick's vinnick's vitronics

WordNet Semantic Network
------------------------

`Wordnet`:topic:

.. note:: Before using Wordnet it must be installed on your machine,
          along with NLTK version 0.8.
          Please see the instructions on the |NLTK| website.
          Help on the ``wordnet`` interface is available using
          ``help(wordnet)``. 

Consider the following sentence:

.. _carex1:
.. ex::
   Benz is credited with the invention of the motorcar.

If we replace `motorcar`:lx: in carex1_ by `automobile`:lx:, the
meaning of the sentence stays pretty much the same:

.. _carex2:
.. ex::
   Benz is credited with the invention of the automobile.

Since everything else in the sentence has remained unchanged, we can
conclude that the words `motorcar`:lx: and `automobile`:lx: have the
same meaning. More technically, we say that they are `synonyms`:dt:. 

`Wordnet`:idx: is a semantically-oriented dictionary which will
allow us to find the set of synonyms |mdash| or `synset`:dt: |mdash|
for any word. However, in order to look up the senses of a word, we need to pick
a part of speech for the word.  Wordnet contains four dictionaries: ``N``
(nouns), ``V`` (verbs), ``ADJ`` (adjectives), and ``ADV``
(adverbs). To simplify our discussion, we will focus on the ``N``
dictionary here.  Let's look up `motorcar`:lx: in the ``N`` dictionary.
    
    >>> from nltk_lite.wordnet import *
    >>> car = N['motorcar']
    >>> car
    motorcar (noun)

The variable ``car`` is now bound to a ``Word`` object.
Words will often have more than one sense, where each
sense is represented by a synset. However,
`motorcar`:lx: has only one sense in Wordnet, as we can discover by
checking the length of ``car``. We can then find the synset (a set of
lemmas), the words it contains, and a gloss.

    >>> len(car)
    1
    >>> car[0]
    {noun: car, auto, automobile, machine, motorcar}
    >>> [word for word in car[0]]
    ['car', 'auto', 'automobile', 'machine', 'motorcar']
    >>> car[0].gloss
    'a motor vehicle with four wheels; usually propelled by an
    internal combustion engine; 
    "he needs a car to get to work"'

The ``wordnet`` module also defines ``Synset``\ s.
Let's look at a word which is `polysemous`:dt:\ ;
that is, which has multiple synsets:

    >>> poly = N['pupil']
    >>> for synset in poly:
    ...     print synset
    {noun: student, pupil, educatee}
    {noun: schoolchild, school-age child, pupil}
    {noun: pupil}
    >>> poly[2].gloss
    'the contractile aperture in the center of the iris of the eye;
    resembles a large black dot'

We can think of synsets as being concrete manifestations of the
concepts that are `lexicalized`:dt: by words in a particular
language. In principle, we can postulate concepts that have no lexical
realization in English. Wordnet builds on a tradition which claims that
concepts are linked together in a hierarchy. Some concepts are very
general, such as *Entity*, *State*, *Event* |mdash| these are called
`unique beginners`:dt: in Wordnet. Others, such as *gas guzzler* and
*hatchback*, are much more specific. A small portion of a concept
hierarchy is illustrated in Figure wn-hierarchy_. The edges between nodes
indicate the hypernym/hyponym relation; the dotted line at the top
indicates that *artefact* is a non-immediate hypernym of
*motorcar*.

.. _wn-hierarchy:
.. figure:: ../images/wordnet-hierarchy.png
   :scale: 15

   Fragment of Wordnet Concept Hierarchy

Wordnet has been designed to make it easy to navigate between
concepts. For example, given a concept like
*motorcar*, we can look at the concepts which are more specific;
these are usually called `hyponyms`:dt:. Here is one way to carry out this
navigation:

    >>> for concept in car[0][HYPONYM][:10]:
    ...     print concept
    {noun: ambulance}
    {noun: beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon}
    {noun: bus, jalopy, heap}
    {noun: cab, hack, taxi, taxicab}
    {noun: compact, compact car}
    {noun: convertible}
    {noun: coupe}
    {noun: cruiser, police cruiser, patrol car, police car, prowl car, squad car}
    {noun: electric, electric automobile, electric car}
    {noun: gas guzzler}

We can also move up the hierarchy, by looking at broader concepts than
*motorcar*. The ``wordnet`` package offers a shortcut for finding the
immediate `hypernym`:dt: of a concept:

    >>> car[0][HYPERNYM]
    [{noun: motor vehicle, automotive vehicle}]

Of course, we can also look for the hypernyms of hypernyms; from any
synset we can trace (multiple) paths back to a unique beginner.
Synsets have a method ``tree()`` which produces a nested list
structure.

    >>> from pprint import pprint
    >>> pprint(N['car'][0].tree(HYPERNYM))
    [{noun: car, auto, automobile, machine, motorcar},
     [{noun: motor_vehicle, automotive_vehicle},
      [{noun: self-propelled_vehicle},
       [{noun: wheeled_vehicle},
        [{noun: vehicle},
         [{noun: conveyance, transport},
          [{noun: instrumentality, instrumentation},
           [{noun: artifact, artefact},
            [{noun: whole, unit},
             [{noun: object, physical_object},
              [{noun: physical_entity}, [{noun: entity}]]]]]]]],
        [{noun: container},
         [{noun: instrumentality, instrumentation},
          [{noun: artifact, artefact},
           [{noun: whole, unit},
            [{noun: object, physical_object},
             [{noun: physical_entity}, [{noun: entity}]]]]]]]]]]]

A related method ``closure()`` produces a flat version of this structure,
with any repeats eliminated.
Both of these methods take an optional ``depth`` argument that permits
us to limit the number of steps taken, which is important when using
unbounded relations like ``SIMILAR``.
Table wordnet-rel_ lists the most important lexical relations supported
by Wordnet.  See ``dir(wordnet)`` for a full list.

.. table:: wordnet-rel

   ===========  ================  ==============================================
   Hypernym     more general      `animal`:lx: is a hypernym of `dog`:lx:
   Hyponym      more specific     `dog`:lx: is a hyponym of `animal`:lx:
   Meronym      part of           `door`:lx: is a meronym of `house`:lx:
   Holonym      has part          `house`:lx: is a holonym of `door`:lx:
   Synonym      similar meaning   `car`:lx: is a synonym of `automobile`:lx:
   Antonym      opposite meaning  `like`:lx: is an antonym of `dislike`:lx:
   Entailment   necessary action  `step`:lx: is an entailment of `walk`:lx:
   ===========  ================  ==============================================

   Major WordNet Lexical Relations

Recall that we can iterate over the words of a synset, with ``for word in synset``.
We can also test if a word is in a dictionary, e.g. ``word in V``.
Let's put these together to find "animal words" that are used as verbs.
Since there are a lot of these, we will cut this off at depth 4.

    >>> animals = N['animal'][0].closure(HYPONYM, depth=4)
    >>> [word for synset in animals for word in synset if word in V]
    ['pet', 'stunt', 'prey', 'quarry', 'game', 'mate', 'head', 'dam',
     'sire', 'steer', 'orphan', 'spat', 'sponge', 'worm', 'grub', 'baby',
     'pup', 'whelp', 'cub', 'kit', 'kitten', 'foal', 'lamb', 'fawn',
     'bird', 'grouse', 'stud', 'hog', 'fish', 'cock', 'parrot', 'frog',
     'beetle', 'bug', 'bug', 'queen', 'leech', 'snail', 'slug', 'clam',
     'cockle', 'oyster', 'scallop', 'scollop', 'escallop', 'quail']

WordNet Similarity
------------------

It is often useful to be able to tell whether two lexical concepts are
`semantically related`:dt:. For example, in order to check whether a
particular instance of the word `bank`:lx: means
*financial institution*, we can count the number of nearby words
that are semantically related to this sense.  Using WordNet, we can
investigate whether semantic relatedness can be
expressed in terms of the graph structure of the concept
hierarchy. More specifically, we would expect that the semantic
relatedness of two concepts correlates with the length of the path
between them. The ``wordnet`` package includes a variety of measures
which incorporate this basic insight.  For example,
``path_similarity`` assigns a score in the range 0\ |ndash|\
1, based on the shortest path that connects the concepts in the hypernym
hierarchy (-1 is returned in those cases where a path cannot be
found).  A score of 1 represents identity, i.e., comparing a sense with
itself will return 1.
    
    >>> from nltk_lite.wordnet import *
    >>> N['poodle'][0].path_similarity(N['dalmatian'][1])
    0.33333333333333331
    >>> N['dog'][0].path_similarity(N['cat'][0])
    0.20000000000000001
    >>> V['run'][0].path_similarity(V['walk'][0])
    0.25
    >>> V['run'][0].path_similarity(V['think'][0])
    -1
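
The disambiguation idea mentioned at the start of this section, counting
nearby words that are related to a candidate sense, can be sketched in a
few lines. Here ``relatedness`` is a stand-in for one of the WordNet
similarity measures above; the function names are illustrative, not part
of the ``wordnet`` package:

```python
def sense_score(context_words, sense, relatedness):
    # Sum the relatedness of each nearby word to the candidate sense;
    # the sense with the highest total is preferred.
    return sum(relatedness(word, sense) for word in context_words)

def best_sense(context_words, senses, relatedness):
    # Pick the candidate sense that best fits the surrounding words.
    return max(senses, key=lambda s: sense_score(context_words, s, relatedness))
```

With a relatedness measure that scores `money`:lx: and `deposit`:lx: as
close to the *financial institution* sense of `bank`:lx:, ``best_sense``
would select that sense for an occurrence of `bank`:lx: in such a context.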

Several other similarity measures are provided in ``nltk_lite.wordnet``:
Leacock-Chodorow, Wu-Palmer, Resnik, Jiang-Conrath, and Lin.
For a detailed comparison of various measures, see [Budanitsky2006EWB]_.

Exercises
---------

1. |easy| Familiarize yourself with the Wordnet interface, by reading the
   documentation available via ``help(wordnet)``.

#. |easy| Investigate the holonym / meronym pointers for some nouns.  Note that there
   are three kinds (member, part, substance), so access is more specific,
   e.g., ``MEMBER_MERONYM``, ``SUBSTANCE_HOLONYM``.

#. |soso| Write a program to score the similarity of two nouns as the depth
   of their first common hypernym.
   
#. |hard| Use one of the predefined similarity measures to score
   the similarity of each of the following pairs of words.
   Rank the pairs in order of decreasing similarity.
   How close is your ranking to the order given here?
   (Note that this order was established experimentally
   by [MillerCharles1998]_.)
   
::
      car-automobile, gem-jewel, journey-voyage, boy-lad,
      coast-shore, asylum-madhouse, magician-wizard, midday-noon,
      furnace-stove, food-fruit, bird-cock, bird-crane, tool-implement,
      brother-monk, lad-brother, crane-implement, journey-car,
      monk-oracle, cemetery-woodland, food-rooster, coast-hill,
      forest-graveyard, shore-woodland, monk-slave, coast-forest,
      lad-wizard, chord-smile, glass-magician, rooster-voyage, noon-string.

-----------------------------
Simple Statistics with Tokens
-----------------------------

Example: Stylistics
-------------------

So far, we've seen how to count the number of tokens or types in a
document.  But it's much more interesting to look at *which* tokens or
types appear in a document.  We can use a Python dictionary to count the
number of occurrences of each word type in a document:

    >>> counts = {}
    >>> for word in text.split():
    ...     if word not in counts:
    ...         counts[word] = 0
    ...     counts[word] += 1

The first statement, ``counts = {}``, initializes the dictionary,
while the next four lines successively add entries to it and increment
the count each time we encounter a new token of a given type.  To view
the contents of the dictionary, we can iterate over its keys and print
each entry (here just for the first 10 entries):
    
    >>> for word in sorted(counts)[:10]:
    ...     print counts[word], word
    1 $1.1
    2 $130
    1 $36
    1 $45
    1 $490
    1 $5
    1 $62.625,
    1 $620
    1 $63
    2 $7

We can also print the number of times that a specific word we're
interested in appeared:

    >>> print counts['might']
    3

Applying this same approach to document collections that are
categorized by genre, we can learn something about the patterns of word
usage in those genres.  For example, Table brown-modals_ was
constructed by counting the number of times various modal words appear
in different genres in the Brown Corpus:

.. table:: brown-modals

   ==================  ===  =====  ===  =====  ====  ====
   Genre               can  could  may  might  must  will 
   ==================  ===  =====  ===  =====  ====  ====
   skill and hobbies   273  59     130  22     83    259 
   humor               17   33     8    8      9     13 
   fiction: science    16   49     4    12     8     16 
   press: reportage    94   86     66   36     50    387 
   fiction: romance    79   195    11   51     46    43 
   religion            84   59     79   12     54    64 
   ==================  ===  =====  ===  =====  ====  ====

   Use of Modals in Brown Corpus, by Genre

|nopar|
Observe that the most frequent modal in the reportage genre is
`will`:lx:, suggesting a focus on the future, while the most frequent
modal in the romance genre is `could`:lx:, suggesting a focus on
possibilities.

We can also measure the lexical diversity of a genre, by
calculating the ratio of word tokens to word types, as
shown in Table brown-types_.  (Genres with lower diversity
have a higher number of tokens per type.)

.. table:: brown-types

   ==================  ===========  ==========  =====
   Genre               Token Count  Type Count  Ratio
   ==================  ===========  ==========  =====
   skill and hobbies   82345        11935       6.9
   humor               21695        5017        4.3
   fiction: science    14470        3233        4.5
   press: reportage    100554       14394       7.0
   fiction: romance    70022        8452        8.3
   religion            39399        6373        6.2
   ==================  ===========  ==========  =====

   Word Types and Tokens in Brown Corpus, by Genre
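
The ratio column in the table can be computed directly; here is a
minimal sketch, where ``tokens`` is any list of word tokens:

```python
def tokens_per_type(tokens):
    # Lexical diversity expressed as tokens per type:
    # lower diversity gives a higher value.
    return len(tokens) / float(len(set(tokens)))
```

For example, ``tokens_per_type("the dog saw the cat".split())`` returns
``1.25`` (five tokens, four types).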

We can carry out a variety of interesting explorations simply by
counting words.  In fact, the field of *Corpus Linguistics* focuses
almost exclusively on creating and interpreting such tables of word
counts.  So far, our method for identifying word tokens has been
a little primitive, and we have not been able to separate punctuation
from the words.  We will take up this issue in the next section.

Example: Lexical Dispersion
---------------------------

Word tokens vary in their distribution throughout a text.  We can
visualize word distributions, to get an overall sense of topics and topic
shifts.  For example, consider the pattern of mention of the main
characters in Jane Austen's *Sense and Sensibility*: Elinor, Marianne,
Edward and Willoughby.  The following plot contains four rows, one for
each name, in the order just given.  Each row contains a series of lines, drawn
to indicate the position of each token.
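
The positions plotted on each row are simply the token offsets at which a
name occurs.  A sketch of how they might be computed (``dispersion.plot``
handles the actual drawing):

```python
def dispersion_offsets(tokens, targets):
    # Map each target word to the list of positions where it occurs.
    offsets = dict((target, []) for target in targets)
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets
```

For example, ``dispersion_offsets(['a', 'b', 'a', 'c'], ['a', 'c'])``
returns ``{'a': [0, 2], 'c': [3]}``.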

.. figure:: ../images/words-dispersion.png

   Lexical Dispersion

As you can see, `Elinor`:lx: and `Marianne`:lx: appear rather uniformly throughout
the text, while `Edward`:lx: and `Willoughby`:lx: tend to appear separately.
Here is the program that generated the above plot.  [NB. Requires |NLTK|-Lite 0.6.7].

.. doctest-ignore::
    >>> from nltk_lite.corpora import gutenberg
    >>> from nltk_lite.draw import dispersion
    >>> words = ['Elinor', 'Marianne', 'Edward', 'Willoughby']
    >>> dispersion.plot(gutenberg.raw('austen-sense'), words)

Frequency Distributions
-----------------------

We can do more sophisticated counting using *frequency distributions*.
Abstractly, a frequency distribution is a record of the number of
times each *outcome* of an *experiment* has occurred.  For instance, a
frequency distribution could be used to record the frequency of each
word in a document (where the "experiment" is examining a word, and
the "outcome" is the word's type).  Frequency distributions are
generally created by repeatedly running an experiment, and
incrementing the count for a sample every time it is an outcome of the
experiment.  The following program produces a frequency distribution
that records how often each word type occurs in a text.  It increments
a separate counter for each word, and prints the most frequently
occurring word:

    >>> from nltk_lite.probability import FreqDist
    >>> from nltk_lite.corpora import genesis
    >>> fd = FreqDist()
    >>> for token in genesis.raw():
    ...     fd.inc(token)
    >>> fd.max()
    'the'

|nopar|
Once we construct a frequency distribution that records
the outcomes of an experiment, we can use it to examine a number
of interesting properties of the experiment.  Some of these properties
are summarized in Table freqdist_.

.. table:: freqdist

   ==========  ===================  ===========================================
   Name        Sample               Description
   ==========  ===================  ===========================================
   Count       ``fd.count('the')``  number of times a given sample occurred
   Frequency   ``fd.freq('the')``   frequency of a given sample
   N           ``fd.N()``           number of samples
   Samples     ``fd.samples()``     list of distinct samples recorded
   Max         ``fd.max()``         sample with the greatest number of outcomes
   ==========  ===================  ===========================================

   Frequency Distribution Module
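
The operations in the table can be mimicked in plain Python.  The
following class is an illustrative sketch, not the ``nltk_lite``
implementation:

```python
class SimpleFreqDist:
    """A toy frequency distribution supporting the operations above."""
    def __init__(self):
        self._counts = {}
        self._N = 0
    def inc(self, sample):
        # Record one more occurrence of this sample.
        self._counts[sample] = self._counts.get(sample, 0) + 1
        self._N += 1
    def count(self, sample):
        return self._counts.get(sample, 0)
    def freq(self, sample):
        return self.count(sample) / float(self._N)
    def N(self):
        return self._N
    def samples(self):
        return list(self._counts)
    def max(self):
        return max(self._counts, key=self._counts.get)
```

Training it on a token stream and calling ``max()`` reproduces the
behavior of the genesis example above.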

We can also use a ``FreqDist`` to examine the distribution of word lengths
in a corpus.  For each word, we find its length, and increment the
count for words of this length.

    >>> def length_dist(text):
    ...     fd = FreqDist()                        # initialize frequency distribution
    ...     for token in genesis.raw(text):        # for each token
    ...         fd.inc(len(token))                 # found a word with this length
    ...     for i in range(1,15):                  # for each length from 1 to 14
    ...         print "%2d" % int(100*fd.freq(i)), # print the percentage of words with this length
    ...     print

|nopar|
Now we can call ``length_dist`` on a text to print the distribution of
word lengths.  We see that the most frequent word length for the English
sample is 3 characters, while the most frequent length for the Finnish
sample is 5-6 characters.

    >>> length_dist('english-kjv')
     2 14 28 21 13  7  5  2  2  0  0  0  0  0
    >>> length_dist('finnish')
     0  9  6 10 16 16 12  9  6  3  2  2  1  0

.. TODO: Plot interface

Conditional Frequency Distributions
-----------------------------------

A *condition* specifies the context in which an experiment is
performed.  Often, we are interested in the effect that conditions
have on the outcome for an experiment.  A *conditional frequency
distribution* is a collection of frequency distributions for the same
experiment, run under different conditions.  For example, we might
want to examine how the distribution of a word's length (the outcome)
is affected by the word's initial letter (the condition).

    >>> from nltk_lite.corpora import genesis
    >>> from nltk_lite.probability import ConditionalFreqDist
    >>> cfdist = ConditionalFreqDist()
    >>> for text in genesis.items:
    ...     for word in genesis.raw(text):
    ...         cfdist[word[0]].inc(len(word))
  
|nopar|
To plot the results, we construct a list of points, where the *x*
coordinate is the word length, and the *y* coordinate is the frequency
with which that word length is used:

    >>> for cond in cfdist.conditions():
    ...     wordlens = cfdist[cond].samples()
    ...     wordlens.sort()
    ...     points = [(i, cfdist[cond].freq(i)) for i in wordlens]

|nopar|
We can plot these points using the ``Plot`` function defined in
``nltk_lite.draw.plot``, as follows: 

.. doctest-ignore::
   >>> Plot(points).mainloop()

Predicting the Next Word
------------------------

Conditional frequency distributions are often used for prediction.
*Prediction* is the problem of deciding a likely outcome for a given
run of an experiment.  The decision of which outcome to predict is
usually based on the context in which the experiment is performed.
For example, we might try to predict a word's text (outcome), based on
the text of the word that it follows (context).

To predict the outcomes of an experiment, we first examine a
representative *training corpus*, where the context and outcome for
each run of the experiment are known.  When presented with a new run
of the experiment, we simply choose the outcome that occurred most
frequently for the experiment's context.

We can use a ``ConditionalFreqDist`` to find the most frequent
occurrence for each context.  First, we record each outcome in the
training corpus, using the context that the experiment was run under
as the condition.  Then, we can access the frequency distribution for
a given context with the indexing operator, and use the ``max()``
method to find the most likely outcome.

We will now use a ``ConditionalFreqDist`` to predict the most likely
next word in a text.  To begin, we load a corpus from a text file, and
create an empty ``ConditionalFreqDist``::

    >>> from nltk_lite.corpora import genesis
    >>> from nltk_lite.probability import ConditionalFreqDist
    >>> cfdist = ConditionalFreqDist()

|nopar|
We then examine each token in the corpus, and increment the
appropriate sample's count.  We use the variable ``prev`` to record
the previous word.

    >>> prev = None
    >>> for word in genesis.raw():
    ...     cfdist[prev].inc(word)
    ...     prev = word

.. Note:: Sometimes the context for an experiment is unavailable, or
   does not exist.  For example, the first token in a text does not
   follow any word.  In these cases, we must decide what context to
   use.  For this example, we use ``None`` as the context for the
   first token.  Another option would be to discard the first token.

.. TODO: refer back to bigram discussion

|nopar|
Once we have constructed a conditional frequency distribution for the
training corpus, we can use it to find the most likely word for any
given context. For example, taking the word `living`:lx: as our context,
we can inspect all the words that occurred in that context.

    >>> word = 'living'
    >>> cfdist[word].samples()
    ['creature,', 'substance', 'soul.', 'thing', 'thing,', 'creature']

We can set up a simple loop to generate text: we set an initial
context, picking the most likely token in that context as our next
word, and then using that word as our new context:

    >>> word = 'living'
    >>> for i in range(20):
    ...     print word,
    ...     word = cfdist[word].max()
    living creature that he said, I will not be a wife of the land
    of the land of the land

|nopar|
This simple approach to text generation tends to get stuck in loops,
as demonstrated by the text generated above.  A more advanced approach
would be to randomly choose each word, with more frequent words chosen
more often.
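
A sketch of this weighted approach in plain Python (without
``ConditionalFreqDist``): because each observed successor is stored once
per occurrence, a uniform ``random.choice`` over the stored list is
automatically frequency-weighted.

```python
import random
from collections import defaultdict

def train(tokens):
    # Map each word to the list of words observed to follow it;
    # None is the context for the first token, as in the note above.
    successors = defaultdict(list)
    prev = None
    for token in tokens:
        successors[prev].append(token)
        prev = token
    return successors

def generate(successors, word, length=10):
    output = []
    for _ in range(length):
        output.append(word)
        candidates = successors.get(word)
        if not candidates:
            break
        # Uniform choice over a multiset == frequency-weighted choice.
        word = random.choice(candidates)
    return output
```

Because the next word is sampled rather than fixed, repeated runs from
the same start word produce different texts, avoiding the loops seen
above.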

Exercises
---------

#. |easy| Pick a text, and explore the dispersion of particular words.  What
   does this tell you about the words, or the text?

#. |easy| Use the ``Plot`` function defined in ``nltk_lite.draw.plot`` to plot
   word-initial character against word length, as discussed in this section.

#. |soso| Write a program to create a table of word frequencies by genre,
   like the one given above for modals.  Choose your own words and
   try to find words whose presence (or absence) is typical of a genre.
   Discuss your findings.

#. |soso| **Zipf's Law**:
   Let *f(w)* be the frequency of a word *w* in free text. Suppose that
   all the words of a text are ranked according to their frequency,
   with the most frequent word first. Zipf's law states that the
   frequency of a word type is inversely proportional to its rank
   (i.e. *f.r = k*, for some constant *k*). For example, the 50th most
   common word type should occur three times as frequently as the
   150th most common word type.

   a) Write a function to process a large text and plot word
      frequency against word rank using the ``nltk_lite.draw.plot`` module. Do
      you confirm Zipf's law? (Hint: it helps to use logarithmic axes,
      by including ``scale='log'`` as a second argument to ``Plot()``).
      What is going on at the extreme ends of the plotted line?

   #) Generate random text, e.g. using ``random.choice("abcdefg ")``,
      taking care to include the space character.  You will need to
      ``import random`` first.  Use the string
      concatenation operator to accumulate characters into a (very)
      long string.  Then tokenize this string, and generate the Zipf
      plot as before, and compare the two plots.  What do you make of
      Zipf's Law in the light of this?

#. |soso| **Predicting the next word**: The word prediction program we saw in
   this chapter quickly gets stuck in a cycle.  Modify the program to
   choose the next word randomly, from a list of the *n* most likely
   words in the given context.  (Hint: store the *n* most likely words in
   a list ``lwords`` then randomly choose a word from the list using
   ``random.choice()``.)

   a) Select a particular genre, such as a section of the Brown Corpus,
      or a genesis translation, or one of the Gutenberg texts.  Train
      your system on this corpus and get it to generate random text.  You
      may have to experiment with different start words. How intelligible
      is the text?  Discuss the strengths and weaknesses of this method of
      generating random text.

   #) Try the same approach with different genres, and with different
      amounts of training data.  What do you observe?

   #) Now train your system using two distinct genres and experiment
      with generating text in the hybrid genre.  As before, discuss your
      observations.

#. |soso| **Exploring text genres:**
   Investigate the table of modal distributions and look for other patterns.
   Try to explain them in terms of your own impressionistic understanding
   of the different genres.  Can you find other closed classes of words that
   exhibit significant differences across different genres?

#. |hard| **Authorship identification:**
   Reproduce some of the results of [Zhao07]_.

#. |hard| **Gender-specific lexical choice:**
   Reproduce some of the results reported at ``http://www.clintoneast.com/articles/words.php``.

..
   ---------------
   What is a Word?
   ---------------

   Linguists have devoted considerable effort to distinguishing different
   ways in which the term `word`:dt: is used. So far we have focused on
   `orthographic words`:dt:, strings of characters that can be separated
   by various textual criteria such as the presence of whitespace.
   However, for linguists, the spoken word is often regarded as
   primary, in the sense that natural languages have all evolved
   as spoken mediums, and children learn to speak long before they can
   write and read (if indeed they ever do). Subjectively, we hear spoken
   utterances as a succession of words, but it is rarely the case that
   there are perceptible gaps between spoken words in conversational
   speech. Nevertheless, spoken words can and do occur in
   isolation. Using a variety of (sometimes language-dependent) criteria,
   certain phonological units are classed as `phonological words`, and
   these units need not correspond to orthographic words: an example is
   the utterance `wanna`:lx: as a contraction of the words `want`:lx: and
   `to`:lx:.  Returning to an example mentioned before, note that the
   form `n't`:lx: in `didn't`:lx: cannot form a phonological word by
   itself, and is sometimes called a `clitic`:dt: or `leaner`:dt:; it needs
   to combine with a "host" word before it can be uttered in normal speech.

   .. TODO: mention following distinction earlier!

   Independent of the distinction between spoken and written language,
   an important notion is that of a `lexeme`:dt: or `lexical item`:dt:. This
   corresponds broadly to the notion of a word that you might look up in
   a dictionary of English or some other language. For example, in order
   to find the meaning of the word `said`:lx:, you need to know first that
   it is a particular `grammatical form`:dt: of the lexeme `SAY`:lx:. (We adopt
   the standard convention of representing lexemes with upper-case
   forms.) Similarly,  `say`:lx:, `says`:lx: and `saying`:lx: are also
   grammatical forms of `SAY`:lx:. While `said`:lx:, `says`:lx: and `saying`:lx:
   are morphologically inflected, `say`:lx: lacks any morphological
   inflection and is therefore termed the `base`:dt: form. In English, the
   base form is conventionally used as the `lemma`:dt: (or `citation form`:dt:)
   for a word. It is the lemma that is chosen to represent the
   corresponding lexeme. (For example, the main entry for a lexical item
   will be listed under the word's lemma.)

   Many of the word-like forms that occur in text have received little
   attention from linguists but are nevertheless so prevalent that they
   need to be dealt with in many NLP applications. These include
   abbreviations such as `Dept.`:lx: or `Mr.`:lx: and acronyms such as `NATO`:lx:,
   `Interpol`:lx: or `SQL`:lx:. Also of interest are symbols such as $ (`dollar`:lx:)
   and @ (`at`:lx:). Unlike ordinary punctuation marks, these symbols
   (sometimes called `logograms`:dt:) stand for words, and would be spoken as
   such if the text were read aloud.

----------
Conclusion
----------

In this chapter we saw that we can do a variety of interesting
language processing tasks that focus solely on words.  Tokenization
turns out to be far more difficult than expected.  Other kinds of
tokenization, such as sentence tokenization, are left for the
exercises.  No single solution works well across the board, and we
must decide what counts as a token depending on the application
domain.  We also looked at normalization (including lemmatization) and
saw how it collapses distinctions between tokens.  In the next chapter
we will look at word classes and automatic tagging.
  
---------------
Further Reading
---------------

..
   Johns Hopkins Center for Language and Speech Processing, 1999
   Summer Workshop on Normalization of Non-Standard Words: Final Report
   http://www.clsp.jhu.edu/ws99/projects/normal/report.pdf

.. include:: footer.txt
