Bayesian Noise Reduction - Contextual Symmetry Logic
http://bnr.nuclearelephant.com
Copyright (c) 2004 Jonathan A. Zdziarski
v2.0

LICENSE
                                                                                
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
                                                                                
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
                                                                                
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.

ABOUT BAYESIAN NOISE REDUCTION

Modern language classification relies on machine learning, which in turn
depends heavily on the learning input it is given. Most of today's algorithms
(Bayes, Chi-Square, etc.) are inherently sound and accurate; however,
regardless of which algorithm is used, much of its accuracy depends directly
on the quality of the data provided: the Garbage In, Garbage Out principle.
Bayesian Noise Reduction is a statistical approach to evaluating coherence
using pattern consistency checking. BNR attempts to solve the problem commonly
referred to as "Bayesian Noise", which, in its simplest definition, refers to
irrelevant or incoherent data present in a message being classified. Bayesian
Noise Reduction removes this text in order to provide cleaner classification
and is implemented as a "pre-filter" to existing language classification
functions.

libbnr is an implementation of the Bayesian Noise Reduction (BNR) algorithm,
which I originally designed to counter directed attacks in spam. As Dr. John
Graham-Cumming illustrated at Spam Conference 2004, most statistical language
classifiers are quite resilient to random word attacks but fail miserably
against directed attacks, in which the spammer has mined intelligence about
the target user(s) and purposely injected text that is context-specific to the
target. Such text can fool spam filters into believing the message is
legitimate.

As it turned out, version 2.0 of the algorithm proved quite effective at
filtering out all types of noise from all types of text samples.
Whether you're writing a spam filter, document classifier, or performing some
type of Bayesian intrusion detection, the noise reduction library can help to
improve the quality of your classifications.

A full explanation of the algorithm can be found in my white paper at
http://bnr.nuclearelephant.com. In simple terms, the BNR algorithm uses
pattern consistency checking to identify inconsistent data. The library
requires two different sets of input from the implementor:

1. A stream of _ordered_ tokens (words, nGrams, etcetera) and their associated
   p-values (probabilities). This could be a message body or other input.

2. The probabilities of the patterns created by bnr_instantiate(). After this
   call, a set of patterns will have been instantiated. These patterns must
   also be tracked in the classifier (the white paper treats them like any
   other token), and their probabilities must be fed back into the noise
   reduction context.

Once both pieces of data have been provided, the noise reduction algorithm
will perform its analysis and provide an output stream of what's left over.
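
The data flow described above can be sketched in plain C. This is only an
illustration of the idea and does not call libbnr: the struct, the function
names, the window size of 3, the 0.05 rounding, and both thresholds are
assumptions made for the sketch. The white paper and example.c remain the
authority on the actual rules and interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WINDOW    3     /* assumed pattern width, for illustration only */
#define EX_RADIUS 0.25  /* assumed: patterns this close to 0.5 are ignored */
#define IN_RADIUS 0.30  /* assumed: largest tolerated token/pattern gap */

/* One element of the ordered input stream (input 1). */
struct tok {
    const char *text;   /* token as seen by the classifier */
    double p;           /* classifier's p-value for the token */
    int eliminated;     /* set to 1 when the token is judged to be noise */
};

/* Round a p-value in [0,1] to the nearest 0.05 so that similar values
   map onto the same pattern. */
static double band(double p)
{
    return (double)(int)(p * 20.0 + 0.5) / 20.0;
}

/* Name the pattern formed by the WINDOW tokens starting at index i.
   The classifier would track this name like any other token. */
static void pattern_name(const struct tok *s, size_t i, char *buf, size_t len)
{
    size_t k, used = 0;

    assert(len > 0);
    buf[0] = '\0';
    for (k = 0; k < WINDOW && used < len; k++) {
        int n = snprintf(buf + used, len - used, "%s%.2f",
                         k ? "_" : "", band(s[i + k].p));
        if (n < 0)
            break;
        used += (size_t)n;
    }
}

static double absdiff(double a, double b)
{
    return a > b ? a - b : b - a;
}

/* Apply a pattern's p-value (input 2, supplied by the classifier) to its
   window: if the pattern is strongly indicative, tokens that disagree
   with it are marked as noise. */
static void apply_pattern(struct tok *s, size_t i, double pattern_p)
{
    size_t k;

    if (absdiff(pattern_p, 0.5) <= EX_RADIUS)
        return;                       /* pattern carries no strong signal */
    for (k = 0; k < WINDOW; k++)
        if (absdiff(s[i + k].p, pattern_p) > IN_RADIUS)
            s[i + k].eliminated = 1;  /* inconsistent with its context */
}
```

In this sketch, a caller would slide the window over the stream, look up each
pattern name in its own classifier data, pass the resulting probability to
apply_pattern(), and then classify only the tokens not marked as eliminated.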

libbnr can be linked in with your classifier and called using the standard
C interface. An example has been provided (example.c) to show developers how
to integrate the tool properly.

One final note: if your classifier implements nGrams, it is usually best to
create a separate BNR context for each nGram size and process each set of
nGrams separately. One stream for single tokens, another for biGrams, and so
on will yield the best results.
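
As a rough sketch of that advice, the following C fragment splits one ordered
word list into a single-token stream and a biGram stream, each of which would
then be fed to its own BNR context. The struct, the function names, and the
fixed buffer sizes are hypothetical and are not part of libbnr.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOK 64  /* assumed fixed sizes, to keep the sketch short */
#define TOK_LEN 64

/* A hypothetical ordered stream for one nGram size. */
struct stream {
    char tok[MAX_TOK][TOK_LEN];
    size_t n;
};

/* Append a copy of text to the stream (truncated to TOK_LEN - 1 chars). */
static void stream_add(struct stream *s, const char *text)
{
    assert(s->n < MAX_TOK);
    snprintf(s->tok[s->n], TOK_LEN, "%s", text);
    s->n++;
}

/* Build the single-token stream and the biGram stream from the same
   ordered words, preserving the original order in both. */
static void split_streams(const char *const *words, size_t nwords,
                          struct stream *uni, struct stream *bi)
{
    char pair[TOK_LEN];
    size_t i;

    uni->n = 0;
    bi->n = 0;
    for (i = 0; i < nwords; i++) {
        stream_add(uni, words[i]);
        if (i + 1 < nwords) {
            snprintf(pair, sizeof pair, "%s %s", words[i], words[i + 1]);
            stream_add(bi, pair);
        }
    }
}
```

Keeping the streams apart means each context only ever forms patterns from
tokens of the same size.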

BUILDING

./configure && make && make install

LINKING

Compile your application with -lbnr

CODING

See example.c for more information

Jonathan A. Zdziarski
jonathan@nuclearelephant.com
