DNA {mlbench}    R Documentation

Primate splice-junction gene sequences (DNA)

Description

The dataset consists of 3,186 data points (splice junctions), each described by 180 binary indicator variables. The problem is to recognize the 3 classes (ei, ie, neither), i.e., the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out).

The StatLog dna dataset is a processed version of the Irvine database described below. The main difference is that each symbolic variable representing a nucleotide (only A, G, T, C) was replaced by 3 binary indicator variables, so the original 60 symbolic attributes became 180 binary attributes. The names of the examples were removed. The examples with ambiguities were also removed (there were very few of them, 4). The StatLog version of this dataset was produced by Ross King at Strathclyde University. For original details see the Irvine database documentation.

The nucleotides A,C,G,T were given indicator values as follows:

A -> 1 0 0
C -> 0 1 0
G -> 0 0 1
T -> 0 0 0
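
As an illustration of this encoding, the following base R sketch maps a nucleotide string to its indicator vector. The helper encode_sequence() and the lookup matrix nucleotide_codes are hypothetical names introduced here; they are not part of mlbench, and the code is not necessarily how the StatLog version was produced.

```r
# Indicator codes copied from the table above; T is the all-zero level.
nucleotide_codes <- rbind(
  A = c(1L, 0L, 0L),
  C = c(0L, 1L, 0L),
  G = c(0L, 0L, 1L),
  T = c(0L, 0L, 0L)
)

# encode_sequence() is a hypothetical helper, not an mlbench function.
# It turns a string over A, C, G, T into a flat 0/1 vector, 3 values per base.
encode_sequence <- function(seq) {
  bases <- strsplit(toupper(seq), "", fixed = TRUE)[[1]]
  unknown <- setdiff(bases, rownames(nucleotide_codes))
  if (length(unknown) > 0) {
    # Mirrors the dataset's choice to exclude ambiguous examples
    stop("ambiguous or unknown nucleotides: ", paste(unknown, collapse = ", "))
  }
  # One row per base; transposing before flattening keeps each triple contiguous
  as.vector(t(nucleotide_codes[bases, , drop = FALSE]))
}

encode_sequence("ACGT")
```

Under this scheme a 60-base sequence yields 60 x 3 = 180 indicator values, which matches the attribute count given above.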

Hint. Much better performance is generally observed if attributes closest to the junction are used. In the StatLog version, this means using attributes A61 to A120 only.
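
A minimal sketch of applying this hint is given here. It assumes that the 180 predictors are the first 180 columns of DNA in StatLog order and that the class label is the last column; columns are selected by position because the column names in the R data frame may not match the StatLog names A1 to A180. The names junction_attrs and DNA_junction are introduced only for illustration.

```r
# Attribute numbers named in the hint (StatLog numbering)
junction_attrs <- 61:120

# Each base contributes 3 consecutive indicators,
# so attribute k belongs to base ceiling(k / 3)
range(ceiling(junction_attrs / 3))

# Assumption: predictors occupy columns 1..180 and the class label is the last column.
# The block is skipped when mlbench is not installed.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("DNA", package = "mlbench")
  class_col <- ncol(DNA)
  DNA_junction <- DNA[, c(junction_attrs, class_col)]
  dim(DNA_junction)
}
```

By the arithmetic in the sketch, attributes 61 to 120 correspond to bases 21 to 40 of the 60-base window.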

Usage

data("DNA", package = "mlbench")

Format

A data frame with 3,186 observations on 180 binary indicator variables (all nominal) and a target class.

Source

These data have been taken from:

and were converted to R format by Evgenia Dimitriadou.

References

Noordewier MO, Towell GG, Shavlik JW (1990). “Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences.” In Proceedings of the 4th International Conference on Neural Information Processing Systems, series NIPS'90, 530–536. ISBN 1558601848.

Towell GG (1991). Symbolic Knowledge and Neural Networks: Insertion, Refinement, and Extraction. Ph.D. thesis, University of Wisconsin Madison.

Towell GG, Craven MW, Shavlik JW (1991). “Constructive Induction in Knowledge-Based Neural Networks.” In Birnbaum LA, Collins GC (eds.), Machine Learning Proceedings 1991, 213–217. Morgan Kaufmann, San Francisco (CA). ISBN 978-1-55860-200-7. doi:10.1016/B978-1-55860-200-7.50046-5.

Towell GG, Shavlik JW (1991). “Interpretation of Artificial Neural Networks: Mapping Knowledge-Based Neural Networks into Rules.” In Proceedings of the 5th International Conference on Neural Information Processing Systems, series NIPS'91, 977–984. ISBN 1558602224.

Examples

data("DNA", package = "mlbench")
summary(DNA)

[Package mlbench version 2.1-7 Index]