IN: machine-learning.data-sets
TUPLE: data-set data target target-names description
-feature-names ;
+ feature-names ;
C: <data-set> data-set
"resource:extra/machine-learning/data-sets/" prepend
utf8 file-contents ;
+: load-tabular-file ( name -- lines )
+ load-file [ blank? ] trim string-lines
+ [ [ blank? ] split-when harvest ] map harvest ;
+
: numerify ( table -- data names )
unclip [ [ [ string>number ] map ] map ] dip ;
: load-table ( name -- data names )
- load-file [ blank? ] trim string-lines
- [ [ blank? ] split-when ] map numerify ;
+ load-tabular-file numerify ;
: load-table-csv ( name -- data names )
load-file string>csv numerify ;
PRIVATE>
+: load-monks ( name -- data-set )
+ load-tabular-file
+ ! Drop the trailing instance identifier column; it is not a feature.
+ [ but-last [ string>number ] map ] map
+ [ [ rest ] map ] [ [ first ] map ] bi
+ { "no" "yes" }
+ "monks.names" load-file
+ { "a1" "a2" "a3" "a4" "a5" "a6" } <data-set> ;
+
: load-iris ( -- data-set )
"iris.csv" load-table-csv
[ [ unclip-last ] { } map>assoc unzip ] [ 2 tail ] bi*
--- /dev/null
+1. Title: The Monk's Problems
+
+2. Sources:
+ (a) Donor: Sebastian Thrun
+ School of Computer Science
+ Carnegie Mellon University
+ Pittsburgh, PA 15213, USA
+
+ E-mail: thrun@cs.cmu.edu
+
+ (b) Date: October 1992
+
+3. Past Usage:
+
+ - See File: thrun.comparison.ps.Z
+
+ - Wnek, J., "Hypothesis-driven Constructive Induction," PhD dissertation,
+ School of Information Technology and Engineering, Reports of Machine
+ Learning and Inference Laboratory, MLI 93-2, Center for Artificial
+ Intelligence, George Mason University, March 1993.
+
+ - Wnek, J. and Michalski, R.S., "Comparing Symbolic and
+ Subsymbolic Learning: Three Studies," in Machine Learning: A
+ Multistrategy Approach, Vol. 4, R.S. Michalski and G. Tecuci (Eds.),
+ Morgan Kaufmann, San Mateo, CA, 1993.
+
+4. Relevant Information:
+
+ The MONK's problems were the basis of a first international comparison
+ of learning algorithms. The result of this comparison is summarized in
+ "The MONK's Problems - A Performance Comparison of Different Learning
+ Algorithms" by S.B. Thrun, J. Bala, E. Bloedorn, I. Bratko, B.
+ Cestnik, J. Cheng, K. De Jong, S. Dzeroski, S.E. Fahlman, D. Fisher,
+ R. Hamann, K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R.S.
+ Michalski, T. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de
+ Velde, W. Wenzel, J. Wnek, and J. Zhang, published as Technical
+ Report CMU-CS-91-197, Carnegie Mellon University, in Dec. 1991.
+
+ One significant characteristic of this comparison is that it was
+ performed by a collection of researchers, each of whom was an advocate
+ of the technique they tested (often they were the creators of the
+ various methods). In this sense, the results are less biased than in
+ comparisons performed by a single person advocating a specific
+ learning method, and more accurately reflect the generalization
+ behavior of the learning techniques as applied by knowledgeable users.
+
+ There are three MONK's problems. The domains for all MONK's problems
+ are the same (described below). One of the MONK's problems has noise
+ added. For each problem, the domain has been partitioned into a
+ training set and a test set.
+
+5. Number of Instances: 432
+
+6. Number of Attributes: 8 (including class attribute)
+
+7. Attribute information:
+ 1. class: 0, 1
+ 2. a1: 1, 2, 3
+ 3. a2: 1, 2, 3
+ 4. a3: 1, 2
+ 5. a4: 1, 2, 3
+ 6. a5: 1, 2, 3, 4
+ 7. a6: 1, 2
+ 8. Id: (A unique symbol for each instance)
+
+8. Missing Attribute Values: None
+
+9. Target Concepts associated with the MONK's problems:
+
+ MONK-1: (a1 = a2) or (a5 = 1)
+
+ MONK-2: EXACTLY TWO of {a1 = 1, a2 = 1, a3 = 1, a4 = 1, a5 = 1, a6 = 1}
+
+ MONK-3: (a5 = 3 and a4 = 1) or (a5 /= 4 and a2 /= 3)
+ (5% class noise added to the training set)