org.gcube.dataanalysis.ecoengine.clustering
Class XMeans

java.lang.Object
  extended by weka.clusterers.AbstractClusterer
      extended by weka.clusterers.RandomizableClusterer
          extended by org.gcube.dataanalysis.ecoengine.clustering.XMeans
All Implemented Interfaces:
Serializable, Cloneable, weka.clusterers.Clusterer, weka.core.CapabilitiesHandler, weka.core.OptionHandler, weka.core.Randomizable, weka.core.RevisionHandler, weka.core.TechnicalInformationHandler

public class XMeans
extends weka.clusterers.RandomizableClusterer
implements weka.core.TechnicalInformationHandler

Cluster data using the X-means algorithm.

X-Means is K-Means extended by an Improve-Structure part. In this part of the algorithm, each center is tentatively split within its region. The decision between keeping a center and replacing it with its children is made by comparing the BIC values of the two structures.

For more information see:

Dan Pelleg, Andrew W. Moore: X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In: Seventeenth International Conference on Machine Learning, 727-734, 2000.

BibTeX:

 @inproceedings{Pelleg2000,
    author = {Dan Pelleg and Andrew W. Moore},
    booktitle = {Seventeenth International Conference on Machine Learning},
    pages = {727-734},
    publisher = {Morgan Kaufmann},
    title = {X-means: Extending K-means with Efficient Estimation of the Number of Clusters},
    year = {2000}
 }
 

Valid options are:

 -I <num>
  maximum number of overall iterations
  (default 1).
 -M <num>
  maximum number of iterations in the kMeans loop in
  the Improve-Parameter part 
  (default 1000).
 -J <num>
  maximum number of iterations in the kMeans loop
  for the split centroids in the Improve-Structure part 
  (default 1000).
 -L <num>
  minimum number of clusters
  (default 2).
 -H <num>
  maximum number of clusters
  (default 4).
 -B <value>
  distance value for binary attributes
  (default 1.0).
 -use-kdtree
  Uses the KDTree internally
  (default no).
 -K <KDTree class specification>
  Full class name of KDTree class to use, followed
  by scheme options.
  eg: "weka.core.neighboursearch.kdtrees.KDTree -P"
  (default no KDTree class used).
 -C <value>
  cutoff factor, takes the given percentage of the split 
  centroids if none of the children win
  (default 0.0).
 -D <distance function class specification>
  Full class name of Distance function class to use, followed
  by scheme options.
  (default weka.core.EuclideanDistance).
 -N <file name>
  file to read starting centers from (ARFF format).
 -O <file name>
  file to write centers to (ARFF format).
 -U <int>
  The debug level.
  (default 0)
 -Y <file name>
  The debug vectors file.
 -S <num>
  Random number seed.
  (default 10)
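
As a rough illustration of how the flags above combine, the sketch below assembles an option array of the kind accepted by setOptions(String[]). The class XMeansOptionsSketch and its buildOptions method are hypothetical names used only for this example; the call that would hand the array to the clusterer is shown as a comment because it needs Weka on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of XMeans: builds an option array from the documented flags. */
final class XMeansOptionsSketch {

    /** Builds options for -L (min clusters), -H (max clusters), -S (seed) and, optionally, -use-kdtree. */
    static String[] buildOptions(int minClusters, int maxClusters, int seed, boolean useKdTree) {
        if (minClusters > maxClusters) {
            throw new IllegalArgumentException("minClusters must not exceed maxClusters");
        }
        List<String> opts = new ArrayList<>();
        opts.add("-L");
        opts.add(Integer.toString(minClusters));
        opts.add("-H");
        opts.add(Integer.toString(maxClusters));
        opts.add("-S");
        opts.add(Integer.toString(seed));
        if (useKdTree) {
            opts.add("-use-kdtree");
        }
        return opts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] opts = buildOptions(2, 6, 10, false);
        System.out.println(String.join(" ", opts));
        // With Weka available, the array would be passed on like this:
        // XMeans xmeans = new XMeans();
        // xmeans.setOptions(opts);
    }
}
```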

Version:
$Revision: 8109 $
Author:
Gabi Schmidberger (gabi@cs.waikato.ac.nz), Mark Hall (mhall@cs.waikato.ac.nz), Malcolm Ware (mfw4@cs.waikato.ac.nz)
See Also:
RandomizableClusterer, Serialized Form

Field Summary
static int D_CONVCHCLOSER
          have a closer look at converge children.
static int D_CURR
          for current debug.
static int D_FOLLOWSPLIT
          follows the splitting of the centers.
static int D_GENERAL
          general debugging.
static int D_ITERCOUNT
          follow iterations.
static int D_KDTREE
          check on kdtree.
static int D_METH_MISUSE
          functions were maybe misused.
static int D_PRINTCENTERS
          print the centers.
static int D_RANDOMVECTOR
          check on random vectors.
protected  double m_Bic
          BIC-Score of the current model.
protected  double m_BinValue
          Distance value between true and false of binary attributes and "same" and "different" of nominal attributes (default = 1.0).
protected  Reader m_CenterInput
          input file for the cluster centers.
protected  PrintWriter m_CenterOutput
          output file for the cluster centers.
protected  int[] m_ClusterAssignments
          temporary variable holding cluster assignments while iterating.
protected  weka.core.Instances m_ClusterCenters
          cluster centers.
 boolean m_CurrDebugFlag
          Flag: I'm debugging.
protected  double m_CutOffFactor
          cutoff factor - percentage of splits done in the Improve-Structure part; only relevant if all children lost.
protected  int m_DebugLevel
          level of debug output, 0 is no output.
protected  weka.core.Instances m_DebugVectors
          all the debug vectors.
protected  File m_DebugVectorsFile
          file name of the input file for the random vectors.
protected  int m_DebugVectorsIndex
          the index for the current debug vector.
protected  Reader m_DebugVectorsInput
          input file for the random vectors --> USED FOR DEBUGGING.
protected  weka.core.DistanceFunction m_DistanceF
          the distance function used.
protected  File m_InputCenterFile
          file name of the input file for the cluster centers.
protected  weka.core.Instances m_Instances
          training instances.
protected  int m_IterationCount
          counts iterations done in main loop.
protected  weka.core.neighboursearch.KDTree m_KDTree
          KDTrees class if KDTrees are used.
protected  int m_KMeansStopped
          counter to say how often kMeans was stopped by loop counter.
protected  int m_MaxIterations
          maximum overall iterations.
protected  int m_MaxKMeans
          maximum iterations to perform in the KMeans part; if negative, iterations are not checked.
protected  int m_MaxKMeansForChildren
          see above, but for kMeans of split clusters.
protected  int m_MaxNumClusters
          max number of clusters to generate.
protected  int m_MinNumClusters
          min number of clusters to generate.
protected  double[] m_Mle
          Distortion.
protected  weka.core.Instances m_Model
          model information, should increase readability.
protected  int m_NumClusters
          The actual number of clusters.
protected  int m_NumSplits
          Number of splits prepared.
protected  int m_NumSplitsDone
          Number of splits accepted (including cutoff factor decisions).
protected  int m_NumSplitsStillDone
          Number of splits accepted just because of cutoff factor.
protected  File m_OutputCenterFile
          file name of the output file for the cluster centers.
protected  weka.filters.unsupervised.attribute.ReplaceMissingValues m_ReplaceMissingFilter
          replace missing values in training instances.
protected  boolean m_UseKDTree
          whether to use the KDTree (the KDTree is only initialized to be configurable from the GUI).
static int R_HIGH
          Index in ranges for HIGH.
static int R_LOW
          Index in ranges for LOW.
static int R_WIDTH
          Index in ranges for WIDTH.
 
Fields inherited from class weka.clusterers.RandomizableClusterer
m_Seed, m_SeedDefault
 
Constructor Summary
XMeans()
          the default constructor.
 
Method Summary
protected  boolean assignToCenters(weka.core.Instances centers, int[][] instOfCent, int[] allInstList, int[] assignments)
          Assign instances to centers.
protected  boolean assignToCenters(weka.core.neighboursearch.KDTree kdtree, weka.core.Instances centers, int[][] instOfCent, int[] assignments, int iterationCount)
          Assign instances to centers using KDtree.
protected  boolean assignToCenters(weka.core.neighboursearch.KDTree tree, weka.core.Instances centers, int[][] instOfCent, int[] allInstList, int[] assignments, int iterationCount)
          Assigns instances to centers.
 String binValueTipText()
          Returns the tip text for this property.
 void buildClusterer(weka.core.Instances data)
          Generates the X-Means clusterer.
protected  double calculateBIC(int[][] instOfCent, weka.core.Instances centers, double[] mle)
          Calculates the BIC for the given set of centers and instances.
protected  double calculateBIC(int[] instList, weka.core.Instance center, double mle, weka.core.Instances model)
          Returns the BIC-value for the given center and instances.
 boolean checkForNominalAttributes(weka.core.Instances data)
          Checks for nominal attributes in the dataset.
protected  void checkInstances()
          Checks the instances.
 int clusterInstance(weka.core.Instance instance)
          Assigns a given instance to a cluster.
protected  int clusterProcessedInstance(weka.core.Instance instance)
          Clusters an instance that has been through the filters.
protected  int clusterProcessedInstance(weka.core.Instance instance, weka.core.Instances centers)
          Clusters an instance.
 String cutOffFactorTipText()
          Returns the tip text for this property.
 String debugLevelTipText()
          Returns the tip text for this property.
 String debugVectorsFileTipText()
          Returns the tip text for this property.
 String distanceFTipText()
          Returns the tip text for this property.
protected  double[] distortion(int[][] instOfCent, weka.core.Instances centers)
          Calculates the maximum likelihood estimate for the variance.
 double getBinValue()
          Gets the distance value between true and false of binary attributes.
 weka.core.Capabilities getCapabilities()
          Returns default capabilities of the clusterer.
 weka.core.Instances getClusterCenters()
          Return the centers of the clusters as an Instances object
 double getCutOffFactor()
          Gets the cutoff factor.
 int getDebugLevel()
          Gets the debug level.
 File getDebugVectorsFile()
          Gets the file name for a file that has the random vectors stored.
 weka.core.DistanceFunction getDistanceF()
          Gets the distance function.
protected  String getDistanceFSpec()
          Gets the distance function specification string, which contains the class name of the distance function class and any options to it.
 File getInputCenterFile()
          Gets the file to read the list of centers from.
 weka.core.neighboursearch.KDTree getKDTree()
          Gets the KDTree class.
protected  String getKDTreeSpec()
          Gets the KDTree specification string, which contains the class name of the KDTree class and any options to the KDTree.
 int getMaxIterations()
          Gets the maximum number of iterations.
 int getMaxKMeans()
          Gets the maximum number of iterations in KMeans.
 int getMaxKMeansForChildren()
          Gets the maximum number of KMeans iterations performed on the child centers.
 int getMaxNumClusters()
          Gets the maximum number of clusters to generate.
 int getMinNumClusters()
          Gets the minimum number of clusters to generate.
 weka.core.Instance getNextDebugVectorsInstance(weka.core.Instances model)
          Read an instance from debug vectors file.
 String[] getOptions()
          Gets the current settings of XMeans.
 File getOutputCenterFile()
          Gets the file to write the list of centers to.
 String getRevision()
          Returns the revision string.
 weka.core.TechnicalInformation getTechnicalInformation()
          Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
 boolean getUseKDTree()
          Gets whether the KDTree is used or not.
 String globalInfo()
          Returns a string describing this clusterer.
protected  int[] initAssignments(int numInstances)
          Creates and initializes integer array, used to store assignments.
protected  int[] initAssignments(int[] ass)
          Set array of int, used to store assignments, to -1.
 void initDebugVectorsInput()
          Initialises the debug vector input.
 String inputCenterFileTipText()
          Returns the tip text for this property.
 String KDTreeTipText()
          Returns the tip text for this property.
 Enumeration listOptions()
          Returns an enumeration describing the available options.
protected  double logLikelihoodEstimate(int numInst, weka.core.Instance center, double distortion, int numCent)
          Calculates the log-likelihood of the data for the given model, taken at the maximum likelihood point.
static void main(String[] argv)
          Main method for testing this class.
protected  weka.core.Instances makeCentersRandomly(Random random0, weka.core.Instances model, int numClusters)
          Generates new centers randomly.
 String maxIterationsTipText()
          Returns the tip text for this property.
 String maxKMeansForChildrenTipText()
          Returns the tip text for this property.
 String maxKMeansTipText()
          Returns the tip text for this property.
 String maxNumClustersTipText()
          Returns the tip text for this property.
protected  double meanOrMode(weka.core.Instances instances, int[] instList, int attIndex)
          Computes Mean Or Mode of one attribute on a subset of m_Instances.
 String minNumClustersTipText()
          Returns the tip text for this property.
protected  weka.core.Instances newCentersAfterSplit(boolean[] splitWon, weka.core.Instances splitCenters)
          Returns new centers.
protected  weka.core.Instances newCentersAfterSplit(double[] pbic, double[] cbic, double cutoffFactor, weka.core.Instances splitCenters)
          Returns new center list.
protected  int nextAssignedOne(int cent, int lastIndex, int[] assignments)
          Searches along the assignment array for the next entry of the center in question.
 int numberOfClusters()
          Returns the number of clusters.
 String outputCenterFileTipText()
          Returns the tip text for this property.
protected  void PFD_CURR(String output)
          Does debug printouts.
protected  void PFD(int debugLevel, String output)
          Does debug printouts.
protected  void PrCentersFD(int debugLevel)
          Print centers for debug.
protected  boolean recomputeCenters(weka.core.Instances centers, int[][] instOfCent, weka.core.Instances model)
          Recompute the new centers.
protected  void recomputeCentersFast(weka.core.Instances centers, int[][] instOfCentIndexes, weka.core.Instances model)
          Recompute the new centers - 2nd version. Same as recomputeCenters, but does not check if center stays the same.
 void setBinValue(double value)
          Sets the distance value between true and false of binary attributes.
 void setCutOffFactor(double i)
          Sets a new cutoff factor.
 void setDebugLevel(int d)
          Sets the debug level.
 void setDebugVectorsFile(File value)
          Sets the file that has the random vectors stored.
 void setDistanceF(weka.core.DistanceFunction distanceF)
          Sets the distance function.
 void setInputCenterFile(File value)
          Sets the file to read the list of centers from.
 void setKDTree(weka.core.neighboursearch.KDTree k)
          Sets the KDTree class.
 void setMaxIterations(int i)
          Sets the maximum number of iterations to perform.
 void setMaxKMeans(int i)
          Set the maximum number of iterations to perform in KMeans.
 void setMaxKMeansForChildren(int i)
          Sets the maximum number of iterations KMeans that is performed on the child centers.
 void setMaxNumClusters(int n)
          Sets the maximum number of clusters to generate.
 void setMinNumClusters(int n)
          Sets the minimum number of clusters to generate.
 void setOptions(String[] options)
          Parses a given list of options.
 void setOutputCenterFile(File value)
          Sets file to write the list of centers to.
 void setUseKDTree(boolean value)
          Sets whether to use the KDTree or not.
protected  weka.core.Instances splitCenter(Random random, weka.core.Instance center, double variance, weka.core.Instances model)
          Split centers in their region.
protected  weka.core.Instances splitCenters(Random random, weka.core.Instances instances, weka.core.Instances model)
          Split centers in their region.
protected  boolean stopIteration(int iterationCount, int max)
          Checks if iterationCount has to be checked and if yes (this means max is > 0) compares it with max.
protected  boolean stopKMeansIteration(int iterationCount, int max)
          Controls that counter does not exceed max iteration value.
protected  boolean TFD(int debugLevel)
          Tests on debug status.
 String toString()
          Return a string describing this clusterer.
 String useKDTreeTipText()
          Returns the tip text for this property.
 
Methods inherited from class weka.clusterers.RandomizableClusterer
getSeed, seedTipText, setSeed
 
Methods inherited from class weka.clusterers.AbstractClusterer
distributionForInstance, forName, makeCopies, makeCopy, runClusterer
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

m_Instances

protected weka.core.Instances m_Instances
training instances.


m_Model

protected weka.core.Instances m_Model
model information, should increase readability.


m_ReplaceMissingFilter

protected weka.filters.unsupervised.attribute.ReplaceMissingValues m_ReplaceMissingFilter
replace missing values in training instances.


m_BinValue

protected double m_BinValue
Distance value between true and false of binary attributes and "same" and "different" of nominal attributes (default = 1.0).


m_Bic

protected double m_Bic
BIC-Score of the current model.


m_Mle

protected double[] m_Mle
Distortion.


m_MaxIterations

protected int m_MaxIterations
maximum overall iterations.


m_MaxKMeans

protected int m_MaxKMeans
maximum iterations to perform in the KMeans part; if negative, iterations are not checked.


m_MaxKMeansForChildren

protected int m_MaxKMeansForChildren
see above, but for kMeans of split clusters.


m_NumClusters

protected int m_NumClusters
The actual number of clusters.


m_MinNumClusters

protected int m_MinNumClusters
min number of clusters to generate.


m_MaxNumClusters

protected int m_MaxNumClusters
max number of clusters to generate.


m_DistanceF

protected weka.core.DistanceFunction m_DistanceF
the distance function used.


m_ClusterCenters

protected weka.core.Instances m_ClusterCenters
cluster centers.


m_InputCenterFile

protected File m_InputCenterFile
file name of the input file for the cluster centers.


m_DebugVectorsInput

protected Reader m_DebugVectorsInput
input file for the random vectors --> USED FOR DEBUGGING.


m_DebugVectorsIndex

protected int m_DebugVectorsIndex
the index for the current debug vector.


m_DebugVectors

protected weka.core.Instances m_DebugVectors
all the debug vectors.


m_DebugVectorsFile

protected File m_DebugVectorsFile
file name of the input file for the random vectors.


m_CenterInput

protected Reader m_CenterInput
input file for the cluster centers.


m_OutputCenterFile

protected File m_OutputCenterFile
file name of the output file for the cluster centers.


m_CenterOutput

protected PrintWriter m_CenterOutput
output file for the cluster centers.


m_ClusterAssignments

protected int[] m_ClusterAssignments
temporary variable holding cluster assignments while iterating.


m_CutOffFactor

protected double m_CutOffFactor
cutoff factor - percentage of splits done in the Improve-Structure part; only relevant if all children lost.


R_LOW

public static int R_LOW
Index in ranges for LOW.


R_HIGH

public static int R_HIGH
Index in ranges for HIGH.


R_WIDTH

public static int R_WIDTH
Index in ranges for WIDTH.


m_KDTree

protected weka.core.neighboursearch.KDTree m_KDTree
KDTrees class if KDTrees are used.


m_UseKDTree

protected boolean m_UseKDTree
whether to use the KDTree (the KDTree is only initialized to be configurable from the GUI).


m_IterationCount

protected int m_IterationCount
counts iterations done in main loop.


m_KMeansStopped

protected int m_KMeansStopped
counter to say how often kMeans was stopped by loop counter.


m_NumSplits

protected int m_NumSplits
Number of splits prepared.


m_NumSplitsDone

protected int m_NumSplitsDone
Number of splits accepted (including cutoff factor decisions).


m_NumSplitsStillDone

protected int m_NumSplitsStillDone
Number of splits accepted just because of cutoff factor.


m_DebugLevel

protected int m_DebugLevel
level of debug output, 0 is no output.


D_PRINTCENTERS

public static int D_PRINTCENTERS
print the centers.


D_FOLLOWSPLIT

public static int D_FOLLOWSPLIT
follows the splitting of the centers.


D_CONVCHCLOSER

public static int D_CONVCHCLOSER
have a closer look at converge children.


D_RANDOMVECTOR

public static int D_RANDOMVECTOR
check on random vectors.


D_KDTREE

public static int D_KDTREE
check on kdtree.


D_ITERCOUNT

public static int D_ITERCOUNT
follow iterations.


D_METH_MISUSE

public static int D_METH_MISUSE
functions were maybe misused.


D_CURR

public static int D_CURR
for current debug.


D_GENERAL

public static int D_GENERAL
general debugging.


m_CurrDebugFlag

public boolean m_CurrDebugFlag
Flag: I'm debugging.

Constructor Detail

XMeans

public XMeans()
the default constructor.

Method Detail

globalInfo

public String globalInfo()
Returns a string describing this clusterer.

Returns:
a description of the clusterer suitable for displaying in the explorer/experimenter GUI

getTechnicalInformation

public weka.core.TechnicalInformation getTechnicalInformation()
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.

Specified by:
getTechnicalInformation in interface weka.core.TechnicalInformationHandler
Returns:
the technical information about this class

getCapabilities

public weka.core.Capabilities getCapabilities()
Returns default capabilities of the clusterer.

Specified by:
getCapabilities in interface weka.clusterers.Clusterer
Specified by:
getCapabilities in interface weka.core.CapabilitiesHandler
Overrides:
getCapabilities in class weka.clusterers.AbstractClusterer
Returns:
the capabilities of this clusterer

buildClusterer

public void buildClusterer(weka.core.Instances data)
                    throws Exception
Generates the X-Means clusterer.

Specified by:
buildClusterer in interface weka.clusterers.Clusterer
Specified by:
buildClusterer in class weka.clusterers.AbstractClusterer
Parameters:
data - set of instances serving as training data
Throws:
Exception - if the clusterer has not been generated successfully

checkForNominalAttributes

public boolean checkForNominalAttributes(weka.core.Instances data)
Checks for nominal attributes in the dataset. Class attribute is ignored.

Parameters:
data - the data to check
Returns:
false if no nominal attributes are present

initAssignments

protected int[] initAssignments(int[] ass)
Set array of int, used to store assignments, to -1.

Parameters:
ass - integer array used for storing assignments
Returns:
integer array used for storing assignments

initAssignments

protected int[] initAssignments(int numInstances)
Creates and initializes integer array, used to store assignments.

Parameters:
numInstances - length of array used for assignments
Returns:
integer array used for storing assignments
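
The two overloads above can be pictured with a small stand-alone sketch. InitAssignmentsSketch is a hypothetical class written for this example, not the code of XMeans; it assumes that the int overload also starts from -1, which the description of that overload does not state explicitly.

```java
import java.util.Arrays;

/** Hypothetical stand-alone sketch of the two initAssignments overloads. */
final class InitAssignmentsSketch {

    /** Resets every entry to -1 (read here as "not yet assigned") and returns the same array. */
    static int[] initAssignments(int[] ass) {
        Arrays.fill(ass, -1);
        return ass;
    }

    /** Creates an array of the given length; assumed to start out filled with -1 as well. */
    static int[] initAssignments(int numInstances) {
        return initAssignments(new int[numInstances]);
    }
}
```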

newCentersAfterSplit

protected weka.core.Instances newCentersAfterSplit(double[] pbic,
                                                   double[] cbic,
                                                   double cutoffFactor,
                                                   weka.core.Instances splitCenters)
Returns new center list. The following steps 1. and 2. both take care that the number of centers does not exceed maxCenters.
1. Compares the BIC values of parent and children and takes as new centers whichever wins (= BIC-value is smaller).
2. If none of the children are chosen in step 1 and the cutoff factor is > 0, the cutoff factor is taken as the percentage of "best" centers that are still taken.

Parameters:
pbic - array of parents BIC-values
cbic - array of childrens BIC-values
cutoffFactor - cutoff factor
splitCenters - all children
Returns:
the new centers

newCentersAfterSplit

protected weka.core.Instances newCentersAfterSplit(boolean[] splitWon,
                                                   weka.core.Instances splitCenters)
Returns new centers. Depending on splitWon: if true takes children, if false takes parent = current center.

Parameters:
splitWon - array of boolean to indicate to take split or not
splitCenters - list of splitted centers
Returns:
the new centers
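
The splitWon overload can be sketched as follows. This is a minimal illustration with plain arrays instead of weka.core.Instances; the class NewCentersSketch is hypothetical, the parents are passed in explicitly (the documented method takes only splitWon and splitCenters), and the layout of the children as consecutive pairs is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: choose between each parent center and its two children. */
final class NewCentersSketch {

    /**
     * parents[i] is current center i; children[2*i] and children[2*i + 1] are its
     * two split candidates (this pairing is an assumption of the sketch).
     */
    static double[][] newCentersAfterSplit(boolean[] splitWon, double[][] parents, double[][] children) {
        List<double[]> result = new ArrayList<>();
        for (int i = 0; i < parents.length; i++) {
            if (splitWon[i]) {
                // the split won: both children replace the parent
                result.add(children[2 * i]);
                result.add(children[2 * i + 1]);
            } else {
                // the split lost: the parent stays as it is
                result.add(parents[i]);
            }
        }
        return result.toArray(new double[0][]);
    }
}
```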

stopKMeansIteration

protected boolean stopKMeansIteration(int iterationCount,
                                      int max)
Controls that counter does not exceed max iteration value. Special function for kmeans iterations.

Parameters:
iterationCount - current value of counter
max - maximum value for counter
Returns:
true if iteration should be stopped

stopIteration

protected boolean stopIteration(int iterationCount,
                                int max)
Checks if iterationCount has to be checked and if yes (this means max is > 0) compares it with max.

Parameters:
iterationCount - the current iteration count
max - the maximum number of iterations
Returns:
true if maximum has been reached
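
Read literally, the check described above amounts to a guarded comparison. The sketch below is a hypothetical stand-alone version; whether the real method compares with >= or with > is not stated here, so the >= is an assumption.

```java
/** Hypothetical sketch of the iteration limit check. */
final class StopIterationSketch {

    /** Returns true once iterationCount has reached max; a non-positive max disables the check. */
    static boolean stopIteration(int iterationCount, int max) {
        boolean checkEnabled = max > 0;
        return checkEnabled && iterationCount >= max; // ">=" is an assumption of this sketch
    }
}
```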

recomputeCenters

protected boolean recomputeCenters(weka.core.Instances centers,
                                   int[][] instOfCent,
                                   weka.core.Instances model)
Recompute the new centers. New cluster center is center of mass of its instances. Returns true if cluster stays the same.

Parameters:
centers - the input and output centers
instOfCent - the instances to the centers
model - data model information
Returns:
true if converged.
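
The center-of-mass update can be illustrated with plain arrays. RecomputeCentersSketch is a hypothetical class for numeric attributes only; the extra data parameter stands in for the training instances, and keeping the old center for an empty cluster is a choice made for this sketch, not documented behaviour.

```java
/** Hypothetical sketch: move each center to the mean of its assigned rows. */
final class RecomputeCentersSketch {

    /**
     * data[i] is instance i; instOfCent[c] lists the row indices assigned to center c.
     * Updates centers in place and returns true if no coordinate changed.
     */
    static boolean recomputeCenters(double[][] centers, int[][] instOfCent, double[][] data) {
        boolean converged = true;
        for (int c = 0; c < centers.length; c++) {
            int[] members = instOfCent[c];
            if (members.length == 0) {
                continue; // sketch choice: an empty cluster keeps its old center
            }
            for (int a = 0; a < centers[c].length; a++) {
                double sum = 0.0;
                for (int idx : members) {
                    sum += data[idx][a];
                }
                double mean = sum / members.length;
                if (mean != centers[c][a]) {
                    converged = false;
                    centers[c][a] = mean;
                }
            }
        }
        return converged;
    }
}
```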

recomputeCentersFast

protected void recomputeCentersFast(weka.core.Instances centers,
                                    int[][] instOfCentIndexes,
                                    weka.core.Instances model)
Recompute the new centers - 2nd version. Same as recomputeCenters, but does not check if center stays the same.

Parameters:
centers - the input center and output centers
instOfCentIndexes - the indexes of the instances to the centers
model - data model information

meanOrMode

protected double meanOrMode(weka.core.Instances instances,
                            int[] instList,
                            int attIndex)
Computes Mean Or Mode of one attribute on a subset of m_Instances. The subset is defined by an index list.

Parameters:
instances - all instances
instList - the indexes of the instances the mean is computed from
attIndex - the index of the attribute
Returns:
mean value
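
A minimal sketch of a mean-or-mode computation over an index list is shown below. MeanOrModeSketch is hypothetical; the nominal flag is an assumption that replaces the attribute type lookup, and nominal values are assumed to be stored as doubles.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: mean of a numeric column or mode of a nominal column over a subset of rows. */
final class MeanOrModeSketch {

    /** instList selects the rows; nominal decides between mode (true) and mean (false). */
    static double meanOrMode(double[][] instances, int[] instList, int attIndex, boolean nominal) {
        if (instList.length == 0) {
            throw new IllegalArgumentException("empty index list");
        }
        if (!nominal) {
            double sum = 0.0;
            for (int idx : instList) {
                sum += instances[idx][attIndex];
            }
            return sum / instList.length;
        }
        // nominal case: count occurrences and keep the most frequent value
        Map<Double, Integer> counts = new HashMap<>();
        double mode = instances[instList[0]][attIndex];
        int best = 0;
        for (int idx : instList) {
            double v = instances[idx][attIndex];
            int n = counts.merge(v, 1, Integer::sum);
            if (n > best) {
                best = n;
                mode = v;
            }
        }
        return mode;
    }
}
```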

assignToCenters

protected boolean assignToCenters(weka.core.neighboursearch.KDTree tree,
                                  weka.core.Instances centers,
                                  int[][] instOfCent,
                                  int[] allInstList,
                                  int[] assignments,
                                  int iterationCount)
                           throws Exception
Assigns instances to centers.

Parameters:
tree - KDTree on all instances
centers - all the input centers
instOfCent - the instances to each center
allInstList - list of all instances
assignments - assignments of instances to centers
iterationCount - the number of iteration
Returns:
true if converged
Throws:
Exception - if something goes wrong

assignToCenters

protected boolean assignToCenters(weka.core.neighboursearch.KDTree kdtree,
                                  weka.core.Instances centers,
                                  int[][] instOfCent,
                                  int[] assignments,
                                  int iterationCount)
                           throws Exception
Assign instances to centers using KDTree. First part of conventional K-Means, returns true if new assignment is the same as the last one.

Parameters:
kdtree - KDTree on all instances
centers - all the input centers
instOfCent - the instances to each center
assignments - assignments of instances to centers
iterationCount - the number of iteration
Returns:
true if converged
Throws:
Exception - in case instances are not assigned to cluster

assignToCenters

protected boolean assignToCenters(weka.core.Instances centers,
                                  int[][] instOfCent,
                                  int[] allInstList,
                                  int[] assignments)
                           throws Exception
Assign instances to centers. Part of conventional K-Means, returns true if new assignment is the same as the last one.

Parameters:
centers - all the input centers
instOfCent - the instances to each center
allInstList - list of all indexes
assignments - assignments of instances to centers
Returns:
true if converged
Throws:
Exception - if something goes wrong
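
The assignment step of conventional K-Means can be sketched without a KDTree as a nearest-center search. AssignToCentersSketch is a hypothetical class; it uses squared Euclidean distance on plain arrays and leaves out the instOfCent bookkeeping of the documented method.

```java
/** Hypothetical sketch of the K-Means assignment step on plain arrays. */
final class AssignToCentersSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Writes the nearest center index of each row into assignments; returns true if nothing changed. */
    static boolean assignToCenters(double[][] centers, double[][] data, int[] assignments) {
        boolean converged = true;
        for (int i = 0; i < data.length; i++) {
            int nearest = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centers.length; c++) {
                double d = squaredDistance(data[i], centers[c]);
                if (d < bestDist) {
                    bestDist = d;
                    nearest = c;
                }
            }
            if (assignments[i] != nearest) {
                assignments[i] = nearest;
                converged = false;
            }
        }
        return converged;
    }
}
```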

nextAssignedOne

protected int nextAssignedOne(int cent,
                              int lastIndex,
                              int[] assignments)
Searches along the assignment array for the next entry of the center in question.

Parameters:
cent - index of the center
lastIndex - index to start searching
assignments - assignments
Returns:
index of the instance the center cent is assigned to
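
One way to read this search is as a linear scan over the assignment array. In the hypothetical sketch below, starting the scan just after lastIndex and returning -1 when nothing is found are both assumptions; the description above does not fix either detail.

```java
/** Hypothetical sketch of a linear scan for the next instance assigned to a center. */
final class NextAssignedOneSketch {

    /** Returns the first index after lastIndex whose assignment equals cent, or -1 if there is none. */
    static int nextAssignedOne(int cent, int lastIndex, int[] assignments) {
        for (int i = lastIndex + 1; i < assignments.length; i++) {
            if (assignments[i] == cent) {
                return i;
            }
        }
        return -1; // assumed "not found" marker
    }
}
```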

splitCenter

protected weka.core.Instances splitCenter(Random random,
                                          weka.core.Instance center,
                                          double variance,
                                          weka.core.Instances model)
                                   throws Exception
Split centers in their region. Generates a random vector of length = variance and adds it to and subtracts it from the cluster vector to get two new clusters.

Parameters:
random - random function
center - the center that is split here
variance - variance of the cluster
model - data model valid
Returns:
a pair of new centers
Throws:
Exception - something in AlgVector goes wrong
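
The split described above can be sketched with plain arrays. SplitCenterSketch is a hypothetical class; drawing the random direction from Gaussian components is an assumption of this sketch, and the variance argument is used directly as the vector length, as the description states.

```java
import java.util.Random;

/** Hypothetical sketch: split one center into two children along a random direction. */
final class SplitCenterSketch {

    /** Returns {center + v, center - v}, where v is a random vector whose length equals variance. */
    static double[][] splitCenter(Random random, double[] center, double variance) {
        int dims = center.length;
        if (dims == 0) {
            throw new IllegalArgumentException("center must have at least one dimension");
        }
        double[] v = new double[dims];
        double norm = 0.0;
        while (norm == 0.0) { // redraw in the unlikely case of an all-zero vector
            for (int a = 0; a < dims; a++) {
                v[a] = random.nextGaussian();
                norm += v[a] * v[a];
            }
        }
        double scale = variance / Math.sqrt(norm); // rescale v to the requested length
        double[][] children = new double[2][dims];
        for (int a = 0; a < dims; a++) {
            children[0][a] = center[a] + scale * v[a];
            children[1][a] = center[a] - scale * v[a];
        }
        return children;
    }
}
```

By construction the two children are mirror images of each other, so the parent center is their midpoint.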

splitCenters

protected weka.core.Instances splitCenters(Random random,
                                           weka.core.Instances instances,
                                           weka.core.Instances model)
Split centers in their region. (*Alternative version of splitCenter()*)

Parameters:
random - the random number generator
instances - of the region
model - the model for the centers (should be the same as that of instances)
Returns:
a pair of new centers

makeCentersRandomly

protected weka.core.Instances makeCentersRandomly(Random random0,
                                                  weka.core.Instances model,
                                                  int numClusters)
Generates new centers randomly. Used for starting centers.

Parameters:
random0 - random number generator
model - data model of the instances
numClusters - number of clusters
Returns:
new centers

calculateBIC

protected double calculateBIC(int[] instList,
                              weka.core.Instance center,
                              double mle,
                              weka.core.Instances model)
Returns the BIC-value for the given center and instances.

Parameters:
instList - The indices of the instances that belong to the center
center - the center.
mle - maximum likelihood
model - the data model
Returns:
the BIC value

calculateBIC

protected double calculateBIC(int[][] instOfCent,
                              weka.core.Instances centers,
                              double[] mle)
Calculates the BIC for the given set of centers and instances.

Parameters:
instOfCent - The instances that belong to their respective centers
centers - the centers
mle - maximum likelihood
Returns:
The BIC for the input.
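
For orientation, the scoring scheme in the Pelleg and Moore paper cited in the class description penalises the log-likelihood by the number of free parameters. The sketch below follows that general form; BicSketch is hypothetical, and the exact terms and sign convention used by calculateBIC in this class are not spelled out here, so the sketch should not be read as its implementation.

```java
/** Hypothetical sketch of a BIC score of the form logLikelihood - (p / 2) * log(R). */
final class BicSketch {

    /** Free parameters of a K-center spherical Gaussian model in M dimensions: (K - 1) + M * K + 1. */
    static int freeParameters(int numCenters, int numDims) {
        return (numCenters - 1) + numDims * numCenters + 1;
    }

    /** Penalised score; numPoints is the number of instances R the likelihood was computed on. */
    static double bic(double logLikelihood, int numParams, int numPoints) {
        return logLikelihood - (numParams / 2.0) * Math.log(numPoints);
    }
}
```

In this form, a model with more centers has to raise the log-likelihood by more than its additional penalty for the split to pay off.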

logLikelihoodEstimate

protected double logLikelihoodEstimate(int numInst,
                                       weka.core.Instance center,
                                       double distortion,
                                       int numCent)
Calculates the log-likelihood of the data for the given model, taken at the maximum likelihood point.

Parameters:
numInst - number of instances that belong to the center
center - the center
distortion - distortion
numCent - number of centers
Returns:
the likelihood estimate
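
The cited paper gives a closed form for the log-likelihood of the points of one center under a spherical Gaussian. The sketch below transcribes that formula; unlike the method above, it takes the variance estimate, the total number of instances, and the dimensionality as explicit arguments. The class name and parameter names are hypothetical, and whether the method computes exactly this expression is an assumption.

```java
// Hypothetical helper, not part of XMeans: per-center log-likelihood from Pelleg and Moore (2000).
final class LogLikelihoodSketch {
    /**
     * Computes
     *   -Rn/2 * log(2*pi) - Rn*M/2 * log(variance) - (Rn - K)/2 + Rn*log(Rn) - Rn*log(R)
     * where Rn = numInst, R = totalInst, M = numDim, K = numCent.
     */
    static double logLikelihood(int numInst, int totalInst, int numDim,
                                int numCent, double variance) {
        if (numInst < 1 || totalInst < numInst || variance <= 0.0) {
            throw new IllegalArgumentException("invalid counts or non-positive variance");
        }
        double rn = numInst;
        return -(rn / 2.0) * Math.log(2.0 * Math.PI)
                - (rn * numDim / 2.0) * Math.log(variance)
                - (rn - numCent) / 2.0
                + rn * Math.log(rn)
                - rn * Math.log(totalInst);
    }
}
```

The guard on `variance` matters in practice: a center whose points all coincide has zero variance, and the logarithm is then undefined.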

distortion

protected double[] distortion(int[][] instOfCent,
                              weka.core.Instances centers)
Calculates the maximum likelihood estimate for the variance.

Parameters:
instOfCent - indices of the instances belonging to each center
centers - the centers
Returns:
the list of distortions, one per center.
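
A common definition of distortion is the sum of squared Euclidean distances between each point and its center. The sketch below computes that quantity per center on plain arrays. The class `DistortionSketch` and the extra `data` parameter are hypothetical, and the documentation does not say whether the method normalizes the sum, so treat the unnormalized form here as an assumption.

```java
// Hypothetical helper, not part of XMeans: per-center sum of squared distances.
final class DistortionSketch {
    /**
     * For each center c, sums the squared Euclidean distances from the points
     * listed in instOfCent[c] (indices into data) to centers[c].
     */
    static double[] distortion(double[][] data, int[][] instOfCent, double[][] centers) {
        double[] result = new double[centers.length];
        for (int c = 0; c < centers.length; c++) {
            for (int index : instOfCent[c]) {
                for (int d = 0; d < centers[c].length; d++) {
                    double diff = data[index][d] - centers[c][d];
                    result[c] += diff * diff;
                }
            }
        }
        return result;
    }
}
```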

clusterProcessedInstance

protected int clusterProcessedInstance(weka.core.Instance instance,
                                       weka.core.Instances centers)
Clusters an instance.

Parameters:
instance - the instance to assign a cluster to.
centers - the centers to cluster the instance to.
Returns:
a cluster index.
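
Assigning an instance to a cluster typically means finding the nearest center. The sketch below does this with squared Euclidean distance, in line with the documented default distance function (weka.core.EuclideanDistance), on plain `double[]` vectors. The class `NearestCenterSketch` is hypothetical, and the tie-breaking rule is a choice of this sketch rather than documented behavior.

```java
// Hypothetical helper, not part of XMeans: nearest-center assignment.
final class NearestCenterSketch {
    /** Returns the index of the center closest to instance; ties go to the lowest index. */
    static int clusterInstance(double[] instance, double[][] centers) {
        if (centers.length == 0) {
            throw new IllegalArgumentException("at least one center is required");
        }
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < instance.length; d++) {
                double diff = instance[d] - centers[c][d];
                dist += diff * diff;
            }
            // Strict comparison keeps the first center found on ties.
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```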

clusterProcessedInstance

protected int clusterProcessedInstance(weka.core.Instance instance)
Clusters an instance that has been through the filters.

Parameters:
instance - the instance to assign a cluster to
Returns:
a cluster number

clusterInstance

public int clusterInstance(weka.core.Instance instance)
                    throws Exception
Assigns a given instance to a cluster.

Specified by:
clusterInstance in interface weka.clusterers.Clusterer
Overrides:
clusterInstance in class weka.clusterers.AbstractClusterer
Parameters:
instance - the instance to be assigned to a cluster
Returns:
the index of the assigned cluster
Throws:
Exception - if the instance could not be clustered successfully

numberOfClusters

public int numberOfClusters()
Returns the number of clusters.

Specified by:
numberOfClusters in interface weka.clusterers.Clusterer
Specified by:
numberOfClusters in class weka.clusterers.AbstractClusterer
Returns:
the number of clusters generated for a training dataset.

listOptions

public Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface weka.core.OptionHandler
Overrides:
listOptions in class weka.clusterers.RandomizableClusterer
Returns:
an enumeration of all the available options

minNumClustersTipText

public String minNumClustersTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setMinNumClusters

public void setMinNumClusters(int n)
Sets the minimum number of clusters to generate.

Parameters:
n - the minimum number of clusters to generate

getMinNumClusters

public int getMinNumClusters()
Gets the minimum number of clusters to generate.

Returns:
the minimum number of clusters to generate

maxNumClustersTipText

public String maxNumClustersTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setMaxNumClusters

public void setMaxNumClusters(int n)
Sets the maximum number of clusters to generate.

Parameters:
n - the maximum number of clusters to generate

getMaxNumClusters

public int getMaxNumClusters()
Gets the maximum number of clusters to generate.

Returns:
the maximum number of clusters to generate

maxIterationsTipText

public String maxIterationsTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setMaxIterations

public void setMaxIterations(int i)
                      throws Exception
Sets the maximum number of iterations to perform.

Parameters:
i - the number of iterations
Throws:
Exception - if i is less than 1

getMaxIterations

public int getMaxIterations()
Gets the maximum number of iterations.

Returns:
the number of iterations

maxKMeansTipText

public String maxKMeansTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setMaxKMeans

public void setMaxKMeans(int i)
Sets the maximum number of iterations to perform in KMeans.

Parameters:
i - the number of iterations

getMaxKMeans

public int getMaxKMeans()
Gets the maximum number of iterations in KMeans.

Returns:
the number of iterations

maxKMeansForChildrenTipText

public String maxKMeansForChildrenTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setMaxKMeansForChildren

public void setMaxKMeansForChildren(int i)
Sets the maximum number of KMeans iterations performed on the child centers.

Parameters:
i - the number of iterations

getMaxKMeansForChildren

public int getMaxKMeansForChildren()
Gets the maximum number of KMeans iterations performed on the child centers.

Returns:
the number of iterations

cutOffFactorTipText

public String cutOffFactorTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setCutOffFactor

public void setCutOffFactor(double i)
Sets a new cutoff factor.

Parameters:
i - the new cutoff factor

getCutOffFactor

public double getCutOffFactor()
Gets the cutoff factor.

Returns:
the cutoff factor

binValueTipText

public String binValueTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getBinValue

public double getBinValue()
Gets the value that represents true in a new numeric attribute. (False is always represented by 0.0.)

Returns:
the value that represents true in a new numeric attribute

setBinValue

public void setBinValue(double value)
Sets the distance value between true and false of binary attributes, and between "same" and "different" of nominal attributes.

Parameters:
value - the distance

distanceFTipText

public String distanceFTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDistanceF

public void setDistanceF(weka.core.DistanceFunction distanceF)
Sets the distance function.

Parameters:
distanceF - the distance function with all options set

getDistanceF

public weka.core.DistanceFunction getDistanceF()
Gets the distance function.

Returns:
the distance function

getDistanceFSpec

protected String getDistanceFSpec()
Gets the distance function specification string, which contains the class name of the distance function class and any options to it.

Returns:
the distance function specification string

debugVectorsFileTipText

public String debugVectorsFileTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDebugVectorsFile

public void setDebugVectorsFile(File value)
Sets the file that has the random vectors stored. Only used for debugging purposes.

Parameters:
value - the file to read the random vectors from

getDebugVectorsFile

public File getDebugVectorsFile()
Gets the file that has the random vectors stored. Only used for debugging purposes.

Returns:
the file to read the vectors from

initDebugVectorsInput

public void initDebugVectorsInput()
                           throws Exception
Initialises the debug vector input.

Throws:
Exception - if there is an error opening the debug input file.

getNextDebugVectorsInstance

public weka.core.Instance getNextDebugVectorsInstance(weka.core.Instances model)
                                               throws Exception
Reads an instance from the debug vectors file.

Parameters:
model - the data model for the instance.
Returns:
the next debug vector.
Throws:
Exception - if there is no debug vector in m_DebugVectors.

inputCenterFileTipText

public String inputCenterFileTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setInputCenterFile

public void setInputCenterFile(File value)
Sets the file to read the list of centers from.

Parameters:
value - the file to read centers from

getInputCenterFile

public File getInputCenterFile()
Gets the file to read the list of centers from.

Returns:
the file to read the centers from

outputCenterFileTipText

public String outputCenterFileTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setOutputCenterFile

public void setOutputCenterFile(File value)
Sets the file to write the list of centers to.

Parameters:
value - file to write centers to

getOutputCenterFile

public File getOutputCenterFile()
Gets the file to write the list of centers to.

Returns:
the file to write the centers to

KDTreeTipText

public String KDTreeTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setKDTree

public void setKDTree(weka.core.neighboursearch.KDTree k)
Sets the KDTree class.

Parameters:
k - a KDTree object with all options set

getKDTree

public weka.core.neighboursearch.KDTree getKDTree()
Gets the KDTree class.

Returns:
the configured KDTree

useKDTreeTipText

public String useKDTreeTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setUseKDTree

public void setUseKDTree(boolean value)
Sets whether to use the KDTree or not.

Parameters:
value - if true the KDTree is used

getUseKDTree

public boolean getUseKDTree()
Gets whether the KDTree is used or not.

Returns:
true if KDTrees are used

getKDTreeSpec

protected String getKDTreeSpec()
Gets the KDTree specification string, which contains the class name of the KDTree class and any options to the KDTree.

Returns:
the KDTree string.

debugLevelTipText

public String debugLevelTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDebugLevel

public void setDebugLevel(int d)
Sets the debug level. A debug level of 0 means no output.

Parameters:
d - debuglevel

getDebugLevel

public int getDebugLevel()
Gets the debug level.

Returns:
debug level

checkInstances

protected void checkInstances()
Checks the instances. No checks are performed in this class itself, but the check of the distance function is called.


setOptions

public void setOptions(String[] options)
                throws Exception
Parses a given list of options.

Valid options are:

 -I <num>
  maximum number of overall iterations
  (default 1).
 -M <num>
  maximum number of iterations in the kMeans loop in
  the Improve-Parameter part 
  (default 1000).
 -J <num>
  maximum number of iterations in the kMeans loop
  for the split centroids in the Improve-Structure part
  (default 1000).
 -L <num>
  minimum number of clusters
  (default 2).
 -H <num>
  maximum number of clusters
  (default 4).
 -B <value>
  distance value for binary attributes
  (default 1.0).
 -use-kdtree
  Uses the KDTree internally
  (default no).
 -K <KDTree class specification>
  Full class name of KDTree class to use, followed
  by scheme options.
  e.g.: "weka.core.neighboursearch.KDTree -P"
  (default no KDTree class used).
 -C <value>
  cutoff factor, takes the given percentage of the split
  centroids if none of the children win
  (default 0.0).
 -D <distance function class specification>
  Full class name of Distance function class to use, followed
  by scheme options.
  (default weka.core.EuclideanDistance).
 -N <file name>
  file to read starting centers from (ARFF format).
 -O <file name>
  file to write centers to (ARFF format).
 -U <int>
  The debug level.
  (default 0)
 -Y <file name>
  The debug vectors file.
 -S <num>
  Random number seed.
  (default 10)

Specified by:
setOptions in interface weka.core.OptionHandler
Overrides:
setOptions in class weka.clusterers.RandomizableClusterer
Parameters:
options - the list of options as an array of strings
Throws:
Exception - if an option is not supported
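
As an illustration of the option format listed above, the following sketch assembles an option array from a few of the documented flags (-I, -L, -H, -S, -use-kdtree). The class `XMeansOptionsSketch` and the method `buildOptions` are hypothetical helpers, and the range check is a sanity check of this sketch rather than documented behavior of setOptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of XMeans: builds an option array from documented flags.
final class XMeansOptionsSketch {
    /** Returns flag/value pairs in the form expected by an OptionHandler. */
    static String[] buildOptions(int maxIterations, int minClusters, int maxClusters,
                                 int seed, boolean useKDTree) {
        // Sanity check added by this sketch only.
        if (minClusters > maxClusters) {
            throw new IllegalArgumentException("minClusters must not exceed maxClusters");
        }
        List<String> options = new ArrayList<>();
        options.add("-I");
        options.add(Integer.toString(maxIterations));
        options.add("-L");
        options.add(Integer.toString(minClusters));
        options.add("-H");
        options.add(Integer.toString(maxClusters));
        options.add("-S");
        options.add(Integer.toString(seed));
        if (useKDTree) {
            options.add("-use-kdtree"); // flag without a value
        }
        return options.toArray(new String[0]);
    }
}
```

The resulting array, for example `{"-I", "5", "-L", "2", "-H", "8", "-S", "10"}`, could then be passed to setOptions(String[]) on an XMeans instance.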

getOptions

public String[] getOptions()
Gets the current settings of XMeans.

Specified by:
getOptions in interface weka.core.OptionHandler
Overrides:
getOptions in class weka.clusterers.RandomizableClusterer
Returns:
an array of strings suitable for passing to setOptions

toString

public String toString()
Returns a string describing this clusterer.

Overrides:
toString in class Object
Returns:
a description of the clusterer as a string

getClusterCenters

public weka.core.Instances getClusterCenters()
Returns the centers of the clusters as an Instances object.

Returns:
the cluster centers.

PrCentersFD

protected void PrCentersFD(int debugLevel)
Print centers for debug.

Parameters:
debugLevel - the debug level at which the centers are printed

TFD

protected boolean TFD(int debugLevel)
Tests on debug status.

Parameters:
debugLevel - the debug level to test against
Returns:
true if debug level is set

PFD

protected void PFD(int debugLevel,
                   String output)
Does debug printouts.

Parameters:
debugLevel - the debug level at which the output is printed
output - string that is printed

PFD_CURR

protected void PFD_CURR(String output)
Does debug printouts.

Parameters:
output - string that is printed

getRevision

public String getRevision()
Returns the revision string.

Specified by:
getRevision in interface weka.core.RevisionHandler
Overrides:
getRevision in class weka.clusterers.AbstractClusterer
Returns:
the revision

main

public static void main(String[] argv)
Main method for testing this class.

Parameters:
argv - should contain options


Copyright © 2013. All Rights Reserved.