%DeepSee.extensions.clusters.AbstractModel

Class %DeepSee.extensions.clusters.AbstractModel Extends %RegisteredObject [ System = 4 ]

This class is the base class for implementations of different cluster analysis algorithms. It defines storage for clustering models and provides methods to retrieve information about the data and the clustering.

Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics.

By default, model data is stored in ^IRIS.Temp globals.

Properties

Dim

Property Dim As %Integer;

DSName

Property DSName As %String;

Normalize

Property Normalize As %Boolean [ InitialExpression = 1 ];

Whether to normalize distance across multiple dimensions. If set to 1 (default) then distance is normalized by variances.

P

Property P As %Double [ InitialExpression = 2 ];

The power (exponent) of the Minkowski distance used to calculate dissimilarity. The default is Euclidean distance (P=2). Specify 1 for Manhattan distance or 100 for Chebyshev distance (the maximum difference between coordinates).

Verbose

Property Verbose As %Boolean [ InitialExpression = 1 ];

Methods

Exists

ClassMethod Exists(dataset As %String) As %Boolean

Checks whether a model already exists for the dataset whose name is given by the dataset argument.

Delete

ClassMethod Delete(dataset As %String) As %Status

Deletes the model for the dataset whose name is given by the dataset argument.

Check

ClassMethod Check(dataset As %String, exists As %Boolean) As %Status [ Internal ]

IsPrepared

Method IsPrepared() As %Boolean

Checks whether the model is ready for an analysis to be executed. This depends on the specific algorithm, so this method is overridden by subclasses.

Reset

Method Reset()

Kills all the data associated with this model.

SetData

Method SetData(rs As %IResultSet, dim As %Integer, nullReplacement As %Double = -1) As %Status

Sets the data to be associated with this model. The method takes 3 arguments:

rs - A result set that provides the data. The first column returned by the result set is assumed to be a unique Id of the record. It is not used by any clustering algorithm but can be retrieved by the application to identify the record. It can be a database %ID or any other value that makes sense to the application. The other columns provide the numerical coordinate values of the record that are used by the clustering algorithms. The result set must contain at least dim + 1 columns.
dim - The dimensionality of the model, i.e. the number of coordinates consumed by the clustering algorithm.
nullReplacement - Optional. If specified, this is the numeric replacement for empty values. The default is -1.
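
The input contract described above can be pictured with a small Python sketch. This is not the class's implementation and does not use any IRIS API. The function name `prepare_rows` and the treatment of `None` and the empty string as "empty" values are assumptions made purely for illustration.

```python
def prepare_rows(rows, dim, null_replacement=-1.0):
    """Split rows shaped (id, x1, ..., x_dim, ...) into ids and coordinates.

    Hypothetical helper: mirrors the documented contract of SetData(),
    not its actual implementation.
    """
    ids, points = [], []
    for row in rows:
        # The result set must provide at least dim + 1 columns.
        if len(row) < dim + 1:
            raise ValueError("each row needs at least dim + 1 columns")
        # First column: the record Id, kept only for identification.
        ids.append(row[0])
        # Next dim columns: coordinates; empty values are replaced.
        coords = [
            float(null_replacement) if v is None or v == "" else float(v)
            for v in row[1:dim + 1]
        ]
        points.append(coords)
    return ids, points


ids, points = prepare_rows([("a", 1, None, 9), ("b", 2, 3, 9)], dim=2)
```

Here the fourth column is ignored because only dim coordinates are consumed, and the missing value in the first row becomes -1.0.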

dist

Method dist(i As %Integer, j As %Integer) As %Double [ Internal ]

Distance

Method Distance(i As %Integer, j As %Integer, p As %Double = 2, normalize As %Boolean = 1) As %Double

Returns the dissimilarity measure between two data points of the model. The method takes 4 arguments:

i, j - The ordinal numbers of the data points in the model
p - Optional. If specified, the power of the Minkowski distance. The default is Euclidean distance (p=2). Specify 1 for Manhattan distance or 100 for Chebyshev distance (the maximum difference between coordinates).
normalize - Whether to normalize coordinates by their variances
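
As an illustration of the parameters above, here is a minimal Python sketch of a Minkowski distance with optional per-coordinate scaling. It is not the class's code. In particular, dividing each coordinate difference by the square root of that coordinate's population variance is an assumption about what "normalize by variances" means, and the names `variances` and `minkowski` are hypothetical.

```python
def variances(points):
    """Population variance of each coordinate across all points."""
    n = len(points)
    dim = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    return [sum((p[d] - means[d]) ** 2 for p in points) / n for d in range(dim)]


def minkowski(x, y, p=2.0, scale=None):
    """Minkowski distance of power p between x and y.

    scale, if given, holds one variance per coordinate; each difference
    is divided by its square root (an assumed normalization scheme).
    """
    total = 0.0
    for d, (a, b) in enumerate(zip(x, y)):
        diff = abs(a - b)
        if scale is not None and scale[d] > 0:
            diff /= scale[d] ** 0.5
        total += diff ** p
    return total ** (1.0 / p)


euclidean = minkowski([0, 0], [3, 4], p=2)      # straight-line distance
manhattan = minkowski([0, 0], [3, 4], p=1)      # sum of coordinate differences
near_chebyshev = minkowski([0, 0], [3, 4], p=100)  # close to the largest difference
```

With a large power such as 100, the largest coordinate difference dominates the sum, which is why the result approaches the Chebyshev distance.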

Distance1

Method Distance1(i As %Integer, ByRef z, p As %Double = 2, normalize As %Boolean = 1) As %Double

Returns the dissimilarity measure between a data point of the model and a point with given coordinates. The method takes 4 arguments:

i - The ordinal number of the data point in the model
z - The multidimensional coordinates of the second point: z(1), z(2), ..., z(dim)
p - Optional. If specified, the power of the Minkowski distance. The default is Euclidean distance (p=2). Specify 1 for Manhattan distance or 100 for Chebyshev distance (the maximum difference between coordinates).
normalize - Whether to normalize coordinates by their variances

Distance12

Method Distance12(ByRef z1, ByRef z2, p As %Double = 2, normalize As %Boolean = 1) As %Double

Returns the dissimilarity measure between two points with given coordinates. The method takes 4 arguments:

z1, z2 - The multidimensional coordinates of the points: z1(1), z1(2), ..., z1(dim) and z2(1), z2(2), ..., z2(dim)
p - Optional. If specified, the power of the Minkowski distance. The default is Euclidean distance (p=2). Specify 1 for Manhattan distance or 100 for Chebyshev distance (the maximum difference between coordinates).
normalize - Whether to normalize coordinates by their variances

GetNumberOfClusters

Method GetNumberOfClusters() As %Integer

Returns the number of clusters in the model.

GetCount

Method GetCount() As %Integer

Returns the number of all data points in the model.

GetId

Method GetId(i As %Integer) As %String

Returns the unique Id of the point with the ordinal number specified by i. This is the Id that was assigned in the SetData() method.

ById

Method ById(id As %RawString) As %Integer

Returns the ordinal number of the point with the given unique Id. The Id must correspond to the one assigned in the SetData() method.

GetData

Method GetData(i As %Integer, j As %Integer) As %String

GetDimensions

Method GetDimensions() As %Integer

Returns the dimensionality of the model.

GetCluster

Method GetCluster(point As %Integer) As %Integer

Returns the cluster ordinal for a given point. Point is identified by its ordinal number.

GetCost

Method GetCost(i As %Integer, j As %Integer) As %Integer

Returns the dissimilarity measure as used by this clustering algorithm between two data points of the model. Points are identified by their ordinal numbers.

iterateCluster

Method iterateCluster(k As %Integer, ByRef i As %Integer, Output id As %String, Output coordinates)

Iterates over all the data points assigned to a given cluster. The cluster is identified by its ordinal number k.

printCluster

Method printCluster(k As %Integer)

Convenience method. Writes all data points assigned to a given cluster to the default output device. The cluster is identified by its ordinal number k.

GetCentroid

Method GetCentroid(k As %Integer, Output z)

Returns the coordinates for the centroid for a given cluster. Cluster is identified by its ordinal number k.
Coordinates are returned as multidimensional value: z(1), z(2), ..., z(dim)

GlobalCentroid

Method GlobalCentroid(Output z)

Returns the coordinates for the centroid for the whole dataset.
Coordinates are returned as multidimensional value: z(1), z(2), ..., z(dim)
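
For readers unfamiliar with the term, the sketch below computes a centroid as the coordinate-wise arithmetic mean of a set of points. This is the usual definition and is assumed here; the source does not state how the class computes it. Applied to one cluster's points it corresponds to a cluster centroid, and applied to every point it corresponds to the centroid of the whole dataset.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


z = centroid([[0, 0], [2, 4]])  # one value per dimension
```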

SubsetCentroid

Method SubsetCentroid(key As %String, Output z)

GetClusterSize

Method GetClusterSize(k As %Integer)

Returns the number of data points assigned to a given cluster. Cluster is identified by its ordinal number k.

printAll

Method printAll()

Convenience method. Writes all data points in the dataset to the default output device.

RelativeClusterCost

Method RelativeClusterCost(k As %Integer, m As %Integer) As %Double

Returns the cost of a given cluster relative to a medoid point m. The cluster is identified by its ordinal number k. Point m is identified by its ordinal number.

GetCalinskiHarabaszIndex

Method GetCalinskiHarabaszIndex(normalize As %Integer = -1) As CalinskiHarabasz

Returns an object that can calculate an index used in cluster validation and in determining the optimal number of clusters. This method returns the Calinski-Harabasz index.
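
The following Python sketch shows the textbook Calinski-Harabasz index: the between-cluster dispersion divided by (k - 1), over the within-cluster dispersion divided by (n - k), using squared Euclidean distances. It is offered only to clarify what the index measures. Whether the returned object uses exactly this formula, and how the normalize argument affects it, is not stated in the source.

```python
def calinski_harabasz(points, labels):
    """Textbook Calinski-Harabasz index for points with cluster labels."""
    n = len(points)
    dim = len(points[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    # Group points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")

    overall = mean(points)
    between = 0.0  # dispersion of cluster centroids around the global centroid
    within = 0.0   # dispersion of points around their own cluster centroid
    for members in clusters.values():
        c = mean(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    if within == 0:
        raise ValueError("index is undefined when within-cluster dispersion is zero")
    return (between / (k - 1)) / (within / (n - k))


score = calinski_harabasz([[0], [2], [10], [12]], [0, 0, 1, 1])
```

Higher values indicate compact, well-separated clusters, so the index is typically compared across candidate numbers of clusters.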

GetASWIndex

Method GetASWIndex() As ASW

Returns an object that can calculate an index used in cluster validation and in determining the optimal number of clusters. This method returns the Average Silhouette Width index.
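
To make the Average Silhouette Width concrete, here is a minimal Python sketch of the standard definition. For each point, a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster; the silhouette is (b - a) / max(a, b), and the index is the average over all points. Plain Euclidean distance and a silhouette of 0 for singleton clusters are assumptions of this sketch, not statements about the class.

```python
import math


def average_silhouette_width(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    n = len(points)
    # Map each cluster label to the indices of its members.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    total = 0.0
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


asw = average_silhouette_width([[0], [2], [10], [12]], [0, 0, 1, 1])
```

Values close to 1 suggest that points sit well inside their clusters, while values near 0 or below suggest overlapping or misassigned points.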

GetPearsonGammaIndex

Method GetPearsonGammaIndex() As PearsonGamma

Returns an object that can calculate an index used in cluster validation and in determining the optimal number of clusters. This method returns the Pearson-Gamma index, which is the correlation coefficient between the distance between two points and a binary function indicating whether they belong to the same cluster. This index is useful when clustering is used for dimension reduction, i.e. the process of reducing the number of random variables under consideration.
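
The correlation described above can be sketched in Python as follows. For every pair of points, the sketch records the Euclidean distance and a 0/1 indicator, then computes the Pearson correlation between the two lists. Coding the indicator as 1 for pairs in different clusters (so that a good clustering yields a value near 1) is an assumed convention; the source does not specify the coding or the distance used.

```python
import math


def pearson_gamma(points, labels):
    """Pearson correlation between pairwise distances and a
    'different cluster' indicator (1 = different, 0 = same)."""
    dists, flags = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            dists.append(math.dist(points[i], points[j]))
            flags.append(1.0 if labels[i] != labels[j] else 0.0)

    m = len(dists)
    mean_d = sum(dists) / m
    mean_f = sum(flags) / m
    cov = sum((d - mean_d) * (f - mean_f) for d, f in zip(dists, flags))
    var_d = sum((d - mean_d) ** 2 for d in dists)
    var_f = sum((f - mean_f) ** 2 for f in flags)
    if var_d == 0 or var_f == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_d * var_f)


gamma = pearson_gamma([[0], [1], [10], [11]], [0, 0, 1, 1])
```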

time

ClassMethod time(ByRef ts) As %Double [ Internal ]

randomSubset

Method randomSubset(size As %Integer, ByRef sc As %Status) As %Integer

GeneratePMML

Method GeneratePMML(Output pPMML As %DeepSee.PMML.Definition.PMML, ByRef pClusterNames) As %Status
