'Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       jakarta-commons-dev
Subject:    Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer
From:       Gilles Sadowski <gilleseran () gmail ! com>
Date:       2020-02-28 16:04:53
Message-ID: CACioL_EX=_zmFgtiBNV8MoDjAbGqZKRToi1rbUqoNfGn7f98yQ () mail ! gmail ! com
[Download RAW message or body]

Hi.

Le ven. 28 f=C3=A9vr. 2020 =C3=A0 05:04, chentao106@qq.com <chentao106@qq.c=
om> a =C3=A9crit :
>
>
>
>
> --------------
> chentao106@qq.com
> >Hi.
> >
> >Le jeu. 27 f=C3=A9vr. 2020 =C3=A0 06:17, chentao106@qq.com <chentao106@q=
q.com> a =C3=A9crit :
> >>
> >> Hi,
> >>
> >> > [...]
> >> >> >>
> >> >> >> Do you mean I should fire a JIRA issue about reuse&nbsp;"centroi=
dOf" and "chooseInitialCenters",
> >> >> >> then start a PR and a disscuss about "ClusterUtils"?
> >> >> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after a=
ll done?
> >> >> >
> >> >> >I cannot guarantee that the whole process will be streamlined.
> >> >> >In effect, you can work on multiple branches (one for each
> >> >> >prospective PR).
> >> >> >I'd say that you should start by describing (here on the ML) the
> >> >> >rationale for "ClusterUtils" (and contrast it with say, a common
> >> >> >base class).
> >> >> >[Only when the design has been agreed on,  a JIRA issue to
> >> >> >implement it should be created in order to track the actual
> >> >> >coding work).]
> >> >>
> >> >> OK, I think we should start from here:
> >> >>
> >> >> The method "centroidOf"  and "chooseInitialCenters" in KMeansPlusPl=
usClusterer
> >> >>  could be reused by other KMeans Clusterer like MiniBatchKMeansClus=
terer which I want to implement.
> >> >>
> >> >> There are two solution for reuse "centroidOf"  and "chooseInitialCe=
nters":
> >> >> 1. Extract a abstract class for KMeans Clusterer named "AbstractKMe=
ansClusterer",
> >> >>  and move "centroidOf"  and "chooseInitialCenters" as protected met=
hods in it;
> >> >>  the EmptyClusterStrategy and related logic can also move to the "A=
bstractKMeansClusterer".
> >> >> 2. Create a static utility class, and move "centroidOf"  and "choos=
eInitialCenters" in it,
> >> >>  and some useful clustering method like predict(Predict which clust=
er is best for a specified point) can put in it.
> >> >>
> >> >
> >> >At first sight, I prefer option 1.
> >> >Indeed, o.a things "chooseInitialCenters" is a method that is of no i=
nterest to
> >> >users of the functionality (and so should not be part of the "public"=
 API).
> >>
> >> Persuasive explain, and I agree with you, that extract a abstract clas=
s for KMeans is better.
> >> And how can we make a conclusion?
> >> ---------------------------------------------
> >>
> >> Mention the "public API", I suppose there should be a series of "Centr=
oidInitializer",
> >>  that "chooseInitialCenters" with various of algorithms.
> >> The k-means++ cluster algorithm is a special implementation of k-means
> >>  which initialize cluster centers with k-means++ algorithm.
> >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can =
be "KMeansClusterer"
> >>  with a "KMeansPlusPlusCentroidInitializer" strategy.
> >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", =
it is a common k-means.
> >>
> >> ----------------------------------------------------------
> >> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
> >> >the "Cluster"
> >> >interface?  What is the difference with method "getCenter" (define by=
 class
> >> >"CentroidCluster")?
> >>
> >> My understanding is,:
> >>  * "Cluster" is a data class that carry the result of a clustering,
> >> "getCenter" is just a get method of CentroidCluster for get the value =
of a center point.
> >>  * "Cluster[er]" is a (Interface of )algorithm that classify points to=
 sets of Cluster.
> >>  * "CentroidCluster" is the result of a group of special Clusterer alg=
orithm like k-means,
> >>  "centroidOf" is a specific logic to calculate the center point for a =
collection of points.
> >> [Instead the DBScan cluster algorithm dose not care about the "Centroi=
d"]
> >>
> >> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists y=
et),
> >>  but different with "CentroidCluster.getCenter".
> >
> >I may be missing something about the existing design,
> >but it seems strange that "CentroidCluster" is initialized
> >with a given "center", yet it is possible to add points after
> >initialization (which IIUC would invalidate the "center").
>
> The "centroidOf" could be part of "CentroidCluster",
> but I think the existsing desgin was focus on decouple of "DistanceMeasur=
e"("centroidOf" depends on it) and "CentroidCluster".

I don't see why we need both "Cluster" and "CentroidCluster".
Indeed, as suggested before, the "center" can be computed
from a "Cluster", but does not need to be stored in it.

>
> Center recalculate often happens in each iteration of k-means Clustering,
> always with points reassign to clusters.
> We often use k-means as two pharse:
> Pharse 1: Training, classify thousands of points to set of clusters.
> Pharse 2: Predict, predict which cluster is best for a new point,
> or add a new point to the best cluster in ClusterSet,

Method "cluster" returns a "List<Cluster>"; there is no need for a
new "ClusterSet" class.
Also, IIUC the centers can be collected into a "List<Clusterable>",
so that the association is through the index into the list(s).

> but we never update the cluster center until next retraining.

IMO, that's the reason for *not*" storing the center (in such a
mutable instance).

>
> The KMeansPlusPlusClusterer and other Cluster algorithm in "commons-math"=
 just design for pharse "Training",
> it is clearly if we can consider "CentroidCluster" as a pure data class j=
ust for k-means clustering result.

See above.

Discussing the existing design further, I think that the "cluster" method s=
hould
rather be:
---CUT---
public List<Cluster<T>> cluster(Collections<T> points, DistanceMeasure dist=
)
---CUT---

And, similarly,
---CUT---
@FunctionalInterface
public interface ClusterFinder<T extends Clusterable> {
    public int indexOf(T point, List<Cluster<T> clusters, DistanceMeasure d=
ist);
}
---CUT---

> If we want the cluster result useful enough for parse "Predict",
>  the result of "KMeansPlusPlusClusterer.cluster" should return a  "Cluste=
rSet":
> ```java
> public interface ClusterSet<T extends Clusterable> extends Collection<T> =
{
>   // Retrun the cluster which the point should belong to.
>   Cluster predict(T point);
>   // Add a point to best cluster.
>   void addPoint(T point);
> }
> ```

This  "ClusterSet" seems less flexible than a "List<Cluster>".

> And "centroidOf"(just used in clustering iteration) can move up into a ab=
stract class like "CenroidClusterer".

It seems that this method could be useful for users too.

Best,
Gilles

> >It would seem that "center" should be a property computed
> >from the contents of "Cluster" e.g.:
> >
> >@FunctionalInterface
> >public interface ClusterCenterComputer<T extends Clusterable> {
> >    T centroidOf(Cluster<T> cluster);
> >}
> >
> >Regards,
> >Gilles
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic