
List:       jakarta-commons-dev
Subject:    Re: [math]: [MATH-1330] - KMeans clustering algorithm, doesn't support clustering of sparse input data
From:       Artem Barger <artem () bargr ! net>
Date:       2016-04-30 23:57:59
Message-ID: CAG3-T34dCiQvwd9E=sNrXe8CBwD_v6Q9wTR+VHQEQwBiK+Tsgg () mail ! gmail ! com
[Download RAW message or body]


Hi,


On Sun, May 1, 2016 at 12:25 AM, Gilles <gilles@harfang.homelinux.org>
wrote:

>> Therefore I started to wonder why not use RealVector instead, since it
>> has a sparse implementation, so I will be able to leverage it.
>>
>
> The principle is fine; but I'm wary of using "RealVector" in new code
> since it must be refactored...
>
>
By refactoring, do you mean MATH-765?



>> Right now, using the k-means++ clustering algorithm provided by
>> commons-math, it's not feasible to cluster the entire Wikipedia dataset
>> or any other huge dataset.
>>
>
> Could you expand on this application?  What is the data?
>
>
The data is a tf-idf matrix of the Wikipedia dataset, produced by gensim.
I'm trying to run a coreset algorithm with Spark Streaming, and I need
k-means++ for the local computation.
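To give an idea of the scale involved: a tf-idf row for a single Wikipedia article has a very large dimension but only a handful of non-zero weights, which is exactly what a sparse representation exploits, both for storage and for distance computations. A minimal sketch (illustrative only, not Commons Math code; class and method names are made up):

```java
// Illustrative sparse tf-idf vector: (index, value) pairs instead of a
// dense double[]. For Wikipedia-scale vocabularies this cuts memory and
// distance-computation cost by orders of magnitude.
final class SparseTfIdfVector {
    final int[] indices;    // sorted term indices with non-zero weight
    final double[] values;  // corresponding tf-idf weights

    SparseTfIdfVector(int[] indices, double[] values) {
        this.indices = indices;
        this.values = values;
    }

    /** Euclidean distance, touching only the non-zero entries of both vectors. */
    static double distance(SparseTfIdfVector a, SparseTfIdfVector b) {
        double sum = 0;
        int i = 0, j = 0;
        // Merge the two sorted index lists, like a sorted-list merge.
        while (i < a.indices.length || j < b.indices.length) {
            if (j == b.indices.length
                    || (i < a.indices.length && a.indices[i] < b.indices[j])) {
                sum += a.values[i] * a.values[i];   // entry only in a
                i++;
            } else if (i == a.indices.length || b.indices[j] < a.indices[i]) {
                sum += b.values[j] * b.values[j];   // entry only in b
                j++;
            } else {
                double d = a.values[i] - b.values[j];  // entry in both
                sum += d * d;
                i++;
                j++;
            }
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        SparseTfIdfVector x =
                new SparseTfIdfVector(new int[]{0, 5}, new double[]{3.0, 4.0});
        SparseTfIdfVector y =
                new SparseTfIdfVector(new int[]{0}, new double[]{3.0});
        System.out.println(distance(x, y));  // prints 4.0
    }
}
```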


>
> [One of the things to bear in mind while testing new implementations
> is to not lose performance on the other classes of problems (or you'll
> take heat from some of this list's observers...).]
>
>
Are there any performance or regression tests that could be used to
validate that a new implementation doesn't introduce a significant
regression?
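If nothing dedicated exists, even a crude timing harness could serve as a first guard before a proper benchmark (JMH or similar) is set up. A sketch, not an existing Commons Math test:

```java
// Crude timing sketch for comparing distance implementations on identical
// input; a real benchmark would use JMH with proper warm-up and forks.
final class DistanceTimingSketch {
    static double denseEuclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        int dim = 10_000, reps = 1_000;
        double[] a = new double[dim], b = new double[dim];
        for (int i = 0; i < dim; i++) {
            a[i] = i;
            b[i] = dim - i;
        }
        long t0 = System.nanoTime();
        double sink = 0;  // accumulate results so the JIT can't drop the loop
        for (int r = 0; r < reps; r++) {
            sink += denseEuclidean(a, b);
        }
        long elapsed = System.nanoTime() - t0;
        System.out.printf("%d distance calls in %.1f ms (sink=%.1f)%n",
                reps, elapsed / 1e6, sink);
    }
}
```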


Best,
        Artem.



>
> Best regards,
> Gilles
>
>
>
>>
>> Best regards,
>>                       Artem Barger.
>>
>> On Sat, Apr 30, 2016 at 11:41 PM, Gilles <gilles@harfang.homelinux.org>
>> wrote:
>>
>> On Mon, 25 Apr 2016 15:52:03 +0300, Artem Barger wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'd like to provide a solution for the [MATH-1330] issue. Before starting,
>>>> I have some concerns regarding the possible design and the actual
>>>> implementation.
>>>>
>>>> Currently, all implementations of the Clusterer interface expect to
>>>> receive an instance of the DistanceMeasure class, which is used to compute
>>>> the distance or metric between two points. Switching the clustering
>>>> algorithms to work with vectors would make this unnecessary: there would
>>>> be no need to provide a DistanceMeasure, since the Vector class already
>>>> provides methods to compute vector norms.
>>>>
>>>>
>>> I think that reasons for using "double[]" in the "o.a.c.m.ml.clustering"
>>> package were:
>>>  * simple and straightforward (fixed dimension Cartesian coordinates)
>>>  * not couple it with the "o.a.c.m.linear" package whose "RealVector" is
>>>    for variable size sequences of elements (and is also, inconsistently,
>>>    used as a Cartesian vector, and also as column- and row-matrix[1])
>>>
>>> It is arguably adapted to a family of problems which the developer
>>> probably had in mind when making those design decisions.
>>>
>>> It would be interesting to know for which class of problems the design
>>> is inappropriate, in order to clarify ideas.
>>>
>>>> The main drawback of this approach is that we will lose the ability to
>>>> control which metric to use during clustering; however, the only classes
>>>> which make implicit use of this parameter are Clusterer and
>>>> KMeansPlusPlusClusterer; all others assume EuclideanDistance by default.
>>>>
>>>>
>>> There is a default indeed, but all "Clusterer" implementations use
>>> whatever "DistanceMeasure" has been passed to the constructor.
>>>
>>> Assuming that "RealVector" knows how to compute the distance means that
>>> users will have to implement their own subclass of "RealVector" and
>>> override "getDistance(RealVector)" if they want another distance.
>>> Alternatively, CM would have to define all these classes.
>>>
>>> At first sight, it does not seem the right way to go...
>>>
>>>> One of the possible approaches is to extend the DistanceMeasure interface
>>>> to be able to compute the distance between two vectors. After all, it's
>>>> only a matter of subtracting the first vector from the second and
>>>> computing the desired norm on the result.
>>>>
>>>>
>>> Seems good (at first sight) but (IMHO) only if we implement a new
>>> "CartesianVector" class unencumbered with all the cruft of "RealVector".
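Something along these lines is what I had in mind; a minimal sketch (all names here are hypothetical, not existing Commons Math types):

```java
// Hypothetical sketch: a distance measure generalized to vectors, so that
// a sparse implementation can plug in without exposing raw coordinates.
interface Vector {
    Vector subtract(Vector other);
    double norm();  // e.g. the Euclidean (L2) norm
}

interface VectorDistanceMeasure {
    double compute(Vector a, Vector b);
}

// Default behaviour, as suggested above: subtract, then take the norm.
final class NormDistance implements VectorDistanceMeasure {
    @Override
    public double compute(Vector a, Vector b) {
        return a.subtract(b).norm();
    }

    public static void main(String[] args) {
        VectorDistanceMeasure d = new NormDistance();
        System.out.println(d.compute(new DenseVector(0.0, 0.0),
                                     new DenseVector(3.0, 4.0)));  // prints 5.0
    }
}

// Tiny dense implementation, just to make the sketch runnable; a sparse
// implementation would satisfy the same interface.
final class DenseVector implements Vector {
    private final double[] v;

    DenseVector(double... v) { this.v = v; }

    @Override
    public Vector subtract(Vector other) {
        double[] o = ((DenseVector) other).v;
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[i] - o[i];
        }
        return new DenseVector(r);
    }

    @Override
    public double norm() {
        double s = 0;
        for (double x : v) {
            s += x * x;
        }
        return Math.sqrt(s);
    }
}
```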
>>>
>>>> Another possible solution is to make the vector return its coordinates,
>>>> which would allow us to use DistanceMeasure as is. Personally, I do not
>>>> think this is a good approach, since it makes no sense with sparse
>>>> vectors.
>>>>
>>>>
>>> Ruled out indeed if it conflicts with your intended usage.
>>>
>>>> The last alternative that comes to my mind is to create a set of enums
>>>> to indicate which vector norm to use to compute distances. I also do not
>>>> think this is a very good solution, since it sounds too intrusive and
>>>> might break backward compatibility.
>>>>
>>>>
>>> And forward compatibility (clustering code will have to be adapted if
>>> another distance is added later).
>>>
>>>> What do you think? Am I missing something? Is there a better possible
>>>> way to achieve the goal?
>>>>
>>>>
>>> As indicated above, a practical example might help visualize the options.
>>>
>>>
>>> Regards,
>>> Gilles
>>>
>>> [1] Cf. https://issues.apache.org/jira/browse/MATH-765
>>>
>>>
>>> Best regards,
>>>>                       Artem Barger.
>>>>
>>>>
>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>


