List:       james-dev
Subject:    Re: svn commit: r438517 - /james/server/trunk/src/java/org/apache/james/util/BayesianAnalyzer.java
From:       Vincenzo Gianferrari Pini <vincenzo.gianferraripini () praxis ! it>
Date:       2006-08-31 17:08:05
Message-ID: 1774503.41221157044088989.JavaMail.root () mail01-mi

I was wrong: the ratio is "only" 1 to 1174 :-). But I think substring()
is still not really relevant... ;-)

Seriously, FYI, the reason for such a high ratio is that substring() works
at the string level, while toLowerCase() breaks down to a character-level
loop. In my profiling (using a Postage "release" run) the average token
length was about 70 chars (probably lots of binary data converted to
ASCII), given that the max token length to be considered is set to
90 chars. Perhaps we could reduce it to a smaller value.
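
To make the comparison concrete, here is a rough sketch of how the two
calls could be timed side by side. It is not the profiler setup I used:
the class and method names are made up for illustration, the 70-char
token only mimics the average length mentioned above, and a naive loop
like this is easily distorted by the JIT, so treat any numbers it prints
as indicative at best. Also note that, as far as I know, substring() on
older JVMs shares the backing char array, while newer JVMs copy the
range, so the ratio depends on the JVM.

```java
public class TokenCostSketch {

    // Builds a hypothetical upper-case ASCII token of the given length.
    static String makeToken(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('A' + (i % 26)));
        }
        return sb.toString();
    }

    // Returns the elapsed nanoseconds for 'iterations' toLowerCase() calls.
    static long timeToLowerCase(String token, int iterations) {
        long sink = 0; // accumulate a result so the loop body is not trivially dead
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += token.toLowerCase().length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink != (long) token.length() * iterations) {
            throw new IllegalStateException("unexpected result");
        }
        return elapsed;
    }

    // Returns the elapsed nanoseconds for 'iterations' substring(0, end) calls.
    static long timeSubstring(String token, int end, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += token.substring(0, end).length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink != (long) end * iterations) {
            throw new IllegalStateException("unexpected result");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        String token = makeToken(70);
        int iterations = 1000000;
        System.out.println("toLowerCase ns: " + timeToLowerCase(token, iterations));
        System.out.println("substring   ns: " + timeSubstring(token, 35, iterations));
    }
}
```

For anything more than a sanity check, a proper harness such as JMH
would be the better tool.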

But anyway, the CPU time involved has now dropped dramatically to very
small values, and it wasn't that dramatic to begin with.
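
For the record, the change Stefano suggested (computing
"tokenLower.substring(0, end)" once and reusing it) would look roughly
like the sketch below. This is not the actual BayesianAnalyzer code: the
class, the method and the surrounding logic are hypothetical and only
show the shape of the refactoring. Locale.ENGLISH is used here just to
keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SubstringHoistSketch {

    // Hypothetical stand-in for the tokenizing code; 'end' must be
    // between 0 and token.length().
    static List<String> collect(String token, int end) {
        String tokenLower = token.toLowerCase(Locale.ENGLISH);

        // Computed once and kept in a local variable instead of
        // calling tokenLower.substring(0, end) at each use.
        String prefix = tokenLower.substring(0, end);

        List<String> tokens = new ArrayList<String>();
        tokens.add(prefix);
        // Second use of the same value: add the full token only if it
        // differs from the prefix.
        if (!prefix.equals(tokenLower)) {
            tokens.add(tokenLower);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(collect("HelloWorld", 5));
    }
}
```

Given the ratio above, the gain is in readability rather than in CPU
time.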

Vincenzo

Bernd Fondermann wrote:

> wow. :-)
> ok, fine.
>
>  Bernd
>
> On 8/31/06, Vincenzo Gianferrari Pini
> <vincenzo.gianferraripini@praxis.it> wrote:
>
>> The substring cpu time is not relevant compared to toLowerCase: 1 to
>> 60000 ratio :-)
>>
>> Let's keep it as is.
>>
>> Vincenzo
>>
>> Stefano Bagnara wrote:
>>
>> > Maybe he's referring to "tokenLower.substring(0, end)".
>> > This appears twice in your code and could be moved to a local
>> > variable.
>> >
>> > Stefano
>> >
>> > Vincenzo Gianferrari Pini wrote:
>> >
>> >> Bernd,
>> >>
>> >> I don't understand what you mean by "duplicated substrings".
>> >>
>> >> If you mean the substrings added to the tokens ArrayList, only the
>> >> most significant of them (highest "probability strength") is later on
>> >> kept by the calling method (getTokenProbabilityStrengths). This is
>> >> the way it is expected to work.
>> >>
>> >> If you have seen something else please let me know.
>> >>
>> >> Vincenzo
>> >>
>> >>
>> >> Bernd Fondermann wrote:
>> >>
>> >>> Vincenzo,
>> >>>
>> >>> do you intend to also eliminate the duplicated substrings or does it
>> >>> not significantly lower memory/cpu load?
>> >>>
>> >>>  Bernd
>> >>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org
