
List:       relax-devel
Subject:    Re: multi processor scaling
From:       "Gary S. Thompson" <garyt () domain ! hid>
Date:       2007-04-20 16:05:53
Message-ID: 4628E4E1.9030302 () domain ! hid

Edward d'Auvergne wrote:

> The scaling is looking awesome!  Obviously the MC sims will need work
> and other tests will be required.  But the functionality of the branch
> is looking very promising and exciting.  I have more responses below.
>
> Please expect to have delayed responses to the previous messages.  I
> will respond to your posts Gary, but I'm three days away from leaving
> Australia and am flat out organising and packing.  


I wasn't really expecting a reply for a while; I know what it's like 
leaving a continent (I haven't done it myself, but I have watched lots of 
other people do it ;-)

> I'll then be
> spending a week in London before heading to Germany.


Ironically, you and Chris won't meet, but we should arrange for you to 
come and visit Leeds sometime; Steve (Homans) would be quite keen.

> It could be a
> few weeks before I'll be able to properly respond to posts.


No problem

>
>
> On 4/20/07, Gary S. Thompson <garyt@domain.hid> wrote:
>
>>
>>  Dear All
>>      I have now had a chance to do some true multitasking on our 
>> local cluster, with real overhead from interprocess communication, and 
>> the results are as follows:
>>
>>  processors    min (mins)   eff (%)    mc (mins)   eff (%)   grid (mins)   eff (%)
>>           1            18     100             80     100            134     100
>>           2             9     100
>>           4             5      90
>>           8             3      75
>>          16             1     112.5
>>          32             1      56.25           8      31.25          4     104.6
>>
>>
>>  and the picture that speaks 1000 words
>>
>>
>>
>>  key:  top graph, black line - achieved runtimes
>>        top graph, red line   - expected runtimes with perfect scaling 
>>                                efficiency
>>        bottom graph          - scaling efficiency
>>
>>  some notes
>>
>>
>>  0. data was collected on one of Chris's small data sets containing 
>> 28 residues, not all of which are active for minimisation
>>     columns:
>>          processors  - no. of slave MPI processors
>>          min         - time for a minimisation of models m1-m9 with a 
>>                        fixed diffusion tensor
>>          eff         - approximate parallel efficiency, 
>>                        expected runtime / actual runtime
>
>
> It would be interesting to see if the efficiencies all converge to
> 100% when a larger number of spin systems are minimised.  Maybe
> duplicating the data a number of times creating an artificially large
> protein would be useful in that regard.


good idea
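
For reference, the eff column in the table above is just expected runtime 
over actual runtime, with the expected runtime taken as the 
single-processor time divided by the number of processors. A minimal 
sketch of that calculation (not the exact code used):

def efficiency(t_single, t_actual, n_procs):
    """Parallel efficiency in %, i.e. (t_single / n_procs) / t_actual."""
    expected = float(t_single) / n_procs
    return 100.0 * expected / t_actual

print(efficiency(18, 5, 4))     # min run on 4 processors   -> 90.0
print(efficiency(134, 4, 32))   # grid run on 32 processors -> ~104.7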

>
>
>>          mc          - 256 Monte Carlo calculations
>>          eff         - efficiency of the above
>>          grid        - a grid search on an anisotropic diffusion 
>>                        tensor with 6 steps
>
>
> Do you mean the spheroid (axially symmetric) or the ellipsoid, as both
> are anisotropic?  


fully anisotropic

> I would recommend increasing the number of steps in
> this grid search if MPI is running.

yes, that's true, but this number of steps is good for testing ;-)

> With that type of scaling
> efficiency, I would recommend 11 or 21 increments per dimension on a
> 32 processor cluster.  A drop from 134 min to 4 min is huge!

indeed (I am only allowed to use 40 processors, but I may try and sneak in 
a large run when I really go for some scaling measurements)
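
Just to spell out how quickly the grid grows with more increments, here is 
a rough sketch (treating the full ellipsoid grid as six-dimensional is my 
assumption, not something stated above):

def grid_points(inc, dims=6):
    """Total grid points for inc increments in each of dims dimensions."""
    return inc ** dims

for inc in (6, 11, 21):
    total = grid_points(inc)
    print("%2d increments: %10d points, ~%d per slave on 32 processors"
          % (inc, total, total // 32))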

>
>
>>          eff         - efficiency of the above
>>       tests were run on a cluster of Opterons using gigabit ethernet 
>> and MPI
>>  1. these results are crude wall times as measured by Python's 
>> time.time function on the master, but they do not include startup and 
>> shutdown overhead
>
>
> They should be more than accurate enough for these types of comparison.
>
I agree, but I just wanted to note what methodology I am using.
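
The timing was essentially just a wall-clock wrapper like the following (a 
minimal sketch, not the exact code; run_minimisation is a hypothetical 
stand-in for the real call):

import time

def timed(func, *args, **kwargs):
    """Crude wall-clock timing of a single call on the master; interpreter
    startup and shutdown are not included."""
    start = time.time()
    result = func(*args, **kwargs)
    print("wall time: %.1f s" % (time.time() - start))
    return result

# usage, with a hypothetical name for the real minimisation call:
# timed(run_minimisation, models=['m1', 'm9'])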

>
>>  2. these tests are single-point measurements; there are no statistics
>
>
> For now statistics are unnecessary.
>
>
>>  3. timings were rounded to the nearest minute, so for example we must 
>> consider the data points for more than 16 processors in the min run to 
>> be suspect
>
>
> That would explain the 56% efficiency for 32 processors.
>

indeed

>
>
>>  The results also highlight some interesting considerations
>>
>>  1. our local cluster has very poor disk I/O, with the result that when 
>> I first ran the calculations I saw no multiprocessor improvements on 
>> the min run (in actual fact it got worse!). I got round this for this 
>> crude test by switching off virtually all text output from the 
>> various minimisation commands. Now obviously this isn't a long-term 
>> solution, but I can think of other methods, e.g. using an output 
>> thread on the master or output batching, that would improve these 
>> results.
>
>
> Both options sound good.  This type of threading is not very
> complicated, although debugging blocked threads is hell.  Sending the
> minimisation print out in one hit at the end would be very useful as
> well.
>
indeed (a rough sketch of the output-thread idea is below)
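
To make the output-thread option concrete, a minimal sketch (my assumption 
about one way to do it, not anything already in the branch): the 
MPI-servicing code pushes slave printout onto a queue, and a background 
thread drains it to stdout, so slow disk I/O never blocks the MPI calls.

import sys
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

output_queue = queue.Queue()

def output_worker():
    """Drain queued text to stdout until a None sentinel arrives."""
    while True:
        text = output_queue.get()
        if text is None:
            break
        sys.stdout.write(text)
        sys.stdout.flush()

writer = threading.Thread(target=output_worker)
writer.start()

# The MPI-servicing thread would simply do:
output_queue.put("minimisation printout from a slave\n")

# ... and at shutdown:
output_queue.put(None)
writer.join()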

>
>>  2. comparison of the results from the grid calculation and the other 
>> calculations is quite informative. Clearly the grid results are 
>> excellent. I believe the difference is because, for the other 
>> calculations, I am returning individual subtask results to the master 
>> as they complete, and the resulting overhead of waiting for the master 
>> is a problem. To make this clearer, here is an example: in the case of 
>> the mc run I take the 256 mc runs and distribute a batch of 8 to each 
>> processor (in the case of a 32 processor run). I then return the 
>> results individually as they complete. I believe this can lead to 
>> access to the master becoming the bottleneck (this is most probably due 
>> to output overhead on stdout again, though contention due to the 
>> coherence of the calculation lengths could also be a problem).
>
>
> In the Monte Carlo simulations, all of the output of the minimisations
> is suppressed.  Therefore the sending of the minimisation print out
> shouldn't be the issue as nothing needs to be sent.  There must be
> something else at play.  Finding out what this is exactly is important
> before an investment into the threading or batching is made.
>
that's true, I will have to look

>> In the case of the grid there are no subtasks, as the grid is almost 
>> ideally subdivided by processor, so only one task is run on each slave.
>
>
> Do the MC sims have the same scaling efficiency if only one simulation
> is sent to each processor at once?  Does the efficiency increase or
> decrease?
>

don't know (the current split is sketched below)
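
A minimal sketch of the current split, just to be explicit about the 
numbers (the function name is illustrative, not relax code); setting 
batch_size to 1 would give the one-simulation-at-a-time variant:

def split_sims(n_sims=256, n_slaves=32, batch_size=None):
    """Split the Monte Carlo simulation indices into batches for the slaves."""
    if batch_size is None:
        batch_size = n_sims // n_slaves      # 8 simulations per batch here
    return [range(i, min(i + batch_size, n_sims))
            for i in range(0, n_sims, batch_size)]

batches = split_sims()
print(len(batches))             # 32 batches
print(len(list(batches[0])))    # of 8 simulations each
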
>
>> I can see at least two answers to this. One is to batch the return of 
>> results so that all results get returned at once, and the second is to 
>> have an output thread on the master, separate from the thread servicing 
>> MPI calls, so that processing of returned data doesn't block the master 
>> and thus the rest of the cluster.
>
>
> As I mentioned above, both would be useful.  However it would be good
> to know which will be the most beneficial for increasing efficiency
> before implementation.  It could be that one or the other will not
> result in any significant improvements.
>
I can implement both as tests fairly quickly, and that may be easier than 
trying to figure it out mentally or with tracing; the batched-return idea 
is sketched below.
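
A minimal sketch of the batched return (again my assumption about one way 
to do it, not code from the branch): each slave accumulates its subtask 
results and captured printout, then sends a single message back to the 
master instead of one message per subtask.

def run_batch_on_slave(commands, mpi_send_to_master):
    """Run a batch of subtask commands and return everything in one hit.

    commands is a list of callables returning (result, printout);
    mpi_send_to_master is a hypothetical stand-in for whatever the MPI
    layer provides for sending one object to the master (e.g. comm.send
    in mpi4py).
    """
    results = []
    captured_output = []
    for command in commands:
        result, text = command()
        results.append(result)
        captured_output.append(text)

    # One message back to the master instead of one per subtask.
    mpi_send_to_master({'results': results,
                        'output': ''.join(captured_output)})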

>
>>  I have some more comments on the design of the current minimisation 
>> interface, how text output from the commands is controlled, and unit 
>> testing, but these will have to follow in another message later on
>>
>>
>>  regards
>>  gary
>>
>>  n.b. if the picture doesn't display well, my apologies
>
>
> The picture at 
> https://mail.gna.org/public/relax-devel/2007-04/msg00048.html
> displays perfectly.  I've been thinking about how you could release an
> MPI relax version prior to the merging of a patch into the 1.3 line.
> You could release a relax-1.3.0-gt version (gt for Gary Thompson).
> This could itself have a few versions associated with it (I don't know
> how they would be called though).  What do you think Gary?



sounds good to me

>
> Cheers,
>
> Edward
>
> .
>


-- 
-------------------------------------------------------------------
Dr Gary Thompson
Astbury Centre for Structural Molecular Biology,
University of Leeds, Astbury Building,
Leeds, LS2 9JT, West-Yorkshire, UK             Tel. +44-113-3433024
email: garyt@domain.hid                   Fax  +44-113-2331407
-------------------------------------------------------------------



