List:       grid-engine-dev
Subject:    Re: [GE dev] Qselect problem with sge6?
From:       Andy Schwierskott <andy.schwierskott () sun ! com>
Date:       2004-07-20 9:52:06
Message-ID: Pine.SOC.4.60.0407201151120.2368 () sr-ergb01-01

Jeff,

I assume you have a core file (or can get one). Could you please attach the
stack trace to the issue?

Thanks,
Andy

On Mon, 19 Jul 2004, Beadles, Jeff wrote:

> 
> It must be strange-problem day for me, but if I run:
> $ qselect -l arch=sol-sparc64
> critical error: !!!!!!!!!! lGetList(): got NULL element for EH_load_list !!!!!!!!!!
> Aborted
> 
> However this works fine:
> 
> $ qhost -l arch=sol-sparc64
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> b2500                   sol-sparc64  1.00  0.00    4.0G  420.0M    4.0G    1.0M
> bertha                  sol-sparc64  8.00  0.02   16.0G    2.3G   32.0G     0.0
> blakswan                sol-sparc64  4.00  0.03    4.0G    1.5G    4.0G   49.0M
> devalssw                sol-sparc64  2.00  0.19    1.0G  713.0M    2.0G  144.0M
> lab240                  sol-sparc64  2.00  0.00    4.0G  498.0M    2.0G     0.0
> matebert                sol-sparc64  1.00     -    2.0G       -    4.0G       -
> mirror                  sol-sparc64  1.00  0.04  512.0M  136.0M    2.0G   28.0M
> scotch                  -               -     -       -       -       -       -
> srsc101                 sol-sparc64  4.00  0.01    4.0G  470.0M    8.9G     0.0
> ss5svr1                 sol-sparc64  2.00  0.05    4.0G  643.0M    4.0G     0.0
> ...
> 
> I believe that this has to do with hosts that are down and haven't
> reported their load/config information.
> I've taken a look at the abort, and it looks like the code is abort()ing
> rather than returning 0 matches.
> In particular, in libs/cull/cull_multitype.c, there are several places
> with code like:
> const char *lGetHost(const lListElem *ep, int name)
> {
>    int pos;
>    DENTER(CULL_BASIS_LAYER, "lGetHost");
> 
>    if (!ep) {
>       CRITICAL((SGE_EVENT, MSG_CULL_POINTER_GETHOST_NULLELEMENTFORX_S,
>                 lNm2Str(name)));
>       DEXIT;
>       abort();
>    }
>    ...
> 
> I think that it should read:
> const char *lGetHost(const lListElem *ep, int name)
> {
>    int pos;
>    DENTER(CULL_BASIS_LAYER, "lGetHost");
> 
>    if (!ep) {
>       DEXIT;   /* balance the DENTER above before returning early */
>       return(NULL);
>    }
>    ...
> 
> There are 3 places where this shows up: lGetHost(), lGetList(), and lGetSubStr().
> 
> Shouldn't it be returning 'no matches' rather than dumping core?
> 
> Attached is an updated version of cull_multitype.c, that I would
> appreciate someone with source access reviewing & checking in.
> We've been running with these changes for a year on V5, and now with V6.
> 
> Regards,
> 	-Jeff

