
List:       solr-user
Subject:    Re: Solr 7.6.0 - won't elect leader
From:       Joe Obernberger <joseph.obernberger@gmail.com>
Date:       2019-05-30 15:02:45
Message-ID: 689d3c43-8e3b-4a5c-cbf1-8e6cd42430ce@gmail.com

Thank you Walter.  I ended up dropping the collection.  We have two 
primary collections - one holds all the data (100 shards, no replicas), and 
one holds the last 30 days of data (40 shards, 3 replicas each).  We hardly 
ever have any issues with the replica-less collection.  I tried bringing 
the nodes down several times.  I then updated the ZooKeeper node, writing 
the necessary information into it with a leader selected, and 
restarted the nodes again - no luck.
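
(For reference, writing that leader znode by hand with the plain ZooKeeper 
Java client looks roughly like the sketch below.  The connect string, host 
names, and the exact ZkNodeProps fields are assumptions - copy the JSON 
layout from a healthy shard's leader node first.  Solr normally creates 
this node itself during election, so a hand-written one is a stopgap at best.)

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class SetLeaderNode {
        public static void main(String[] args) throws Exception {
            // Block until the session is actually connected before issuing requests.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            String path = "/collections/UNCLASS_30DAYS/leaders/shard31/leader";
            // Assumed leader props (ZkNodeProps JSON) - verify field names and
            // values against a healthy shard's leader znode before writing anything.
            String json = "{\"core\":\"UNCLASS_30DAYS_shard31_replica_n182\","
                    + "\"core_node_name\":\"core_node185\","
                    + "\"base_url\":\"http://elara:9100/solr\","
                    + "\"node_name\":\"elara:9100_solr\"}";
            byte[] data = json.getBytes(StandardCharsets.UTF_8);

            Stat stat = zk.exists(path, false);
            if (stat == null) {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                zk.setData(path, data, stat.getVersion());
            }
            zk.close();
        }
    }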

-Joe

On 5/30/2019 10:42 AM, Walter Underwood wrote:
> We had a 6.6.2 prod cluster get into a state like this. It did not have an
> overseer, so any command just sat in the overseer queue. After I figured that
> out, I could see a bunch of queued stuff in the tree view under /overseer.
> That included an ADDROLE command to set an overseer. Sigh.
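
(Aside: the queued ADDROLE Walter mentions corresponds to a Collections API 
call like the sketch below, with a placeholder host and node name - though 
as he notes, with no live overseer the command just sits in the queue, so 
the restart was the actual fix.)

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class AddOverseerRole {
        public static void main(String[] args) throws Exception {
            // ADDROLE assigns the preferred overseer role to a node (Collections API).
            String url = "http://solrhost:8983/solr/admin/collections"
                    + "?action=ADDROLE&role=overseer&node=solrhost:8983_solr";
            HttpResponse<String> rsp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(rsp.statusCode() + "\n" + rsp.body());
        }
    }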
> Fixed it by shutting down all the nodes, then bringing up one. That one
> realized there was no overseer and assumed the role. Then we brought up the
> rest of the nodes.
> I do not know how it got into that situation. We had some messed up
> networking conditions where I could HTTP from node A to port 8983 on node B,
> but it would hang when I tried that from B to A. This is all in AWS.
> Yours might be different.
> 
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> > On May 30, 2019, at 5:47 AM, Joe Obernberger <joseph.obernberger@gmail.com> wrote:
> > More info - it looks like a ZooKeeper node got deleted somehow:
> > NoNode for /collections/UNCLASS_30DAYS/leaders/shard31/leader
> > 
> > I then created that node using "solr zk mkroot", and now I get this error:
> > 
> > org.apache.solr.common.SolrException: Error getting leader from zk for shard shard31
> >     at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1299)
> >     at org.apache.solr.cloud.ZkController.register(ZkController.java:1150)
> >     at org.apache.solr.cloud.ZkController.register(ZkController.java:1081)
> >     at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:187)
> >     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >     at java.lang.Thread.run(Thread.java:748)
> > Caused by: org.apache.solr.common.SolrException: Could not get leader props
> >     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1346)
> >     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1310)
> >     at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1266)
> >     ... 7 more
> > Caused by: java.lang.NullPointerException
> >     at org.apache.solr.common.util.Utils.fromJSON(Utils.java:239)
> >     at org.apache.solr.common.cloud.ZkNodeProps.load(ZkNodeProps.java:92)
> >     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1328)
> >     ... 9 more
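
(The NullPointerException in Utils.fromJSON is consistent with the znode now 
existing but holding no data: "solr zk mkroot" creates an empty node, while 
ZkNodeProps.load() expects JSON bytes.  A quick way to confirm, sketched with 
the plain ZooKeeper client and a placeholder connect string:)

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class CheckLeaderNode {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            Stat stat = new Stat();
            byte[] data = zk.getData("/collections/UNCLASS_30DAYS/leaders/shard31/leader", false, stat);
            // dataLength == 0 (or null data) would produce exactly this NPE in ZkNodeProps.load().
            System.out.println("dataLength=" + stat.getDataLength());
            System.out.println(data == null ? "(null)" : new String(data, StandardCharsets.UTF_8));
            zk.close();
        }
    }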
> > 
> > Can I manually enter information for the leader? How would I get that?
> > 
> > -Joe
> > 
> > On 5/30/2019 8:39 AM, Joe Obernberger wrote:
> > > Hi All - I have a 40-node cluster that has been running great for a long
> > > while, but it all came down due to an OOM.  I adjusted the parameters and
> > > restarted, but one shard with 3 replicas (all NRT) will not elect a leader.
> > > I see messages like:
> > > 2019-05-30 12:35:30.597 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.SyncStrategy Sync replicas to http://elara:9100/solr/UNCLASS_30DAYS_shard31_replica_n182/
> > > 2019-05-30 12:35:30.597 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr START replicas=[http://enceladus:9100/solr/UNCLASS_30DAYS_shard31_replica_n180/, http://rosalind:9100/solr/UNCLASS_30DAYS_shard31_replica_n184/] nUpdates=100
> > > 2019-05-30 12:35:30.651 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr  Received 100 versions from http://enceladus:9100/solr/UNCLASS_30DAYS_shard31_replica_n180/ fingerprint:null
> > > 2019-05-30 12:35:30.652 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr  Our versions are too old. ourHighThreshold=1634891841359839232 otherLowThreshold=1634892098551414784 ourHighest=1634892003501146112 otherHighest=1634892708023631872
> > > 2019-05-30 12:35:30.652 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.u.PeerSync PeerSync: core=UNCLASS_30DAYS_shard31_replica_n182 url=http://elara:9100/solr DONE. sync failed
> > > 2019-05-30 12:35:30.652 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.SyncStrategy Leader's attempt to sync with shard failed, moving to the next candidate
> > > 2019-05-30 12:35:30.683 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than us - going back into recovery
> > > 2019-05-30 12:35:30.693 INFO  (zkCallback-7-thread-3) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader parent node, won't remove previous leader registration.
> > > 2019-05-30 12:35:30.694 WARN (updateExecutor-3-thread-4-processing-n:elara:9100_solr x:UNCLASS_30DAYS_shard31_replica_n182 c:UNCLASS_30DAYS s:shard31 r:core_node185) [c:UNCLASS_30DAYS s:shard31 r:core_node185 x:UNCLASS_30DAYS_shard31_replica_n182] o.a.s.c.RecoveryStrategy Stopping recovery for core=[UNCLASS_30DAYS_shard31_replica_n182] coreNodeName=[core_node185]
> > > and
> > > 
> > > 2019-05-30 12:25:39.522 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ActionThrottle Throttling leader attempts - waiting for 136ms
> > > 2019-05-30 12:25:39.672 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with higher term participated in leader election
> > > 2019-05-30 12:25:39.672 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than us - going back into recovery
> > > 2019-05-30 12:25:39.677 INFO  (zkCallback-7-thread-1) [c:UNCLASS_30DAYS s:shard31 r:core_node187 x:UNCLASS_30DAYS_shard31_replica_n184] o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader parent node, won't remove previous leader registration.
> > > and
> > > 
> > > 2019-05-30 12:26:39.820 INFO  (zkCallback-7-thread-5) [c:UNCLASS_30DAYS s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] o.a.s.c.ShardLeaderElectionContext Can't become leader, other replicas with higher term participated in leader election
> > > 2019-05-30 12:26:39.820 INFO  (zkCallback-7-thread-5) [c:UNCLASS_30DAYS s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] o.a.s.c.ShardLeaderElectionContext There may be a better leader candidate than us - going back into recovery
> > > 2019-05-30 12:26:39.826 INFO  (zkCallback-7-thread-5) [c:UNCLASS_30DAYS s:shard31 r:core_node183 x:UNCLASS_30DAYS_shard31_replica_n180] o.a.s.c.ShardLeaderElectionContextBase No version found for ephemeral leader parent node, won't remove previous leader registration.
> > > I've tried FORCELEADER, but it had no effect.  I also tried adding a
> > > shard, but that one didn't come up either.  The index is on HDFS.
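
(For anyone searching the archives: FORCELEADER is a Collections API action, 
invoked roughly as sketched below - placeholder host, with the collection 
and shard names from this thread.  It had no effect here, per the above.)

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ForceLeader {
        public static void main(String[] args) throws Exception {
            // FORCELEADER tries to force-elect a leader for a shard stuck without one.
            String url = "http://solrhost:9100/solr/admin/collections"
                    + "?action=FORCELEADER&collection=UNCLASS_30DAYS&shard=shard31";
            HttpResponse<String> rsp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(rsp.statusCode() + "\n" + rsp.body());
        }
    }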
> > > Help!
> > > 
> > > -Joe
> > > 
> 
> 

