
List:       npaci-rocks-discussion
Subject:    [Rocks-Discuss]  Issue with rocks ssh/reinstall/insert-ethers
From:       Korey R <r.korey () gmail ! com>
Date:       2018-09-18 20:12:44
Message-ID: CAFn2yqzow9aE1D4XtG3TpxTBJORnMSOr+64f2eH3n3piFhh34A () mail ! gmail ! com

Hello All,

I have taken over the duties of a Rocks group cluster. I successfully
installed Rocks Cluster 7 with CentOS 7.4, which included installing 4
different node hardware configurations (GPUs, AMD CPU, Intel CPU, old and
new) via PXE. We had a scheduled power outage Monday and Tuesday morning
(today). After shutting the nodes down for the outage, I booted the
frontend and then the compute nodes, and found I was unable to qlogin or
qsub jobs to any of the nodes. When attempting to ssh to the nodes from my
user account I am prompted for my password, and the same holds for all
other users except root. After entering a user's password I am taken to a
bare bash prompt with the error, "Could not chdir to home directory
/home/user: No such file or directory." The messages log
(/var/log/messages) on all nodes shows, following login:
Sep 18 15:34:31 compute-1-0 systemd: Starting Session 8 of user user.
Sep 18 15:35:29 compute-1-0 automount[1974]: add_host_addrs: hostname
lookup failed: Name or service not known
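The add_host_addrs failure makes me suspect name resolution on the nodes,
so as a first check (assuming the frontend's private name is
frontend.local, as in the sshd log below) I would run on a compute node:
getent hosts frontend.local
getent hosts 10.1.1.1
grep -i frontend /etc/hosts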

When I check /var/log/secure for my user login I find:
Sep 18 15:34:13 compute-1-0 sshd[4965]: userauth_hostbased mismatch: client
sends frontend.local, but we resolve 10.1.1.1 to gateway
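If I read that right, sshd's reverse lookup of 10.1.1.1 is returning
"gateway" instead of the frontend's name, which would break hostbased
authentication for regular users. A check I plan to run on the node
(assuming any stale entry lives in /etc/hosts rather than in DNS):
getent hosts 10.1.1.1
grep -n "10.1.1.1\|gateway" /etc/hosts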

However, when I log into the node as root I find:
Sep 18 15:35:50 compute-1-0 sshd[5181]: Accepted publickey for root from
10.1.1.1 port 60314 ssh2: RSA
SHA256:*****************************************************************
Sep 18 15:35:50 compute-1-0 sshd[5181]: pam_unix(sshd:session): session
opened for user root by (uid=0) (I redacted the SHA256 fingerprint above)

To remedy this I tried reinstalling a node, to no avail, and then
attempted to force the reinstall with:
insert-ethers --remove compute-0-0
insert-ethers --rank 0 --rack 0

On the node, PXE starts fine and gets a response from the headnode, enters
the TFTP transfer, and then exits with "no DHCP response".

Meanwhile, on the headnode I see in /var/log/messages:
Sep 18 12:44:22 torrey dhcpd: DHCPDISCOVER from 00:30:48:bd:89:f6 via
enp3s0: network 10.1.0.0/16: no free leases
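Since "no free leases" suggests the node's MAC never made it back into the
generated DHCP config, my plan is to check along these lines on the
headnode (assuming insert-ethers writes the entry into /etc/dhcp/dhcpd.conf
and that rocks sync config regenerates it):
grep -i 00:30:48:bd:89:f6 /etc/dhcp/dhcpd.conf
rocks sync config
systemctl restart dhcpd
systemctl status dhcpd -l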

From reading the mailing list, I looked at the syslog daemon via
systemctl status rsyslog.service -l and found it not running.
rsyslog.service refused to restart (stop/start, etc.), so I reinstalled
rsyslog through yum. After a reboot of the headnode the service is running
without errors, but that did not fix my issue.

I then created a new SSH key, assuming the key corresponding to the user
might be incorrect, followed by:
rocks sync config
rocks sync users
with no apparent change.
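To see whether the user database is actually reaching the nodes at all, a
spot check like the following (my username here is just a stand-in):
rocks run host compute-1-0 "id korey"
rocks run host compute-1-0 "getent passwd korey"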
I restarted NFS, which does so without issue:
systemctl restart nfs
Then I removed the 411.mk file and forced a 411 rebuild:
cd /var/411
rm 411.mk
make force
rocks run host "411get --all"
which reported:
Error: Could not reach a master server. Masters: [http://10.1.1.1:372/411.d/
(-1)]
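Given that 411get cannot reach the master, I want to confirm that
something on the frontend is actually listening on port 372 (assuming 411
is served over plain HTTP there, per the URL in the error):
ss -tlnp | grep 372                          # on the headnode
curl -s http://10.1.1.1:372/411.d/ | head    # from a compute node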

sge_qmaster appears to be running per ps -A | grep sge -> sge_qmaster,
which is supported by systemctl status sgemaster.headnode.service showing
active (running).
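As a sanity check on the scheduler side (I am guessing the node-side
service is named analogously to the frontend's, i.e. sgeexecd.<nodename>):
qhost    # hosts whose sge_execd reports in show load; unreachable ones show '-'
ssh compute-1-0 systemctl status sgeexecd.compute-1-0.service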

Even more frustrating is that I cannot grasp what has gone wrong,
especially since none of the compute nodes will automount /share/apps or
/export/home -> /home/, which has halted all research.
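As a basic check of the NFS side (the mountpoint below is just a scratch
name I made up), from a compute node:
showmount -e 10.1.1.1
mkdir -p /mnt/nfstest && mount -t nfs 10.1.1.1:/export/home /mnt/nfstest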
Thank you in advance for help with this issue; I am hoping I can learn
more about the cluster and fix it without reinstalling the entire
cluster...

With my best,
Korey

PS here is the output of:
rocks list network:
NETWORK  SUBNET          NETMASK       MTU   DNSZONE    SERVEDNS
private: 10.1.0.0        255.255.0.0   1500  local      True
public:  xxx.xxx.xxx.xxx 255.255.255.0 1500  rd.unr.edu False

rocks list host:
HOST         MEMBERSHIP CPUS RACK RANK RUNACTION INSTALLACTION
headnode:    Frontend   16   0    0    os        install
compute-1-1: Compute    32   1    1    os        install
compute-2-0: Compute    2    2    0    os        install
compute-0-1: Compute    8    0    1    os        install
compute-0-2: Compute    8    0    2    os        install
compute-0-3: Compute    8    0    3    os        install
compute-0-4: Compute    8    0    4    os        install
compute-0-5: Compute    1    0    5    os        install
compute-1-4: Compute    1    1    4    os        install
compute-1-5: Compute    1    1    5    os        install
compute-2-1: Compute    8    2    1    os        install
compute-1-0: Compute    32   1    0    os        install
compute-0-0: Compute    1    0    0    os        install

rocks list host interface front_end:
SUBNET  IFACE  MAC               IP             NETMASK       MODULE NAME     VLAN OPTIONS CHANNEL
private enp3s0 E0:3F:49:E6:84:BE 10.1.1.1       255.255.0.0   ------ headnode ---- ------- -------
public  enp2s0 E0:3F:49:E6:84:BD 134.197.32.249 255.255.255.0 ------ headnode ---- ------- -------

rocks list host interface compute-1-0:
SUBNET  IFACE    MAC               IP           NETMASK     MODULE NAME        VLAN OPTIONS CHANNEL
private enp1s0f0 2c:fd:a1:c7:80:a6 10.1.255.245 255.255.0.0 ------ compute-1-0 ---- ------- -------
------- enp1s0f1 2c:fd:a1:c7:80:a7 ------------ ----------- ------ ----------- ---- ------- -------

more /etc/fstab from headnode:

#
# /etc/fstab
# Created by anaconda on Sun Jul 15 20:27:09 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=7c09a45c-d28b-4fdc-a109-2785a394c691 /       ext4 defaults 1 1
UUID=9c5cf285-8394-444b-a846-35756119be46 /boot   ext4 defaults 1 2
UUID=a2217370-4f00-40dc-bda3-84b325e5ad8d /export xfs  defaults 0 0
UUID=3338cc55-37c8-47b6-a7aa-90d1bece17ad swap    swap defaults 0 0

# The ram-backed filesystem for ganglia RRD graph databases.
tmpfs /var/lib/ganglia/rrds tmpfs size=8231786000,gid=nobody,uid=nobody,defaults 1 0

Hope this is an adequate level of information to get off on the right
track :)

