List: npaci-rocks-discussion
Subject: [Rocks-Discuss] Issue with rocks ssh/reinstall/insert-ethers
From: Korey R <r.korey () gmail ! com>
Date: 2018-09-18 20:12:44
Message-ID: CAFn2yqzow9aE1D4XtG3TpxTBJORnMSOr+64f2eH3n3piFhh34A () mail ! gmail ! com
Hello All,
I have taken over administration of a group's Rocks cluster. I successfully
installed Rocks Cluster 7 with CentOS 7.4, which included installing 4
different node hardware configurations (GPUs, AMD CPU, Intel CPU, old and
new) via PXE. We had a scheduled power outage Monday and Tuesday morning
(today). After shutting down the nodes for the outage and then booting the
headnode, followed by the compute nodes, I was unable to qlogin or qsub jobs
to any of the nodes. When I ssh to the nodes from my user account, I am
prompted for my password. This is the same for all other users except root.
After entering a user's password I am taken to a plain bash prompt with the
error, "Could not chdir to home directory /home/user: No such file or
directory." The messages log (/var/log/messages) on all nodes following
login yields:
Sep 18 15:34:31 compute-1-0 systemd: Starting Session 8 of user user.
Sep 18 15:35:29 compute-1-0 automount[1974]: add_host_addrs: hostname
lookup failed: Name or service not known
When I check the /var/log/secure log corresponding to my user login I find:
Sep 18 15:34:13 compute-1-0 sshd[4965]: userauth_hostbased mismatch: client
sends frontend.local, but we resolve 10.1.1.1 to gateway
However, when I log into the node as root I find:
Sep 18 15:35:50 compute-1-0 sshd[5181]: Accepted publickey for root from
10.1.1.1 port 60314 ssh2: RSA
SHA256:*****************************************************************
Sep 18 15:35:50 compute-1-0 sshd[5181]: pam_unix(sshd:session): session
opened for user root by (uid=0)
(I have masked the SHA256 key fingerprint above.)
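The hostbased mismatch in /var/log/secure can be reproduced outside of sshd by
doing the reverse lookup by hand. A minimal Python sketch, assuming a plain
comparison of the claimed name against the primary name the address resolves
to; hostbased_name_matches is a hypothetical helper, and sshd's own check may
differ in detail:

```python
import socket


def hostbased_name_matches(client_name, client_ip, resolver=socket.gethostbyaddr):
    """Return (matches, resolved_name) for a reverse lookup of client_ip.

    Approximates the comparison behind 'userauth_hostbased mismatch': the
    name the client sends versus the name the address resolves to locally.
    The resolver argument exists only so the lookup can be replaced in tests.
    """
    try:
        resolved_name, _aliases, _addresses = resolver(client_ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False, None
    same = client_name.rstrip(".").lower() == resolved_name.rstrip(".").lower()
    return same, resolved_name
```

Run on a compute node as hostbased_name_matches("frontend.local", "10.1.1.1"),
this would be expected to report the same disagreement sshd logs if the local
resolver still maps 10.1.1.1 to "gateway".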
To remedy this I tried reinstalling the node, to no avail, and then attempted
to force a reinstall of the node by:
insert-ethers --remove compute-0-0
insert-ethers --rank 0 --rack 0
On the node, PXE starts fine, gets a response from the headnode, enters the
TFTP transfer, and then exits with "no DHCP response".
Meanwhile, on the headnode I see in /var/log/messages:
Sep 18 12:44:22 torrey dhcpd: DHCPDISCOVER from 00:30:48:bd:89:f6 via
enp3s0: network 10.1.0.0/16: no free leases
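dhcpd typically logs "no free leases" when it has no address it is allowed to
hand to the requesting MAC, so it helps to list which MACs are being refused
and compare them with the MAC column of "rocks list host interface". A minimal
Python sketch, assuming the log line format shown above; macs_without_lease is
a hypothetical helper:

```python
import re

# Pattern modelled on the dhcpd line quoted above.
NO_LEASE_RE = re.compile(
    r"dhcpd: DHCPDISCOVER from (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) "
    r"via (?P<iface>\S+): network (?P<network>\S+): no free leases",
    re.IGNORECASE,
)


def macs_without_lease(log_text):
    """Return the sorted, lower-cased MACs that dhcpd refused a lease."""
    # Log text pasted from mail may be hard-wrapped, so collapse all
    # whitespace (including newlines) to single spaces before matching.
    flattened = " ".join(log_text.split())
    return sorted({m.group("mac").lower() for m in NO_LEASE_RE.finditer(flattened)})
```

Feeding it the contents of /var/log/messages from the headnode gives a short
list of MACs to check against the cluster database.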
From reading the mailing list I looked at the syslog daemon via systemctl
status rsyslog.service -l and found it not running. I attempted a restart,
but rsyslog.service refused to restart, stop, or start, so I reinstalled
rsyslog through yum. After a reboot of the headnode the service is running
without errors, but this did not fix my issue.
I then created a new ssh key, on the assumption that the key for the user
might be incorrect, followed by:
rocks sync config
rocks sync user
with no apparent change.
I attempted to restart nfs, which completes without issue:
systemctl restart nfs
Then I ran make after removing the 411.mk file:
cd /var/411
rm 411.mk
make -C force
rocks run host "411get --all"
which reported:
Error: Could not reach a master server. Masters: [http://10.1.1.1:372/411.d/
(-1)]
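Since the 411get error names http://10.1.1.1:372/411.d/ as the master it could
not reach, a plain TCP connect test from a compute node can help separate a
network or service problem from a 411 key problem. A minimal Python sketch;
can_connect is a hypothetical helper, and the host and port are taken from the
error message above:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or name resolution failed.
        return False
```

For example, can_connect("10.1.1.1", 372) returning False on a compute node
would suggest that nothing is answering on that port on the private network.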
sge_qmaster appears to be running (ps -A | grep sge shows sge_qmaster),
which is supported by systemctl status sgemaster.headnode.service reporting
active (running).
Even more frustrating is my inability to grasp what has gone wrong,
especially since none of the compute nodes will automount /share/apps or
/export/home -> /home/, which has halted all research.
Thank you in advance for help with this issue. I am hoping I can learn more
about the cluster and fix this issue without reinstalling the entire
cluster...
With my best,
Korey
PS here is the output of:
rocks list network:
NETWORK   SUBNET           NETMASK        MTU   DNSZONE     SERVEDNS
private:  10.1.0.0         255.255.0.0    1500  local       True
public:   xxx.xxx.xxx.xxx  255.255.255.0  1500  rd.unr.edu  False
rocks list host:
HOST          MEMBERSHIP  CPUS  RACK  RANK  RUNACTION  INSTALLACTION
headnode:     Frontend    16    0     0     os         install
compute-1-1:  Compute     32    1     1     os         install
compute-2-0:  Compute     2     2     0     os         install
compute-0-1:  Compute     8     0     1     os         install
compute-0-2:  Compute     8     0     2     os         install
compute-0-3:  Compute     8     0     3     os         install
compute-0-4:  Compute     8     0     4     os         install
compute-0-5:  Compute     1     0     5     os         install
compute-1-4:  Compute     1     1     4     os         install
compute-1-5:  Compute     1     1     5     os         install
compute-2-1:  Compute     8     2     1     os         install
compute-1-0:  Compute     32    1     0     os         install
compute-0-0:  Compute     1     0     0     os         install
rocks list host interface front_end:
SUBNET   IFACE   MAC                IP              NETMASK        MODULE  NAME      VLAN  OPTIONS  CHANNEL
private  enp3s0  E0:3F:49:E6:84:BE  10.1.1.1        255.255.0.0    ------  headnode  ----  -------  -------
public   enp2s0  E0:3F:49:E6:84:BD  134.197.32.249  255.255.255.0  ------  headnode  ----  -------  -------
rocks list host interface compute-1-0:
SUBNET   IFACE     MAC                IP            NETMASK      MODULE  NAME         VLAN  OPTIONS  CHANNEL
private  enp1s0f0  2c:fd:a1:c7:80:a6  10.1.255.245  255.255.0.0  ------  compute-1-0  ----  -------  -------
-------  enp1s0f1  2c:fd:a1:c7:80:a7  ------------  -----------  ------  -----------  ----  -------  -------
more /etc/fstab from headnode:
#
# /etc/fstab
# Created by anaconda on Sun Jul 15 20:27:09 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=7c09a45c-d28b-4fdc-a109-2785a394c691 /       ext4 defaults 1 1
UUID=9c5cf285-8394-444b-a846-35756119be46 /boot   ext4 defaults 1 2
UUID=a2217370-4f00-40dc-bda3-84b325e5ad8d /export xfs  defaults 0 0
UUID=3338cc55-37c8-47b6-a7aa-90d1bece17ad swap    swap defaults 0 0
# The ram-backed filesystem for ganglia RRD graph databases.
tmpfs /var/lib/ganglia/rrds tmpfs size=8231786000,gid=nobody,uid=nobody,defaults 1 0
I hope this is enough information to get started on the right track :)