
List:       mesos-issues
Subject:    [jira] [Commented] (MESOS-10192) Recent Nvidia CUDA changes break Mesos GPU support
From:       "Qian Zhang (Jira)" <jira () apache ! org>
Date:       2020-10-13 2:34:00
Message-ID: JIRA.13330933.1601783387000.34506.1602556440048 () Atlassian ! JIRA


    [ https://issues.apache.org/jira/browse/MESOS-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212793#comment-17212793 ]

Qian Zhang commented on MESOS-10192:
------------------------------------

commit 301902be4f1332799cf3b3242cd29b4907c21c09
Author: Qian Zhang 
Date: Sat Oct 10 15:04:57 2020 +0800

Ignored the directory `/dev/nvidia-caps` when globbing Nvidia GPU devices.
 
 The directory `/dev/nvidia-caps` was introduced in CUDA 11.0. Ignore it,
 since we only care about the Nvidia GPU device files.
 
 Review: https://reviews.apache.org/r/72945
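
For readers without the diff at hand, the idea of skipping non-device
entries while globbing can be sketched as follows. This is NOT the code
from the commit above, whose diff is not reproduced in this message. It is
a minimal illustration that filters glob results by file type using POSIX
`glob()` and `stat()`, instead of skipping `/dev/nvidia-caps` by name. The
function names `isCharacterDevice` and `globCharacterDevices` are
hypothetical and do not exist in Mesos.

```cpp
#include <glob.h>
#include <sys/stat.h>

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a Mesos function): returns true only if
// `path` exists and refers to a character device. A directory such as
// `/dev/nvidia-caps` yields false.
bool isCharacterDevice(const std::string& path)
{
  struct stat s;
  if (::stat(path.c_str(), &s) != 0) {
    return false;
  }
  return S_ISCHR(s.st_mode);
}

// Hypothetical helper (not a Mesos function): expands `pattern` with
// POSIX glob() and keeps only the matches that are character devices.
std::vector<std::string> globCharacterDevices(const std::string& pattern)
{
  std::vector<std::string> devices;

  glob_t matches = {};
  if (::glob(pattern.c_str(), 0, nullptr, &matches) == 0) {
    for (std::size_t i = 0; i < matches.gl_pathc; ++i) {
      const std::string path = matches.gl_pathv[i];
      if (isCharacterDevice(path)) {
        devices.push_back(path);
      }
    }
  }

  // Release whatever glob() allocated; the struct was zero-initialized,
  // so this is also fine when nothing matched.
  ::globfree(&matches);

  return devices;
}
```

Under these assumptions, calling `globCharacterDevices("/dev/nvidia*")` on
a host with CUDA 11.0 would return the device files and leave out the
`/dev/nvidia-caps` directory, because a directory is not a character
device.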

> Recent Nvidia CUDA changes break Mesos GPU support
> --------------------------------------------------
> 
> Key: MESOS-10192
> URL: https://issues.apache.org/jira/browse/MESOS-10192
> Project: Mesos
> Issue Type: Bug
> Components: agent, containerization, gpu
> Reporter: Greg Mann
> Assignee: Qian Zhang
> Priority: Major
> Labels: GPU, containerization, containerizer, gpu
> 
> Recently it seems that the layout of the Nvidia device files has changed:
> https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
> This prevents GPU tasks from launching:
> {noformat}
> W0929 17:27:21.002178 65691 http.cpp:3436] Failed to launch container c08e1fc7-53c4-427e-a1a1-85b770e77d69.738440a3-f4cc-42ce-8978-418ba0011160: Failed to copy device '/dev/nvidia-caps': Failed to get source dev: Not a special file: /dev/nvidia-caps
> {noformat}
> due to this code, which detects the nvidia device files:
> https://github.com/apache/mesos/blob/8700dd8d5ece658804d7b7a40863800dcc5c72bc/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp#L438-L454
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

