
List:       linaro-mm-sig
Subject:    Re: [Linaro-mm-sig] [Mesa-dev] [PATCH 01/11] drm/amdgpu: Comply with implicit fencing rules
From:       Daniel Vetter <daniel@ffwll.ch>
Date:       2021-05-26 13:51:45
Message-ID: <CAKMK7uFQczzpkdSLB1H-dVySGTiap2ONETZCaz5ErE1sca8YWQ@mail.gmail.com>

On Wed, May 26, 2021 at 3:32 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
> 
> > On 25.05.21 at 17:23, Daniel Vetter wrote:
> > On Tue, May 25, 2021 at 5:05 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> > > Hi Daniel,
> > > 
> > > On 25.05.21 at 15:05, Daniel Vetter wrote:
> > > > Hi Christian,
> > > > 
> > > > On Sat, May 22, 2021 at 10:30:19AM +0200, Christian König wrote:
> > > > > On 21.05.21 at 20:31, Daniel Vetter wrote:
> > > > > This works by adding the fence of the last eviction DMA operation to BOs
> > > > > when their backing store is newly allocated. That's what the
> > > > > ttm_bo_add_move_fence() function you stumbled over is good for:
> > > > > https://elixir.bootlin.com/linux/v5.13-rc2/source/drivers/gpu/drm/ttm/ttm_bo.c#L692
> > > > >  
> > > > > Now the problem is that the application may be terminated before
> > > > > it can complete its command submission. But since resource management
> > > > > only waits for the shared fences when there are some, there is a chance
> > > > > that we free up memory while it is still in use.
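
(For context, the move-fence trick Christian describes boils down to roughly
the sketch below. The helper name is made up and error handling is trimmed;
this paraphrases the v5.13-era TTM/dma-resv APIs rather than quoting
ttm_bo.c.)

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_resource.h>

/* Sketch only: remember the memory manager's last eviction/move fence
 * in a freshly allocated BO's reservation object, so that everybody
 * using the BO syncs to the preceding eviction. bo->base.resv is
 * assumed to be locked by the caller.
 */
static int add_move_fence_sketch(struct ttm_buffer_object *bo,
				 struct ttm_resource_manager *man)
{
	struct dma_fence *fence;
	int ret;

	spin_lock(&man->move_lock);
	fence = dma_fence_get(man->move);
	spin_unlock(&man->move_lock);

	if (!fence)
		return 0;

	/* Reserve a shared slot, then add the move fence as a shared
	 * fence so later eviction waits on it as well.
	 */
	ret = dma_resv_reserve_shared(bo->base.resv, 1);
	if (ret) {
		dma_fence_put(fence);
		return ret;
	}
	dma_resv_add_shared_fence(bo->base.resv, fence);
	dma_fence_put(fence);
	return 0;
}
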
> > > > Hm, where is this code? It would help with the audit I wanted to do this
> > > > week. If you look at most other places, like
> > > > drm_gem_fence_array_add_implicit() which I mentioned earlier, we don't
> > > > treat the shared fences specially and always also include the exclusive one.
> > > See amdgpu_gem_object_close():
> > > 
> > > ...
> > > fence = dma_resv_get_excl(bo->tbo.base.resv);
> > > if (fence) {
> > >         /* Re-add the exclusive fence to the shared slots, so that
> > >          * resource management which only looks at the shared
> > >          * fences waits for it too ('true' means shared).
> > >          */
> > >         amdgpu_bo_fence(bo, fence, true);
> > >         fence = NULL;
> > > }
> > > ...
> > > 
> > > We explicitly added that because resource management of some other
> > > driver was going totally bananas without that.
> > > 
> > > But I'm not sure which one that was. Maybe dig a bit in the git and
> > > mailing history of that.
> > Hm, I looked and it's
> > 
> > commit 82c416b13cb7d22b96ec0888b296a48dff8a09eb
> > Author: Christian König <christian.koenig@amd.com>
> > Date:   Thu Mar 12 12:03:34 2020 +0100
> > 
> >     drm/amdgpu: fix and cleanup amdgpu_gem_object_close v4
> > 
> > That sounded more like amdgpu itself needing this, not another driver?
> 
> No, that patch was just a follow-up moving the functionality around.

That patch added the "add exclusive fence to shared slots before
amdgpu_vm_clear_freed() call", which I thought was at least part of
your fix.

> > But looking at amdgpu_vm_bo_update_mapping() it seems to pick the
> > right fencing mode for gpu pte clearing, so I'm really not sure what
> > the bug was that you worked around here? The implementation boils down
> > to amdgpu_sync_resv(), which always syncs to the exclusive fence. And
> > there's nothing else that I could find in public history at least, no
> > references to bug reports or anything. I think you need to dig
> > internally, because as-is I'm not seeing the problem here.
> > 
> > Or am I missing something here?
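
(Concretely, the amdgpu_sync_resv() shape I mean is roughly the sketch
below; illustrative helper name, filtering elided, paraphrased rather than
quoted from amdgpu_sync.c.)

#include <linux/dma-resv.h>
#include "amdgpu_sync.h"

/* Rough shape only: the exclusive fence is always added as a
 * dependency; only the shared fences get filtered by owner/sync mode.
 */
static int sync_resv_sketch(struct amdgpu_sync *sync, struct dma_resv *resv)
{
	struct dma_resv_list *flist = dma_resv_get_list(resv);
	struct dma_fence *f = dma_resv_get_excl(resv);
	unsigned int i;
	int ret;

	/* Exclusive fence: synced to unconditionally. */
	ret = amdgpu_sync_fence(sync, f);
	if (ret)
		return ret;

	for (i = 0; flist && i < flist->shared_count; ++i) {
		f = rcu_dereference_protected(flist->shared[i],
					      dma_resv_held(resv));
		/* ... owner/AMDGPU_SYNC_* filtering decides whether each
		 * shared fence is actually added ... */
		ret = amdgpu_sync_fence(sync, f);
		if (ret)
			return ret;
	}
	return 0;
}
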
> 
> See the code here for example:
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/nouveau/nouveau_fence.c#L361
>  
> Nouveau assumes that when a shared fence is present it doesn't need to
> wait for the exclusive one, because the shared ones are always supposed
> to finish after the exclusive one.
> 
> But for page table unmap fences that isn't true, and we ran into a really
> nasty, hard-to-reproduce bug because of this.
> 
> I think it would be much more defensive if we could say that we always
> wait for the exclusive fence, fix the use case in nouveau, and double-check
> whether anybody else does stuff like that as well.
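
The check behind that link boils down to roughly this (paraphrased against
the v5.13-era dma-resv API with an illustrative helper name, not the exact
nouveau_fence.c source):

#include <linux/dma-resv.h>

/* Paraphrase of the decision around the linked nouveau_fence.c line:
 * with an exclusive sync requested, the exclusive fence is only waited
 * on when the shared list is empty, on the assumption that shared
 * fences always finish after the exclusive one.
 */
static bool waits_on_exclusive(struct dma_resv *resv, bool exclusive)
{
	struct dma_resv_list *fobj = dma_resv_get_list(resv);
	struct dma_fence *excl = dma_resv_get_excl(resv);

	/* The problematic assumption: any shared fence present means the
	 * exclusive fence can be skipped.
	 */
	return excl && (!exclusive || !fobj || !fobj->shared_count);
}

A page table unmap fence sitting in the shared slots violates exactly that
ordering assumption.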

Yeah most other drivers do the defensive thing here. I think it would
be good to standardize on that. I'll add that to my list and do more
auditing.
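
For reference, the defensive pattern, modeled on
drm_gem_fence_array_add_implicit(), looks roughly like this (paraphrased
sketch with an illustrative name and trimmed error handling, not the exact
drm_gem.c code):

#include <linux/slab.h>
#include <linux/xarray.h>
#include <linux/dma-resv.h>
#include <drm/drm_gem.h>

/* Sketch of defensive dependency collection (v5.13-era APIs): readers
 * wait for the exclusive fence; writers get the exclusive fence *plus*
 * all shared fences, because dma_resv_get_fences_rcu() with a NULL
 * exclusive pointer folds the exclusive fence into the returned array.
 * Nothing is ever skipped.
 */
static int implicit_deps_sketch(struct xarray *fence_array,
				struct drm_gem_object *obj, bool write)
{
	struct dma_fence **fences;
	unsigned int i, count;
	int ret;

	if (!write)
		/* Readers only need to wait for the last writer. */
		return drm_gem_fence_array_add(fence_array,
					       dma_resv_get_excl_rcu(obj->resv));

	ret = dma_resv_get_fences_rcu(obj->resv, NULL, &count, &fences);
	if (ret || !count)
		return ret;

	for (i = 0; i < count; i++)
		drm_gem_fence_array_add(fence_array, fences[i]);

	kfree(fences);	/* the fence references were consumed above */
	return 0;
}
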
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

