'Re: [PATCH v5.10.y] crypto: caam/jr - Fix possible caam_jr crash'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       linux-crypto-vger
Subject:    Re: [PATCH v5.10.y] crypto: caam/jr - Fix possible caam_jr crash
From:       Horia Geanta <horia.geanta () nxp ! com>
Date:       2024-01-31 13:34:17
Message-ID: 5be20cd5-8758-452c-81a9-6452df7aedeb () nxp ! com
[Download RAW message or body]

On 12/21/2023 11:32 AM, Kun Song wrote:
> Test environment:
> Linux kernel version: 5.10.y
> Architecture: ARM Cortex-A
> Processor: NXP Layerscape LS1028
> 
> Crash in reboot tests:
> Reproducibility: 1%
> 
Replying here to comment on the log.
I've added all the people from the latest reply, i.e.:
https://lore.kernel.org/linux-crypto/AM0PR04MB6004AECDD044F1E6BF3B6732E7682@AM0PR04MB6004.eurprd04.prod.outlook.com


> If a job ring is still allocated, Once caam_jr_remove() returned,
> jrpriv will be freed and the registers will get unmapped.Then
In this case, most likely the root cause is different (see below).

> caam_jr_interrupt will get error irqstate value.
> So such a job ring will probably crash.Crash info is below:
> --------------------------------------
> RBS Sys: Restart ordered by epghd(0x1)
> RBS Sys: RESTARTING
This looks like a system restart.

> caam_jr 8030000.jr: Device is busy
> caam_jr 8020000.jr: Device is busy
> caam_jr 8010000.jr: Device is busy
For some reason, there are still tfms accounted for on all these caam job rings.
Maybe the system restart is not handled correctly at some point,
hence some resource leaks (unallocated tfms).

As already discussed, exiting early from caam_jr .remove() callback will leave
the HW unquiesced (e.g. job rings not flushed), interrupts still active.

> arm-smmu 5000000.iommu: disabling translation
From here onward caam memory accesses won't be translated.

> caam_jr 8010000.jr: job ring error: irqstate: 00000103
The error code means caam was not able to write the status of the job
it just finished in the output job ring (which is allocated in memory).

Most likely this happened due to iommu no longer translating the access.

> 
> Disabling interrupts is to ensure that the device removal
> operation is not interrupted.
> 
> Signed-off-by: Kun Song <Kun.Song@windriver.com>
> Reviewed-by: Hen Guo <Heng.Guo@windriver.com>
> ---
> drivers/crypto/caam/jr.c | 4 ++++
> 1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c
> index 6f669966ba2c..d191e8caa1ad 100644
> --- a/drivers/crypto/caam/jr.c
> +++ b/drivers/crypto/caam/jr.c
> @@ -135,6 +135,10 @@ static int caam_jr_remove(struct platform_device *pdev)
> 	jrdev = &pdev->dev;
> 	jrpriv = dev_get_drvdata(jrdev);
> 
> +	/* Disabling interrupts is ensure that the device removal operation
> +	 * is not interrupted by interrupts.
> +	 */
> +	devm_free_irq(jrdev, jrpriv->irq, jrdev);
> 	if (jrpriv->hwrng)
> 		caam_rng_exit(jrdev->parent);
> 
As pointed out by previous discussions, this is not enough.
Crashes could still occur due to crypto API users calling into caam driver,
which would be in an inconsistent state.

Whether .remove() being called is triggered by a system restart or
a manual device unbinding [1] is irrelevant, the driver mustn't crash.

I think Herbert's suggestion [2] on how to deal with HW devices going away
makes sense.

Not sure if the changes (thinking of crypto API part) could go into LTS kernels.
If not, we'll have to stick to caam driver-only changes (but IIUC
other crypto drivers are having the same issue with HW devices going away [3]).

Thanks,
Horia

[1] https://lore.kernel.org/linux-crypto/VI1PR04MB7023A7EC91599A537CB6A487EEB30@VI1PR04MB7023.eurprd04.prod.outlook.com/
 [2] https://lore.kernel.org/linux-crypto/20190919134512.GA29320@gondor.apana.org.au/
[3] https://lore.kernel.org/linux-crypto/ZAqwTqw3lR+dnImO@gondor.apana.org.au/


[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic