-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt to fix NULL pointer dereference #36
Conversation
Add tracking of open file descriptors in the tenstorrent_device structure using a list of chardev_private nodes.
What happens to mmapped DMA buffers when dma_free_coherent is called? The mapping still exists, what's backing it? |
It looks like mmap increases the refcounts of the pages, so the buffer survives. |
enumerate.c
Outdated
@@ -116,6 +119,13 @@ static int tenstorrent_pci_probe(struct pci_dev *dev, const struct pci_device_id | |||
static void tenstorrent_pci_remove(struct pci_dev *dev) | |||
{ | |||
struct tenstorrent_device *tt_dev = pci_get_drvdata(dev); | |||
struct chardev_private *priv, *tmp; | |||
|
|||
mutex_lock(&tt_dev->chardev_mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tenstorrent_memory_cleanup also takes chardev_mutex but mutexes don't allow recursive locking. (https://docs.kernel.org/locking/mutex-design.html#semantics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where this happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that was in uncommitted code. There's still a problem. tenstorrent_pci_remove calls tenstorrent_memory_cleanup (holding tt_dev-scoped mutex). This can race against:
- close fd -> tt_cdev_release -> tenstorrent_memory_cleanup (no locking because it's called by the last owner)
- ioctl_allocate_dma_buf (only holds chardev_priv->mutex)
- ioctl_pin_pages & ioctl_unpin_pages (chardev_priv->mutex)
- ioctl_map_peer_bar (own chardev_priv->mutex and peer's chardev_priv->mutex)
Probably we should just take the chardev_priv mutex in tenstorrent_memory_cleanup. So far no code holds both mutexes at the same time.
4317634
to
2669790
Compare
When a PCIe device is removed while file descriptors are still open, ensure all DMA resources are cleaned up before the IOMMU mappings become imvalid. Part of the intent is to avoid a NULL pointer dereference at fd close when the IOMMU is enabled. See #34 for details.
2669790
to
c5469f5
Compare
An attempt to fix #34.
#34 says, "This presumably also affects pinned system memory" -- yes, I've run into this a couple of times.
I am allocating and mapping/pinning a large buffer with KMD and making it accessible to an L2CPU. If I forget to quit the userspace process that does this before removing the card from the PCI bus, I get a similar NULL pointer dereference when I do kill the process.
I am removing the card from the bus so I can power it off when I'm not using it, while leaving the host system online. (The particular card I am working with is in an eGPU dock with its own PSU).
echo 1 | sudo tee /sys/bus/pci/devices/dddd:BB:DD:F/remove
andecho 1 | sudo tee /sys/bus/pci/rescan
is a convenient mechanism for this.