Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to fix NULL pointer dereference #36

Merged
merged 2 commits into from
Jan 29, 2025
Merged

Conversation

joelsmithTT
Copy link
Contributor

An attempt to fix #34.

#34 says, "This presumably also affects pinned system memory" -- yes, I've run into this a couple of times.

I am allocating and mapping/pinning a large buffer with KMD and making it accessible to an L2CPU. If I forget to quit the userspace process that does this before removing the card from the PCI bus, I get a similar NULL pointer dereference when I do kill the process.

I am removing the card from the bus so I can power it off when I'm not using it, while leaving the host system online. (The particular card I am working with is in an eGPU dock with its own PSU). echo 1 | sudo tee /sys/bus/pci/devices/dddd:BB:DD:F/remove and echo 1 | sudo tee /sys/bus/pci/rescan is a convenient mechanism for this.

Add tracking of open file descriptors in the tenstorrent_device
structure using a list of chardev_private nodes.
@alewycky-tenstorrent
Copy link
Contributor

What happens to mmapped DMA buffers when dma_free_coherent is called? The mapping still exists, what's backing it?

@joelsmithTT
Copy link
Contributor Author

What happens to mmapped DMA buffers when dma_free_coherent is called? The mapping still exists, what's backing it?

It looks like mmap increases the refcounts of the pages, so the buffer survives.

chardev_private.h Outdated Show resolved Hide resolved
device.h Outdated Show resolved Hide resolved
enumerate.c Outdated
@@ -116,6 +119,13 @@ static int tenstorrent_pci_probe(struct pci_dev *dev, const struct pci_device_id
static void tenstorrent_pci_remove(struct pci_dev *dev)
{
struct tenstorrent_device *tt_dev = pci_get_drvdata(dev);
struct chardev_private *priv, *tmp;

mutex_lock(&tt_dev->chardev_mutex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tenstorrent_memory_cleanup also takes chardev_mutex but mutexes don't allow recursive locking. (https://docs.kernel.org/locking/mutex-design.html#semantics)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where this happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that was in uncommitted code. There's still a problem. tenstorrent_pci_remove calls tenstorrent_memory_cleanup (holding tt_dev-scoped mutex). This can race against:

  • close fd -> tt_cdev_release -> tenstorrent_memory_cleanup (no locking because it's called by the last owner)
  • ioctl_allocate_dma_buf (only holds chardev_priv->mutex)
  • ioctl_pin_pages & ioctl_unpin_pages (chardev_priv->mutex)
  • ioctl_map_peer_bar (own chardev_priv->mutex and peer's chardev_priv->mutex)

Probably we should just take the chardev_priv mutex in tenstorrent_memory_cleanup. So far no code holds both mutexes at the same time.

@joelsmithTT joelsmithTT force-pushed the joelsmith/34 branch 2 times, most recently from 4317634 to 2669790 Compare January 24, 2025 21:56
When a PCIe device is removed while file descriptors are still open,
ensure all DMA resources are cleaned up before the IOMMU mappings become
imvalid.  Part of the intent is to avoid a NULL pointer dereference at
fd close when the IOMMU is enabled.

See #34 for details.
@joelsmithTT joelsmithTT changed the title Draft: Attempt to fix NULL pointer dereference Attempt to fix NULL pointer dereference Jan 29, 2025
@alewycky-tenstorrent alewycky-tenstorrent merged commit b8c5e42 into main Jan 29, 2025
12 checks passed
@joelsmithTT joelsmithTT deleted the joelsmith/34 branch January 29, 2025 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Null pointer dereference on process exit after device removed
2 participants