topic extraction from 'Quick Start' taking forever #510
Comments
Typically, when a run takes longer than you would expect, it is due to the absence of a GPU or perhaps some dependency issue. Having said that, which GPU are you currently using? Also, could you share the versions of all dependencies in your environment?
Just realized this, but it might be worthwhile to set `verbose=True` so you can see which step is taking so long.
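For reference, a minimal sketch of enabling verbose logging in the quick start setup:

```python
from bertopic import BERTopic

# verbose=True makes BERTopic log the progress of each step, which helps
# pinpoint whether embedding, UMAP, or HDBSCAN is the bottleneck
topic_model = BERTopic(verbose=True)
```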
Hey @MaartenGr, sorry that I couldn't get back to you until now, as I was dealing with my finals. A brief recap: after updating my tokenizers version from 0.10.3 to 0.11.0, I no longer get the "Ignored unknown kwarg option direction" message, but it was still taking forever to run. So I tried again with verbose=True, like you suggested, and I could see that it was running but just taking too long. It took a little over 54 minutes to run your Quick Start tutorial, which I'm pretty sure is a lot longer than it should be. I have attached the screenshot above. I'm running it on a 13-inch MacBook Pro (M1, 2020), which I'm pretty sure has the integrated GPU. Regarding the dependencies, my environment includes the following packages (truncated): abseil-cpp 20210324.2 hbdafb3b_0 conda-forge … Thank you, and I look forward to hearing from you.
There are several things happening here that I believe might explain the issue you are experiencing.
Starting with the screenshot that you posted: it seems that you checked GPU availability with tensorflow, whilst sentence-transformers relies primarily on pytorch. I should have been clearer about that, my apologies! Could you instead follow along with the code here and share the results?
Based on the output, it seems that both UMAP and HDBSCAN are relatively fast and only take one or two minutes to compute. Moreover, from the verbose output it seems that creating the embeddings is what takes up most of the time.
It could be that although tensorflow recognizes the GPU, the same might not be said for pytorch. Another reason for the long computation time is that although your laptop has an integrated GPU, it might not be a fast one. The existence of a GPU definitely helps, but it also depends on the GPU itself. Although I am not entirely certain yet, my guess would be either pytorch not recognizing the GPU on an M1 chip or the GPU being quite slow. However, I have little familiarity with MacBooks...
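The snippet linked above is not reproduced in this thread, but a typical PyTorch device check looks something like this:

```python
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # True only for NVIDIA/CUDA GPUs

# The Apple Silicon (MPS) backend only exists in PyTorch 1.12+;
# on older versions this attribute does not exist at all
if hasattr(torch.backends, "mps"):
    print(torch.backends.mps.is_available())
```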
Hi @MaartenGr, I followed the code for the GPU test you suggested; above are the results. It looks like PyTorch indeed isn't detecting my GPU. I also tried the GPU test with tensorflow using 'tf.test.gpu_device_name()' and I get an error that says 'Could not identify NUMA node...', which is shown at the top of the screenshot. I'm not exactly sure how to interpret it correctly (I tried Stack Overflow but couldn't really get a solid answer). So basically, it looks like when I ran your 'Quick Start' tutorial, it ran on the CPU. Any thoughts on how I can tell Jupyter to run on the GPU and not the CPU? I bought this MacBook thinking that the M1 chip would help run ML experiments a lot faster, but it has been nothing but a headache.
I think this is related to the M1 chip that you are using, which doesn't natively support CUDA. From what I could gather here, it seems that they are currently working hard on GPU acceleration for the M1 chips, but there isn't a concrete timeline yet for when it can be published. Unfortunately, that seems to mean it is not possible at this moment and you will have to wait until they add support. Instead, it might be interesting to use USE (Universal Sentence Encoder) instead of sentence-transformers, as it uses tensorflow and, from what I remember, is quite fast.
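A minimal sketch of plugging USE into BERTopic, assuming the tensorflow and tensorflow_hub packages are installed:

```python
import tensorflow_hub
from bertopic import BERTopic

# Load the Universal Sentence Encoder from TensorFlow Hub
embedding_model = tensorflow_hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder/4"
)

# BERTopic accepts the loaded model directly as its embedding backend
topic_model = BERTopic(embedding_model=embedding_model)
```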
Hi @MaartenGr, I tried USE like you suggested. Please take a look at the screenshot above. As you can see, this one only took approximately 11.5 minutes, which is roughly five times faster than running BERTopic using PyTorch. However, I'm not quite sure if my tensorflow is running on the M1 GPU either, since I feel it shouldn't take 11 minutes. In particular, I want to refer to the warning in the screenshot above. Have you seen this warning before? I'm not quite sure how to interpret it, to be honest, but it looks to me like tensorflow is attempting to run (or is running) on the CPU. Would you be able to share how long it took you to run your 'Quick Start' code above? Thank you, and I look forward to hearing from you.
Also, another quick follow-up: I did the tensorflow GPU access test and, as you can see in the screenshot, it suggests that tensorflow is able to access the GPU. However, when I tried it, I got the following error (shown at the bottom of the screenshot above). So, I'm not sure if it means my tensorflow can access the GPU but couldn't actually run on it(?)
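For reference, the usual tensorflow device checks look something like this (not necessarily the exact code used here):

```python
import tensorflow as tf

# Lists the physical GPU devices tensorflow can see
print(tf.config.list_physical_devices("GPU"))

# Returns the name of a GPU device, or an empty string if none is usable
print(tf.test.gpu_device_name())
```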
Unfortunately, I do not have experience with tensorflow on M1 chips, so I am not sure how much help I can be when debugging this. It does seem that you are not the only one experiencing this issue, although no clear solutions seem to be given.
I just ran your code in a Kaggle notebook and, for me, it can create the embeddings in only one minute, which is quite a bit faster than your case. I should note that because it has different hardware, differences are to be expected. It might be helpful to disable the GPU in order to see if the bottleneck can somehow be found there.
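One way to make that comparison explicit is to force the embedding model onto the CPU; a minimal sketch (the model name below is BERTopic's default English model, used here as an assumption):

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Explicitly run the embedding model on CPU to time the worst case
embedding_model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)
```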
I'm having the same performance problem on a MacBook Pro (16-inch, 2021) with an Apple M1 Pro running macOS Monterey 12.3.1. I'll probably spin up a virtual machine in the cloud until the M1 chip supports GPU acceleration.
It seems that quite recently PyTorch started supporting M1 chips and now allows for GPU acceleration. You can find a bit more about that here. Do note that I have not tested this newest version with BERTopic yet, so I cannot be sure it works with the current set of dependencies. Having said that, with a bit of luck, this might solve the issues M1 users are having.
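For those who want to try it, selecting Apple's new GPU backend looks roughly like this (requires a PyTorch build with MPS support, i.e. 1.12+ or a nightly at the time):

```python
import torch
from sentence_transformers import SentenceTransformer

# Prefer Apple's Metal Performance Shaders (MPS) backend when available
device = "mps" if torch.backends.mps.is_available() else "cpu"
embedding_model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
```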
Thanks for the tip, @MaartenGr!
Can I somehow "force" BERTopic to use the latest preview/nightly build of PyTorch to try it out? Edit: I suppose I would have to clone the BERTopic repository, change the PyTorch dependency to the latest preview (nightly) build, then install my own fork of BERTopic in my Python project? That's what I'm trying to do now. I found the dependency on PyTorch, but I'm not sure how to change it to include the preview (nightly) build. I've installed the preview (nightly) build locally, but since I'm using virtual environments and Poetry to manage my Python project, the version of PyTorch which comes bundled with BERTopic seems to take precedence over the system-wide installation (as expected).
@leifericf Since you are using Poetry, I believe you will have to do some manual installation, as I am not sure whether supplying a link during the pip install is possible within Poetry. If you are okay with a manual pip install, then I would advise installing BERTopic first and after that installing the nightly pytorch as follows:
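The exact command was not captured in this thread; at the time, the nightly (preview) build for macOS could be installed with something like this (the index URL is an assumption based on the official PyTorch instructions of that period):

```bash
pip install --pre torch torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```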
That will override the pytorch installed earlier with BERTopic and replace it with the nightly build.
Thank you for the advice, @MaartenGr! I think you're right, and I got stuck trying to install the pre-release version of PyTorch via Poetry. I opened this question on Stack Overflow to get help with that. I think it's possible somehow, but I'm not skilled enough with the more advanced options of Poetry. I'll update this issue if I figure out how to do it.
Update: I managed to successfully install the latest preview (nightly) build of PyTorch and the newest version of BERTopic after ditching Poetry in favor of using conda and pip directly. I installed PyTorch first, then BERTopic.
When installing BERTopic, I encountered two other issues. The first issue was with one of its dependencies; see this issue for more detailed information. The second issue was a bit more tricky: I had to download the newest source code (from here) and build it manually.
See this issue for more detailed information. Here is a gist containing an export of my conda environment after all of that was fixed. But now I'm getting a somewhat cryptic error when I try to use BERTopic:
That error is unrelated to "quick start taking forever," so I will create a separate bug report for it. I hope this information can help fellow Apple M1 Pro users get started with GPU-accelerated BERTopic.
I have opened the issue mentioned above here.
Scratch my last comment. I've been struggling all day, and there is some flavor of circular dependency hell between pytorch, hdbscan, and blas because of other packages I'm using. At this point, I've tried to recreate my conda environment probably 30+ times, using conda and pip in different orders and manually editing my environment files. I have been able to fix each of the different issues independently, but every time I fix one problem, it causes another. I'm not smart enough to figure this out, so I'm just going to give up. I might try again after the most recent version of BERTopic is available via conda-forge (instead of pip) and the version of PyTorch with support for Apple M1 Pro is stable. I will try to use Gensim for topic modeling instead for the time being.
That is a shame to hear! Hopefully, other users will figure out a way to approach it, as I unfortunately do not have a MacBook to test on. At the very least, thank you for going into that much depth trying to figure out a solution.
No problem, and thank you for sharing this project and taking the time to help. I should emphasize that the issues I'm experiencing are not the fault of BERTopic. My particular project seems to have irreconcilable dependencies when using conda and pip on an M1 Mac, and certain low-level dependencies are missing wheels, won't compile, etc. I still want to use BERTopic, and I'm happy to help with testing and debugging, but I might need some guidance to be helpful. Please let me know if and how I can help.
Good news! I got BERTopic working after the latest version was added to conda-forge. To avoid dependency conflicts, I had to create a new (empty) conda environment, install BERTopic, and then install the PyTorch preview (nightly) build, in that specific order. This is exactly what I did:
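The exact commands were not captured here; based on the description, the sequence was presumably along these lines (the environment name and Python version are assumptions):

```bash
# 1. Start from a clean environment to avoid dependency conflicts
conda create -n bertopic-env python=3.9
conda activate bertopic-env

# 2. Install BERTopic from conda-forge first
conda install -c conda-forge bertopic

# 3. Then install the PyTorch preview (nightly) build on top
pip install --pre torch torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```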
And here is a gist showing the resulting conda environment. Attempting to install BERTopic and PyTorch in my existing conda environment (which includes a lot of other packages for my project) resulted in an insane amount of dependency conflicts; I had to start with a clean environment. That said, I'm unsure whether I'm utilizing Apple's GPU. I don't know how to tell, except by looking at the verbose output of BERTopic (the batching progress shown above) to see if it's faster than before. And I can see that the CPU/GPU usage is quite high and constant:
Glad to hear that the installation went well! The difference in speed when encoding the documents to embeddings is where you will see the biggest difference between using a CPU and a GPU. If you find that there is a big speedup, you will indeed know that it has worked. Thanks for sharing the steps; that will definitely help out others experiencing this issue!
@leifericf this was super helpful.
Hi Maarten,
I've been following your GitHub. I installed BERTopic using conda. Then I tried to replicate your Quick Start to see if it's working as expected:
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```
Then, at first I was getting the following (which goes on forever):
```
runfile('/Users/nayza/Desktop/YTproject/AAHSA/addictionStudy_2.py', wdir='/Users/nayza/Desktop/YTproject/AAHSA')
Ignored unknown kwarg option direction
Ignored unknown kwarg option direction
Ignored unknown kwarg option direction
... (repeats indefinitely)
Traceback (most recent call last):
```
But then I suspected that I needed to update my tokenizers package, so I updated it from version 0.10.3 to 0.11.0. Now it no longer shows the 'Ignored unknown...' output, but it's taking forever to run. Plus, my Mac started to get really loud as well.
Do you have an idea of what the issue might be here?