How to accelerate training with large dataset #395
Dear FLARE developers,

Thank you so much for the great development (and the user-friendliness) of the package. However, when the training set becomes larger, updating the SGP becomes slower, as expected from the growing covariance matrix. So I would appreciate your comments and answers to the following questions: …

Thanks in advance!
Harry
Hi @hhlim12,

Yes, we have a very experimental MPI version. It is best used when you already have a big dataset to feed in directly, rather than running active learning.

OpenMP parallelization is used automatically within flare++, so the speed has already been optimized in that sense; I don't think you can get faster there.

Yes, what we usually do is set up multiple trainings and collect data on different configurations, and then combine those data to train one final model. This is achieved by the offline training (see the build instructions and example below).
Hi @YuuuXie, thank you for the kind reply, I understand your answer. We already have the dataset and want to go straight for the final model. Can the MPI version be accessed through this repository? … About the offline training using …
This is what I used to compile it on my cluster; you might need to install the corresponding modules to enable the compilation:

```bash
module load cmake/3.17.3-fasrc01 python/3.6.3-fasrc01 gcc/9.3.0-fasrc01
module load intel-mkl/2017.2.174-fasrc01 openmpi/4.0.5-fasrc01

git clone https://github.com/mir-group/flare.git
cd flare
git checkout feature/yu/mpi
mkdir build
cd build
CC=mpicc CXX=mpic++ FC=mpif90 cmake .. -DSCALAPACK_LIB=NOTFOUND
make -j
```

Attached below is a Python script as an example usage.
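The attached example script was not captured in this thread. As a rough illustration only, a minimal offline-training loop with the sparse GP Python bindings might look like the sketch below; the import path, descriptor settings, and the `dataset` variable are assumptions based on the flare++ sparse GP tutorial and may differ on the MPI branch:

```python
import numpy as np
# Assumed import path (flare++ tutorial); may differ on feature/yu/mpi.
from flare_pp._C_flare import SparseGP, NormalizedDotProduct, B2, Structure

kernel = NormalizedDotProduct(2.0, 2)  # sigma, power
cutoff = 5.0
# B2 descriptor: radial basis, cutoff function, radial/cutoff hyperparameters,
# and [n_species, n_radial, l_max] (placeholder values, not from this thread).
b2 = B2("chebyshev", "quadratic", [0.0, cutoff], [], [2, 8, 3])

# Energy / force / stress noise hyperparameters (placeholders).
sparse_gp = SparseGP([kernel], 0.001, 0.05, 0.006)

# `dataset` is a hypothetical list of (cell, species_codes, positions,
# energy, forces) tuples, with species coded as integers 0..n_species-1.
for cell, species, positions, energy, forces in dataset:
    struc = Structure(cell, species, positions, cutoff, [b2])
    struc.energy = np.array([energy])
    struc.forces = forces.reshape(-1)
    sparse_gp.add_training_structure(struc)
    sparse_gp.add_all_environments(struc)

# Fit once after all frames have been added.
sparse_gp.update_matrices_QR()
```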
@YuuuXie, thank you very much for the kind response! I finally managed to install it on my cluster, though I needed to troubleshoot some MKL library problems.

Even though I can run the fpp_train.txt script without problems, no SGP model is dumped at the end. Do you know an additional script to output the model (e.g., lmp.flare or the json file)? Do you have a reference for that? Thank you in advance!

Edit: I appended the following in the … I also found that training with two elements works fine, but increasing the number of elements > 2 gives me this error: …
Hi,

Sorry for the late reply. There is supposed to be a GP model checkpoint dumped automatically as json during training. If you want to increase the dumping frequency, you can set the "write_model" parameter (e.g., to 4).

Best,
Yu
Hi, thank you for the kind response. I'm not sure where I should specify the "write_model" parameter in fpp_train.txt, but I have appended the following in the code and it does dump the model correctly:

```python
sgp_calc = SGP_Calculator(sparse_gp)
sgp_calc.build_map("lmp.flare", "my_name")
```

However, I still could not solve the training problem with more than two elements (I put the error message in the previous message).

Thank you for your kind attention,
Harry
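For readers following along: build_map writes the mapped coefficient file (here lmp.flare) that LAMMPS reads through flare's pair style, so this is the standard way to export a trained sparse GP for production MD. Note that the SGP_Calculator import path differs between versions (flare_pp.sgp_calculator on flare++, flare.bffs.sgp.calculator on recent flare master), so check the branch you compiled.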
Hi Harry, you could probably try increasing the amount of initial data, such as the number of frames and the number of sparse points. It could be that the model was initialized with an insufficient number of datapoints, so it does not distribute correctly, i.e. some MPI processes get an empty set.
Thanks to the developers for making such an excellent code. Could you consider providing an additional interface, like GAP's, that uses CUR matrix decomposition for sparsification and then fits the model in one go, rather than adding frame by frame? (I found that both OTF without uncertainty calculation and the offline methods in the tutorial were slow when fitting more than 4000 crystal structures.) Alternatively, it would be great if this offline-training MPI version could be improved and verified.
Hi @YuuuXie, thanks for the kind reply. I just wanted to confirm that the error was solved when I increased the initial data, as you suggested. Thanks again. Best, Harry
Hi @rbjiawen,

We're not actively working on new features, but note that you don't need to refit the model after each structure is added. Slightly modifying the static training example given in the tutorial, you can pull the fitting call out of the per-frame loop, so the model is only fit once after all structures have been added (see the sketch below).

You're also free to choose how many sparse points you want to include. The fewer you choose, the faster the fitting step will be, at the cost of model accuracy.

Hope that helps,
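A minimal sketch of that pattern, reusing the assumed bindings from the earlier example (in the flare++ API, `add_random_environments` takes a per-kernel list of counts; the value 50 is illustrative):

```python
# Add every frame first, capping the sparse set per frame, then fit once.
for struc in training_structures:
    sparse_gp.add_training_structure(struc)
    sparse_gp.add_random_environments(struc, [50])  # 50 sparse points per frame

# Single fit at the end, instead of refitting after every frame.
sparse_gp.update_matrices_QR()
```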
Thanks for your reply. Yes, I did it this way: after adding some atomic cluster environments, I update the covariance matrix, and I am randomly selecting sparse points to see the results.
Hi @jonpvandermause,

Every time the loop reaches more than about one hundred structures, calling sparse_gp.sparse_gp.update_matrices_QR() to update the covariance matrix or sparse_gp.train() to optimize the hyperparameters triggers a segmentation fault in the C++ code.

I tried different intervals for optimizing the hyperparameters and updating the covariance matrix, but the error persists. My train_structures should be fine: more than 4,000 structures are included, and I get good results with other codes, e.g. GAP and NNP.
Can you send me a complete script so I can try to reproduce? For example, I'm not sure how structure_with_B1_B2_descriptors is defined.

Two things you might try (see the sketch after this list):

1. Bump sparse_gp.Kuu_jitter to see if that stabilizes the likelihood.
2. Monitor your RAM usage while the program is running. The sparse GP object consumes a lot of memory and it's possible you're maxing out your machine's resources.
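For the first suggestion, `Kuu_jitter` is exposed as a plain attribute on the sparse GP object; the value below is illustrative, not a recommended default:

```python
# A larger jitter adds a stronger diagonal regularizer to the K_uu block,
# which can stabilize the matrix factorizations behind the likelihood.
sparse_gp.Kuu_jitter = 1e-6  # illustrative; tune for your system
```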
Hi, I am getting a "Segmentation fault" error using this test file and extxyz file. Extremely grateful.
I don't see the files. You can paste the entire test script into the comment box; I'll try to reproduce with my own structures.
The xyz file and the py script are attached.
The files are attached to the Google e-mail, thanks.
Hi @jonpvandermause, …
No. On the seg fault issue: can you give me a bit more information? Do you reliably hit the seg fault when a certain number of structures get added, or is it random? Do you also hit it when only one descriptor is being used? How much RAM does your machine have?

Thanks,
Thanks!
When I added all the environments and updated the covariance matrix every 100 steps without optimizing the hyperparameters, 2100 structures were added without any problems.

So the error does not occur when all environments are added; it only seems to occur when adding structures by uncertainty, even without optimizing the hyperparameters. The output error was as follows: …
Interesting, thanks. Just to confirm: you're running with the current master branch? If not, can you confirm that you still hit this on master?
I need to confirm this, thanks for the heads up!
@rbjiawen, when you get a chance, can you please email me your .extxyz file? It didn't attach above, probably because it's not one of GitHub's allowed file formats. It will be easier to reproduce the segfault with your exact system. My email address is at the top of the sparse GP tutorial.
Thanks again, I have sent it to jonpvandermause@gmail.com.
Thanks @rbjiawen. I've reproduced the segfault on my machine, and I believe I've root-caused the issue.

To safely make calls to add_uncertain_environments, you need to make sure that the internal matrices of the sparse GP are up to date, since these internal matrices (specifically the L_inv matrix) are needed to evaluate uncertainties. This is the exact line from sparse_gp.cpp that triggers the segfault:

```cpp
Eigen::MatrixXd L_inverse_block = L_inv.block(sparse_count, sparse_count, n_clusters, n_clusters);
```

When L_inv isn't up to date, you are attempting to access a part of the matrix that doesn't exist yet, which gives undefined behavior and occasionally results in a segfault.

If you call update_matrices_QR() after every call to add_uncertain_environments, you should avoid this. I will update the code to require L_inv to be up to date when add_uncertain_environments is called.

Nice find!
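A minimal sketch of the safe call pattern described above, again assuming the flare++ bindings (the per-kernel count of 20 is illustrative):

```python
# Keep L_inv current so the next uncertainty evaluation is well-defined.
for struc in candidate_structures:
    sparse_gp.add_training_structure(struc)
    sparse_gp.add_uncertain_environments(struc, [20])  # needs up-to-date L_inv
    sparse_gp.update_matrices_QR()  # refresh internal matrices before the next frame
```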
Thanks very much! In other words, every time I add the most uncertain atomic environments, I must update the covariance matrices, instead of only updating them before hyperparameter optimization?
Yes, that's right. And more generally, the covariance matrices must be current before making predictions. I've placed assertions in the code in #419 that throw runtime errors if this isn't the case, which should make this easier to catch going forward. I'm going to mark this as resolved; feel free to open another ticket if you encounter other issues.