Improve cvector-generator
#8724
I've looked into the code and related material and it seems like this PCA idea is broken.
The hardest part is the last part, so in #6880, @jukofyork suggested a simple algorithm that can calculate the largest eigenvector (which is easy to do efficiently compared to trying to compute all eigenvectors at once), which would give the first principal component. However, even if PCA were implemented correctly, it might still not work, because you're not supposed to take the PCA of the difference between the positive and negative vectors.

If we look at the paper that #5970 referenced, in "Constructing the PCA Model" they specify "we do not use any labels by default unless explicitly stated otherwise, making the procedure unsupervised" - this is an unsupervised method, meaning it takes random unlabeled data (not positive and negative examples) and finds features on its own. Another quote from the paper: "PCA - We take an unlabelled dataset D that primarily varies in the concept of interest. We take the top PCA direction that explains the maximum variance in the data." Some of the wording in the papers seems to imply that running PCA on the differences would make sense, but if you think about how PCA is supposed to work, you would expect that to fail: the first step centers the data, which subtracts the average difference and obliterates the main direction of difference between the positive and negative examples, leaving PCA to find some other unrelated feature in what is left over. Random pairing would not have this issue, since random differences would not average to anything meaningful and therefore keep the negative-positive feature intact. Randomizing the order/direction of the pairing (e.g. positive-negative vs negative-positive) would also alleviate this issue, which I suspect is why the paper reports good performance of PCA on paired differences, though there might also be some other reason that I'm missing. But even in the paper, PCA is used mainly as an unsupervised method, while the mean difference is used as the supervised method.

The blog post referenced in #5970 seems to be confused by this and suggests taking the PCA of differences. If you look at the GitHub issues for the control vector library used in the blog, vgel/repeng, people report it not working for some models other than the one the author used (issue 33). But it does appear to work in the blog post and some other related discussion, so maybe there's something about PCA that counterintuitively works here despite what would be expected.

Your implementation also seems like it could accidentally work in some situations, since you are multiplying a matrix of differences by its transpose, which is somewhat vaguely similar to how correlation is calculated, though differences to the mean are used there rather than paired differences. For examples that vary significantly in some direction from the mean, properly paired differences would be roughly in the same direction as the difference from the mean, so your matrix would accidentally compute something PCA-like, which could work, sometimes, but only when the direction of the largest difference is aligned with the desired negative-positive difference. Indeed, your examples didn't work as expected for me when I tested them: the control vector produced sad/dramatic outputs for both positive and negative scale values. The model I was using was not Llama, however, so maybe it really does work on Llama for some reason.
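To make the centering argument concrete (my own notation, not from the paper): write each paired difference as

$$d_i = p_i - n_i \approx \mu + \varepsilon_i ,$$

where $\mu$ is the shared positive-negative direction and $\varepsilon_i$ is per-pair noise. PCA centers the data first, so it actually operates on $d_i - \bar{d} \approx \varepsilon_i$, and its top component picks up whatever direction the noise varies in most, not $\mu$. With random pairing or random sign flips $s_i \in \{\pm 1\}$ the differences become $d_i = s_i \mu + \varepsilon_i$ with $\bar{d} \approx 0$, so centering changes little and the top component recovers $\pm \mu$ (up to the usual sign ambiguity).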
TL;DR PCA is not supposed to be used for paired negative-positive examples, and the implementation is broken anyway. |
Finding the eigenvectors of a data matrix multiplied by its transpose is a valid way of computing PCA so long as you mean-center the data matrix first (if you use SVD on the data matrix and then take the right singular vectors then you don't need to mean-centre, though). I think the main problem with the current code is that it's missing the final step where you project the data onto the vector to decide if the sign of the vector needs flipping - without this step the control vectors are very broken. @ngxson said in a different thread he was busy with his job atm but going to look at this soon. If he doesn't then I will have a look at fixing it - I'm just not that familiar with the ggml code.

Computing the first PCA component only is actually really easy and doesn't really need the covariance matrix at all: use gradient descent to train an autoencoder with a single scalar using a least-squares loss (aka Oja's rule). You can extend this idea to get the first few PCA components too (by repeatedly subtracting the found component and rerunning), but as the eigenvalues get smaller and smaller it starts to suffer from numerical problems. |
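For what it's worth, here is a minimal CPU-only sketch of computing just the first component without ever forming the covariance matrix (plain C++, my own code and variable names, not pca.hpp; `a` is assumed to be a mean-centered rows x cols matrix stored row-major):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Power iteration on A^T A without materialising the covariance matrix:
// repeatedly compute v <- normalize(A^T (A v)) until the direction stops moving.
static std::vector<float> top_principal_component(const std::vector<float> & a,
                                                  size_t rows, size_t cols,
                                                  int max_iters = 200, float tol = 1e-6f) {
    std::vector<float> v(cols, 1.0f / std::sqrt((float) cols)); // arbitrary initial direction
    std::vector<float> av(rows), w(cols);
    for (int it = 0; it < max_iters; ++it) {
        // av = A v
        for (size_t i = 0; i < rows; ++i) {
            float s = 0.0f;
            for (size_t j = 0; j < cols; ++j) s += a[i*cols + j] * v[j];
            av[i] = s;
        }
        // w = A^T av
        std::fill(w.begin(), w.end(), 0.0f);
        for (size_t i = 0; i < rows; ++i)
            for (size_t j = 0; j < cols; ++j) w[j] += a[i*cols + j] * av[i];
        // normalise w and measure how far the direction moved
        float norm = 0.0f;
        for (float x : w) norm += x*x;
        norm = std::sqrt(norm);
        float moved = 0.0f;
        for (size_t j = 0; j < cols; ++j) {
            float nj = w[j] / norm;
            moved += (nj - v[j]) * (nj - v[j]);
            v[j]   = nj;
        }
        if (std::sqrt(moved) < tol) break;
    }
    return v; // sign is arbitrary: project the data onto v afterwards to fix it
}
```

The returned direction still has an arbitrary sign, so you would project the data onto it afterwards to decide whether to flip it, as described above.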
I've actually found what I think is a better way to do this though, via the cross-covariance matrix: https://github.com/jukofyork/control-vectors

The basic idea is to generate 3 samples for each prompt: baseline, negative and positive. Then subtract the positive and the negative from the baseline (instead of mean centering). Then finally find the eigenvectors of the cross-covariance matrix of these two matrices (and use either the discriminant ratio or variance reduction to find the eigenvector you want, plus some other stuff - see the code there). The reason this seems to work better is that it specifically looks for directions that are "balanced" in terms of their vector norms from the baseline. The cross-covariance matrix can only have a large value if both are large (and is maximised when both are equidistant from the baseline), whereas the covariance matrix can have a large value even if one is almost the same as the baseline. |
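As I read that description (my notation; see the linked repo for the actual details), with $m$ prompts and hidden dimension $n$ the construction is roughly:

$$D^{+} = H^{+} - H^{0}, \qquad D^{-} = H^{-} - H^{0}, \qquad C = \tfrac{1}{m}\,(D^{+})^{\top} D^{-},$$

where $H^{0}, H^{+}, H^{-}$ are the $m \times n$ matrices of baseline, positive and negative hidden states. An entry of $C$ can only be large when the corresponding positive and negative offsets from the baseline are both large (and it is maximised when they are equidistant from it), which is the "balanced" property described above; the eigenvector is then selected by the discriminant-ratio / variance-reduction criteria in the linked code.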
I mean, I did mention that in my comment above: the difference to the mean being the same as centered data (instead of correlation I should have said covariance). |
sounds like a good idea |
I've tried many different methods over the last month or so, and I think what you mean is: concatenate both datasets and find the principal eigenvector? If so, then this doesn't always work, because often the principal (and even the second) eigenvector will actually not be the direction that forms an axis between the two things you want, e.g. in terms of creative writing, trying to use "positive" and "dark" examples often ends up with "story writing" as the first principal component. You can do some post-processing, like in the code I linked above, to find which direction is the correct one, but it is still inferior to the cross-covariance matrix method in my experiments. |
There are a lot of my thoughts in this thread: https://huggingface.co/jukofyork/creative-writing-control-vectors-BETA-v0.1/discussions/2 and the different iterations of things I've tried are linked there as well.

The latest control vectors use soft-thresholding as a form of L1-regularisation (via the scale factor in the linked code).

The biggest limitation of the method is only being able to affect a single "writing trait" at a time - vector addition somewhat works but is very finicky, and blended training data ends up with much less discrimination between the classes...

The transformers code won't actually let me get at the inputs to the layer I need. It may turn out you still need the control vectors though (to act as a bias / affine transformation), judging from experiments with Householder transformations. |
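(For reference, by soft-thresholding I take this to mean the usual operator used for L1-regularisation, applied entry-wise to the direction:)

$$S_{\lambda}(x) = \operatorname{sign}(x)\,\max(|x| - \lambda,\ 0),$$

which zeroes entries with magnitude below the threshold $\lambda$ and shrinks the rest towards zero.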
@mhnghfv Hi, and thanks for having a look into the code. To give you a bit of context, the PCA implementation was initially proposed by @christianazinn . I'm not particularly good at math (since my background is software engineering), so to be honest, it would have been impossible for me to do this without invaluable help from him and @jukofyork . However, the implementation is not guaranteed to be exactly correct, so we currently depend on help from contributors. It would be very helpful if you could create a PR to showcase your approach.

A quick update for the current issue: I started some initial work in this branch, but it turns out to be far more complicated than I thought. It's true that simply subtracting negative - positive pairs is quite naive, but doing it another way will require a refactor. As I'm currently busy with other parts of the project, I can't be confident that it will be finished very soon. |
If you can get the code to the stage where the data matrices are just stored as plain C++ vectors or a straightforward chunk of C floats, then I can quite quickly get the rest of the code working just using plain C++ vectors (on the CPU), and then look at porting this to use the ggml API. I just can't spend the time needed to learn all the ins and outs of the ggml library. If I could just get a couple of examples to work from... |
Even just some very barebones code to set up the callbacks and pull the floats out of the ggml tensors would be a big help. |
Yeah, I thought I might be able to fix the sign problem easily when ChuckMcSneed mentioned it in the other thread, but with the matrices pre-differenced I couldn't see an easy way either :/ I think the pre-differenced matrix is the main blocker atm. |
IIRC this is actually what @christianazinn did in the first version of the implementation. It works well for prototyping because it doesn't require playing with ggml, so it's also a good practice. Once everything works and you want to have proper GPU support, I can help to convert the implementation to ggml. |
Oh, I'll have a look tomorrow at the commit history then! |
Just posting these here so I don't lose them:

Simple Algorithms for the Partial Singular Value Decomposition

The first "Power Method" algorithm looks fairly easy to adapt pca.hpp to, so it could calculate a truncated SVD. It should be a lot more efficient for rectangular matrices. Sadly it won't allow for calculation of the Moore-Penrose pseudo-inverse via SVD, as the error is proportional to the reciprocal of the smallest singular values. But this paper outlines an algorithm that looks to be reasonably easy to implement:

An iterative method to compute Moore-Penrose inverse based on gradient maximal convergence rate |
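A rough CPU-only sketch of that power method for the top singular triplet (plain C++, my own names, not an implementation of the paper's exact algorithm; further triplets would be obtained by deflating $A \leftarrow A - \sigma u v^{\top}$ and repeating):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Alternating power iteration for the largest singular triplet (u, sigma, v) of a
// row-major (rows x cols) matrix A: v <- normalize(A^T u), u <- A v / ||A v||, sigma = ||A v||.
static void top_singular_triplet(const std::vector<float> & a, size_t rows, size_t cols,
                                 std::vector<float> & u, std::vector<float> & v, float & sigma,
                                 int iters = 100) {
    u.assign(rows, 1.0f / std::sqrt((float) rows));
    v.assign(cols, 0.0f);
    sigma = 0.0f;
    for (int it = 0; it < iters; ++it) {
        // v = A^T u, normalised
        std::fill(v.begin(), v.end(), 0.0f);
        for (size_t i = 0; i < rows; ++i)
            for (size_t j = 0; j < cols; ++j) v[j] += a[i*cols + j] * u[i];
        float vn = 0.0f;
        for (float x : v) vn += x*x;
        vn = std::sqrt(vn);
        for (float & x : v) x /= vn;
        // u = A v / ||A v||, with sigma = ||A v||
        float un = 0.0f;
        for (size_t i = 0; i < rows; ++i) {
            float s = 0.0f;
            for (size_t j = 0; j < cols; ++j) s += a[i*cols + j] * v[j];
            u[i] = s;
            un  += s*s;
        }
        sigma = std::sqrt(un);
        for (float & x : u) x /= sigma;
    }
}
```

This only ever needs matrix-vector products with $A$ and $A^{\top}$, which is what makes it much cheaper than a full SVD for tall rectangular matrices.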
I want to make the control vectors allow for low-rank transformations as well as just offsets/directions first, but after that I am going to see if I can fix up the current cvector-generator code (and possibly implement the algorithms above, which might also be useful for #8831). I have a couple of questions about the current pca.hpp implementation though. What I don't understand is: is there a reason for building the graph in a loop over `params.n_batch` like this?

```cpp
for (int i = 0; i < params.n_batch; ++i) {
    // b_tensor = square * eigenvector^T
    b_tensor = ggml_mul_mat(ctx0, input_square, old_eigen);
    ggml_set_name(b_tensor, "b_tensor");
    // normalize
    b_tensor = ggml_div_inplace(ctx0,
        b_tensor,
        ggml_sqrt_inplace(ctx0, ggml_sum_rows(ctx0, ggml_sqr(ctx0, b_tensor)))
    );
    ggml_format_name(b_tensor, "b_tensor_norm_%d", i);
    // calculate distance(new eigenvector - old eigenvector)
    // we don't use ggml_sub because it may not be implemented on GPU backend
    struct ggml_tensor * new_sub_old = ggml_add(ctx0, old_eigen, ggml_scale(ctx0, b_tensor, -1));
    distance = ggml_sqrt_inplace(ctx0,
        ggml_sum_rows(ctx0, ggml_sqr_inplace(ctx0, new_sub_old)));
    ggml_format_name(distance, "distance_%d", i);
    old_eigen = b_tensor;
    // build operations nodes
    ggml_build_forward_expand(gf, distance);
}
```

This looks more like an unrolling of the power-iteration loop. Is this related to the above and some way of lessening the extra overhead of the repeated creation of the compute graph? |
Also related to the "make the control vectors allow for low-rank transformations as well as just offsets/directions" changes: what's the standard way of passing complex objects into the C API? Passing everything as a flat float buffer seems awkward.

My GGUF saving code (that is based off the repeng/extract.py source) also adds these two parameters to the control vector files it creates:

```python
writer.add_string(f"{arch}.model_hint", self.model_type)
writer.add_uint32(f"{arch}.layer_count", len(self.directions))
```

where the model hint is the source model's architecture and the layer count is the number of direction tensors. Adding a sanity check on these would likely make the C++ --> C conversion of the data structures much less ugly than the current `llama_control_vector_apply` interface. AFAIK, nobody else is making control vectors, and I think I can add the low-rank transformations I plan to add without breaking any backwards compatibility, but adding this sanity check may break control vectors that were created without these fields. Is this an OK breaking change to make? |
In addition to that, we also actually build the cgraph to calculate in batches. This is controlled by `params.n_batch`. The advantage is that we can make the GPU do more in one go, but the disadvantage is higher memory usage. |
Yes, you can make a PR for that. I agree that not many users are actively using control vectors atm, so breaking changes are OK.
The header file is plain C because many bindings (in other languages) rely on it. It's better not to have an overly complicated struct. Instead, the "C way" is to break it into smaller structs and use multiple function calls to retrieve the data. I don't know how to explain it better, but that's the usual way to do it when working with low-level programming. |
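A generic illustration of that pattern (hypothetical names, not the actual llama.h API):

```cpp
#include <cstdint>
#include <vector>

// The C header would only contain a forward declaration of the struct plus the
// accessor functions below; the C++ internals stay hidden in the implementation file.
struct my_cvector {
    std::vector<std::vector<float>> directions; // one direction per layer (hypothetical layout)
};

extern "C" int32_t my_cvector_n_layers(const my_cvector * cv) {
    return (int32_t) cv->directions.size();
}

extern "C" const float * my_cvector_layer(const my_cvector * cv, int32_t il) {
    return cv->directions[il].data(); // caller retrieves one small piece per call
}
```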
Thanks - the graphs are really helpful! :) Won't this just block on the matrix-vector product that creates the `b_tensor`, though? |
The speed gain is not from computational complexity, but from the time saved transferring data between the CPU and GPU. If only one iteration is executed by the GPU at a time, then data needs to be copied CPU <==> GPU very frequently. That's why, when working with a GPU, it's always better to tell it to process in batches (this is true both when using the GPU for ML and for 3D graphics).
Yeah, it's this function that is going to get incredibly ugly:

```c
// Apply a loaded control vector to a llama_context, or if data is NULL, clear
// the currently loaded vector.
// n_embd should be the size of a single layer's control, and data should point
// to an n_embd x n_layers buffer starting from layer 1.
// il_start and il_end are the layer range the vector should apply to (both inclusive)
// See llama_control_vector_load in common to load a control vector.
LLAMA_API int32_t llama_control_vector_apply(
        struct llama_context * lctx,
        const float * data,
        size_t len,
        int32_t n_embd,
        int32_t il_start,
        int32_t il_end);
```

It's actually already padding the data buffer as it is. In C++ I would just pass a proper struct or std::vector instead. |
Yeah, it does seem an amazingly complex way of doing something so simple though! :) Using Python or a C++ linear algebra library I could implement all of that in 5-10 lines, and even using GSL or MKL it would only be 30-40 lines of C! I think I may just go back to my idea of sampling the hidden states in... |
IMO you can use an "internal" struct for the control vector and expose functions to create/manipulate it. For example:

```cpp
// internal cpp struct, no need to define it in the header file
struct llama_control_vector;
struct llama_control_vector * llama_control_vector_init(const char * file_path);
int32_t llama_control_vector_apply(struct llama_context * ctx, struct llama_control_vector * cvec, int il, float scale);
// example usage
struct llama_control_vector * happy = llama_control_vector_init("...");
for (int il = 0; il < llama_n_layer(model); il++) {
    // only apply if il > 5 and < 10
    llama_control_vector_apply(ctx, happy, il, (il > 5 && il < 10) ? 1.0 : 0.0);
}
```

(Have a look at the lora functions, they follow the same idea.) |
If it is OK to make the breaking change of reading the model_hint / layer_count metadata, I'll try and get a PR working later today to show what I mean. |
Yeah, in fact it's because GPU support in ggml was added at a later stage, so the new GPU-capable API was built with "backward compatibility" in mind. Btw, if you want to discover ggml, I recommend having a look at this blog post: https://huggingface.co/blog/introduction-to-ggml |
Thanks - that looks a lot more helpful than what I managed to find in the ggml repo itself!

My main goal now is just to get the low-rank transformations working though, as being able to do "abliteration" via control vectors will likely make many more people aware of their other uses; this post outlines the link between the two. I'm not actually that interested in the "abliteration" stuff itself, but hopefully more people might become interested in it (currently I've no idea who downloads my control vectors, but there seem to be about 3 people max who have given any feedback.... lol). |
This post also explains why I would need to implement the Moore-Penrose inverse (or a least-squares solver) for this, and why it might be a good idea to completely lock down the algorithm before trying to implement it in ggml code... |
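(For context, the SVD route to the pseudo-inverse, and why small singular values are the problem:)

$$A = U \Sigma V^{\top} \;\Rightarrow\; A^{+} = V \Sigma^{+} U^{\top}, \qquad \Sigma^{+}_{ii} = \begin{cases} 1/\sigma_i & \sigma_i > 0 \\ 0 & \text{otherwise,} \end{cases}$$

so any error in a small $\sigma_i$ is amplified by $1/\sigma_i$ in $A^{+}$, which is why a truncated SVD of only the largest singular values is not enough here.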
Yeah, I just looked at this: the lora adapter code uses

```cpp
struct llama_lora_adapter_container : llama_lora_adapter_info {
struct llama_lora_adapter * adapter;
};
```

It looks a bit convoluted, but I will have a go at duplicating this setup for the control vectors. |
Just to clear this up: I am not sure what I said, but it is absolutely possible to evaluate the same graph any number of times. In transformer models the graph changes with each evaluation, due to the positions and the size of the context changing after every token, so we rebuild it every time; but if the graph does not change, it is not necessary to build it again. @jukofyork, prototyping with another library is probably a good idea. It shouldn't be too bad once you get used to it, but ggml certainly requires more boilerplate and may not be as flexible as other libraries. |
Yeah, I think in this case the matrix-vector products dominate all the other computations, as each one is O(n^2) where n = 12k for the largest models. I think if we could just wrap the matrix-vector product then everything else could be done cheaply on the CPU.

Also, since we have one of these to do for each layer, we could concatenate the columns of some or all of the (n, n) matrices and process them in one go. To compute a truncated SVD using the algorithm from the 1987 paper I linked above, we would also need to compute products with the transpose. This would have the disadvantage of not being able to stop early when a numerical tolerance is reached though... |
Hi, I'm willing to give it a try and work on this issue. I'm just a little lost about where attention is needed - is the pca.hpp implementation already complete? How are the SVD and UMAP implementations going? |
As noted by @jukofyork (comment above), the PCA implementation is not complete. I'm currently busy with other parts of llama.cpp, so I don't yet have any plan to fix this. Please have a look at this comment. |
Thank you for the response. I'll investigate the matter and try to finish the implementation then; I've been very interested in the linear algebra side of these algorithms :) |
llama_decode (ref: #7514)
Note: I'm writing this for tracking, but people are free to do PRs if you have any ideas to share. I'm planning to do some of the tasks in August 2024.