
ggml : update mul_mat_id to use the same tensor for all the experts #6387

Merged
merged 36 commits into master from sl/moe-rework-1 on Apr 3, 2024

Conversation

slaren
Collaborator

@slaren slaren commented Mar 29, 2024

Changes the storage of experts in memory from one tensor per expert to a single 3D tensor containing all the experts. This will allow us to support models with a large number of experts, such as qwen2moe.

Existing MoE model files (i.e. mixtral and grok) with split experts are still usable; however, for the CPU and Metal backends, it will cause the data to be copied to a buffer without mmap, which may increase load times slightly. Additionally, imatrices created after this change is merged cannot be used to quantize old models, and imatrices created with previous versions of llama.cpp cannot be used to quantize new models.
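
For illustration, here is a minimal self-contained C++ sketch of the layout change (toy sizes and plain std::vector, not the actual ggml API or tensor names):

    // Conceptual sketch only: expert weights move from one 2D matrix per expert
    // to a single stacked 3D tensor indexed by the router's expert id.
    #include <cstdio>
    #include <vector>

    int main() {
        const int n_expert = 8, n_ff = 6, n_embd = 4;   // toy sizes, not a real model's

        // Old layout: one [n_ff x n_embd] matrix per expert (n_expert separate tensors).
        std::vector<std::vector<float>> experts_split(n_expert, std::vector<float>(n_ff * n_embd, 0.0f));

        // New layout: a single 3D tensor [n_expert x n_ff x n_embd]; expert i is a contiguous slice.
        std::vector<float> experts_merged((size_t)n_expert * n_ff * n_embd, 0.0f);

        const int expert_id = 3;   // id produced by the MoE router
        const int r = 2, c = 1;    // row/column inside the selected expert's matrix
        experts_split[expert_id][r * n_embd + c] = 1.5f;
        experts_merged[(size_t)expert_id * n_ff * n_embd + r * n_embd + c] = 1.5f;

        printf("same element via both layouts: %.1f %.1f\n",
               experts_split[expert_id][r * n_embd + c],
               experts_merged[(size_t)expert_id * n_ff * n_embd + r * n_embd + c]);
        return 0;
    }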

Fixes #6082

  • CPU backend
  • CUDA backend
  • Metal backend
  • Merge experts into a single tensor in convert scripts
    • convert.py
    • convert-hf-to-gguf.py mixtral
    • convert-hf-to-gguf.py grok
  • Clean up loading code
  • imatrix
  • quantize : handle imatrix correctly with 3D tensors

slaren and others added 3 commits March 30, 2024 11:42
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@slaren
Collaborator Author

slaren commented Mar 31, 2024

I tried to update convert-hf-to-gguf.py, but eventually it crashes with SIGKILL. I think it is running out of memory because the merged expert tensors stay in memory. Somebody with more experience with Python will need to figure out how to fix that.

Also for grok, I don't even have enough free disk space to try that one.

@ggerganov
Owner

convert-hf-to-gguf.py finishes successfully and the converted model runs correctly, but it does take a lot of memory to convert. I will take a look as well, but my Python is also not great. Pinging @cebtenzzre

@slaren
Collaborator Author

slaren commented Mar 31, 2024

Ok, that's good to know. I guess the same code should be usable for grok, but the memory usage may be a problem. For mixtral, convert.py also works with HF models anyway, and it doesn't have the memory usage problem.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@foldl
Contributor

foldl commented Apr 1, 2024

How about making a new op and keeping ggml_mul_mat_id for backward compatibility?

@slaren
Collaborator Author

slaren commented Apr 1, 2024

I think it is better to convert the weights to the new format at load time than to maintain multiple versions of the op. I also plan to make further changes to ggml_mul_mat_id in a follow-up PR to allow all the experts to be computed in a single call, so that would be yet another version of the op to maintain.
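
As a rough sketch of that load-time conversion (hypothetical helper names, not llama.cpp's actual loader code), the merge amounts to copying each per-expert matrix into its slice of one contiguous buffer:

    // Hypothetical sketch: copy n_expert separate [n_rows x n_cols] matrices into one
    // contiguous [n_expert x n_rows x n_cols] buffer, so expert e becomes slice e.
    #include <cstring>
    #include <vector>

    std::vector<float> merge_experts(const std::vector<std::vector<float>> &split, size_t n_rows, size_t n_cols) {
        const size_t per_expert = n_rows * n_cols;
        std::vector<float> merged(split.size() * per_expert);
        for (size_t e = 0; e < split.size(); ++e) {
            // this copy is also why mmap cannot be kept for old split-tensor files
            std::memcpy(merged.data() + e * per_expert, split[e].data(), per_expert * sizeof(float));
        }
        return merged;
    }

    int main() {
        std::vector<std::vector<float>> split = {{1, 2, 3, 4}, {5, 6, 7, 8}};  // 2 toy experts, 2x2 each
        std::vector<float> merged = merge_experts(split, 2, 2);
        return merged.size() == 8 ? 0 : 1;
    }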

@foldl
Contributor

foldl commented Apr 1, 2024

IMHO, developers who use ggml would prefer backward compatibility.

@ggerganov
Owner

We generally try to maintain backward compatibility to a good extent, but sometimes it can be too difficult. MoE functionality is on track to become standard, so we need good support for it. Maintaining duplicate kernels would significantly increase the required effort.

@foldl
Contributor

foldl commented Apr 1, 2024

In this case, if the original ggml_mul_mat_id is kept, it can be marked as deprecated and no longer maintained.

Pros:

  • Perfect backward compatibility;
  • No need to merge experts into a single tensor at runtime, mmap works as before.

Cons:

  • The name ggml_mul_mat_id is wasted.

Anyway, if a breaking change is unavoidable, then just do it as soon as possible.

@ggerganov ggerganov merged commit 08a0c02 into master Apr 3, 2024
60 of 61 checks passed
@sorasoras

Good to know, maybe it deserves a breaking change label?

Yes. It is actually a bit worse than that. After this change, an imatrix generated with an old model cannot be used to quantize that same model. The imatrix will only work with newer models, regardless of what model was used to create it.

If my understanding is correct, this only affects MoE models, right?

@ggerganov
Owner

Yes

@JMPSequeira

I'm sorry, but as a heavy user of Mixtral 8x7B Instruct, can someone tell me:
Does this require a reconversion from the HF model to GGUF, or can I use the same fp16 GGUF I already had?
Are there performance benefits from this PR?
Are there any special flags needed during the convert-to-quantize flow?

@slaren slaren deleted the sl/moe-rework-1 branch April 4, 2024 12:07
@slaren
Collaborator Author

slaren commented Apr 4, 2024

If you want to create new quants with an imatrix, you need to convert the model again from an HF or pth model; you cannot use the same fp16 gguf. You don't need to use any special flags.

@maddes8cht
Contributor

Thanks for this work.
Again, also being a heavy user of Mixtral 8x7B (Instruct):
After doing the reconversion, can I expect some kind of performance benefit from this PR?

@slaren
Collaborator Author

slaren commented Apr 5, 2024

If you are fully offloading the model, there is very little advantage to converting the model again. For CPU and Metal, converting allows using mmap, which can improve load times (especially if you are restarting llama.cpp repeatedly), but in most cases it won't really matter. For performance benefits, see #6505 (but that also works with older models).

@he29-net

will cause the data to be copied to a buffer without mmap, which may increase load times slightly

I would say that is a slight understatement. In my case, for Mixtral 8x7B, the load time increased from a few seconds to 3 minutes or so.

At first I thought my HW was dying, because the loading process was accompanied by a flood of kernel messages (related to Intel IOMMU*) so severe that it made the machine unresponsive and consumed all available space in /var/log within a few minutes.

That turned out to be likely an unrelated issue, perhaps just randomly triggered by the way llama.cpp uses ROCm while loading the model. Having ruled out a HW problem, it occurred to me to try an older build. And sure enough, the slow loading went away, and git bisect eventually led me to this commit.

It might have been a good idea to print a warning to the log when a model starts being loaded in "compatibility mode". It would have saved me a lot of frustration and time. :) (But I have only myself to be angry at, because I even remember seeing this PR and the notes about breaking compatibility.)


*) In case anyone else on ROCm and Intel CPU also experiences a flood of messages like:

DMAR: ERROR: DMA PTE for vPFN 0x7bf32 already set (to 7bf32003 not 24c563801)

followed by a stack trace and other related info for a bug in drivers/iommu/intel/iommu.c, try adding iommu.passthrough=1 to your kernel boot parameters. This should bypass some DMA mapping stuff done by the OS, avoiding the issue. I don't use virtualization so I can't say if it breaks anything else, but the flood went away and so far I don't see any other side effects.

@slaren
Collaborator Author

slaren commented Apr 12, 2024

It is impossible to account for the unpredictability of AMD drivers; this change doesn't cause any meaningful overhead when offloading an old model. Without offloading, and with Metal, it may cause the model to be evicted from the system cache if you don't have enough memory for two copies. This will increase the load time of a second run, but the time of the first load should be essentially the same.

@he29-net

evicted from the system cache if you don't have enough memory for two copies

Ah, I suppose that would be the main slowdown in my case. I did not realize the conversion does not happen in place. Thanks for the background.

Agreed on the AMD drivers though; they still have plenty of work to do to make the whole stack rock solid and reliable (e.g., the GPU in my new AMD laptop has crashed in three different ways in the past two months).

@slaren
Collaborator Author

slaren commented Apr 12, 2024

If you are not offloading a large portion of the model while using a CUDA or HIP build, one difference is that it will try to allocate a pinned buffer for the CPU portion of the model (this is the same as what happens when disabling mmap). On some systems this can cause instability if there is too much pinned memory, since this memory is not available to other processes. You can disable this behavior by defining the environment variable GGML_CUDA_NO_PINNED, at the expense of slightly slower prompt processing performance. When using mmap (either with an older version, or with a new model), the CPU buffer is not pinned.
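
As an illustrative sketch of that decision (not the actual llama.cpp/ggml code, just the logic described above):

    // Illustrative only: pick a pinned host buffer unless mmap is in use or the
    // GGML_CUDA_NO_PINNED environment variable (mentioned above) is set.
    #include <cstdio>
    #include <cstdlib>

    enum class host_buffer_type { pinned, pageable };

    host_buffer_type choose_host_buffer(bool use_mmap) {
        if (use_mmap) return host_buffer_type::pageable;                            // mmap'd models keep their pageable mapping
        if (std::getenv("GGML_CUDA_NO_PINNED")) return host_buffer_type::pageable;  // user opted out of pinning
        return host_buffer_type::pinned;                                            // default: pinned for faster host<->device copies
    }

    int main() {
        printf("%s\n", choose_host_buffer(false) == host_buffer_type::pinned ? "pinned" : "pageable");
        return 0;
    }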

@he29-net

he29-net commented Apr 13, 2024

Now I realize I first misunderstood your comment about the model being evicted from the cache; I'm loading from a fast SSD, so the cache alone would not explain such a big slowdown. But what you say about pinned memory probably explains it: I'm offloading about 10 of 33 layers to the GPU, and the portion left to the CPU takes up almost all the RAM. So if the CPU portion is pinned, there is barely anything left to work with, and the slowdown is probably caused by swapping. Indeed, after setting GGML_CUDA_NO_PINNED, the loading time becomes reasonable even with an old model + new commit.

EDIT: Now reading about memory pinning, and the IOMMU errors may be related after all, since another, less common message present in the flood was "amdgpu: init_user_pages: Failed to get user pages: -1". And get_user_pages() is mentioned right there, in the article about pinning... You learn something every day. :)

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
…6387)

* ggml : update mul_mat_id to use the same tensor for all the experts

* update cuda

* minor

* update metal

* update test-backend-ops

* fix cuda

* Update ggml-metal.m

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update convert.py

* update convert-hf-to-gguf.py

* update convert.py for mixtral hf models

* Update convert-hf-to-gguf.py

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* cuda : support non-pow-2 number of experts

* allow quantize to work for split and merged experts models in the same way

* cleanup + disable mmap automatically with split tensors models

* update imatrix

* test-backend-ops : test qwen argsort

* update grok model loading

* llama : add merged experts tensors to the grok tensor map

* minor

* gguf : bump version

* fix quantizing of merged experts

* convert-hf-to-gguf.py : update grok (untested)

* make linter happy

* cuda/argsort : use shared memory instead of pool memory

* convert : fix grok tensor names

* metal : add support for non-pow-2 argsort

* llama : more loader cleanup, better error checking

* cuda : fix warning

* llama : still use mmap for loading old models, but copy the data to a host buffer

* add review note

* llama : remove ffn tensor counting + add sanity check

ggml-ci

* convert : fix handling of n_experts == None

ggml-ci

* imatrix : fix ncall counters

* llama : produce error if imatrix size does not match

* quantize : terminate on errors + trace logs

ggml-ci

* metal : pad shared memory to 16 bytes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@yuhai-china

This is a very bad change, since lots of Mixtral models need to be created again; please consider backward compatibility, as llama.cpp is so popular.

@jukofyork
Contributor

jukofyork commented May 5, 2024

It's possible I have found a problem with the MoE imatrix calculations as a result of this PR's changes, but I posted the info in #6515 as DBRX was causing me the problems.

I don't really know what the intentions are with the weighting (should less-selected experts have lower importance now, etc.), and the actual quant code is very obtuse, but it looks a lot like the experts' MLP weights are all getting downscaled by a factor of n_experts too much, and it's possibly not just a benign change of scale for all weights in the same tensor. From ggml-quants.c:

        const float * xbl = x + QK_K*ibl;
        float sumx2 = 0;
        for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
        float sigma2 = 2*sumx2/QK_K;

        for (int ib = 0; ib < QK_K/32; ++ib) {
            const float * xb = xbl + 32*ib;
            if (quant_weights) {
                const float * qw = quant_weights + QK_K*ibl + 32*ib;
                for (int i = 0; i < 32; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
            } else {
                for (int i = 0; i < 32; ++i) weight[i] = xb[i]*xb[i];
            }

@slaren
Collaborator Author

slaren commented May 5, 2024

Hi @jukofyork, thanks for looking into this. It's very possible that this PR introduced some issue in the imatrix generation of MoE models. I tried to maintain the previous behavior, but I don't know how it works and the code is entirely uncommented, so I depend on code review.

@jukofyork
Contributor

jukofyork commented May 5, 2024

Hi @jukofyork, thanks for looking into this. It's very possible that this PR introduced some issue in the imatrix generation of MoE models. I tried to maintain the previous behavior, but I don't know how it works and the code is entirely uncommented, so I depend on code review.

Hi, I just tested this change to quantize and it does look like this helps:

                for (int j = 0; j < (int)src1->ne[0]; ++j) {
                    e.values[e_start + j] += (x[j]*x[j])*static_cast<float>(n_as);
                }

and it definitely isn't just a benign change of scale (the same model acts quite differently). The dbrx model is so flaky, though, that it's hard to tell, so I'm going to give mixtral-8x22b-instruct another try (it was completely broken for me before) and see what happens.

Is the intention that less-selected expert MLPs are to have their weights downscaled proportionally? I assume under the old scheme the allocation of bits was done on a per-tensor basis, so this is a pretty big change: even with the fix above, bits are now allocated based on the proportion of times the top-k gating network selects an expert's MLP. I can see arguments both for and against this though, so it's not 100% clear which would be best...
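
As a toy numeric illustration of that trade-off (standalone sketch with made-up numbers, not imatrix.cpp itself): if importance is accumulated as a raw sum of x*x, an expert the router picks less often ends up with proportionally smaller entries than one picked more often, unless each sum is normalized by the number of activations that expert actually saw:

    #include <cstdio>

    int main() {
        const int calls_expert_a = 100;  // frequently selected expert
        const int calls_expert_b = 10;   // rarely selected expert
        const float x = 1.0f;            // pretend every activation has the same magnitude

        float sum_a = 0.0f, sum_b = 0.0f;
        for (int i = 0; i < calls_expert_a; ++i) sum_a += x * x;
        for (int i = 0; i < calls_expert_b; ++i) sum_b += x * x;

        printf("raw sums:         A=%.0f B=%.0f  (B looks 10x less important)\n", sum_a, sum_b);
        printf("count-normalized: A=%.1f B=%.1f  (identical per-activation importance)\n",
               sum_a / calls_expert_a, sum_b / calls_expert_b);
        return 0;
    }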

If the old behaviour is to be restored, then it's not going to be easy to pass a vector of ncall without a breaking change to the imatrix file format, but I think this:

struct Stats {
    std::vector<float> values;
    int ncall = 0;
};

Could be adapted inside of imatrix.cpp to keep track and then each time save_imatrix() is called, apply a correction factor so that the division by ncall down in quantize generates the correct result.

There is probably a far better way to do both these fixes by altering collect_imatrix() though. With:

        ++e.ncall;
        // NOTE: since we select top-k experts, the number of calls for the expert tensors will be k times larger
        //       using the following line, we can correct for that if needed by replacing the line above with:
        //if (idx == t->src[0]->ne[0] - 1) ++e.ncall;

also needing a look. This actually makes the printout confusing too:

compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 248.05 ms
compute_imatrix: computing over 91 chunks with batch_size 512
compute_imatrix: 36.59 seconds per pass - ETA 55.48 minutes
[1]6.8864,[2]5.5590,
save_imatrix: stored collected data after 10 chunks in dbrx:16x12b-instruct-f16.imatrix
[3]4.6385,[4]5.2093,
save_imatrix: stored collected data after 20 chunks in dbrx:16x12b-instruct-f16.imatrix
[5]5.6050,[6]4.6732,[7]4.7876,
save_imatrix: stored collected data after 30 chunks in dbrx:16x12b-instruct-f16.imatrix
[8]5.3775,[9]5.6677,

@slaren
Collaborator Author

slaren commented May 5, 2024

The intention was that during quantization, each expert is quantized separately using the fragment of the imatrix that corresponds to that expert:

llama.cpp/llama.cpp

Lines 14894 to 14900 in 628b299

for (int64_t i03 = 0; i03 < tensor->ne[2]; ++i03) {
    const float * f32_data_03 = f32_data + i03 * nelements_matrix;
    void * new_data_03 = (char *)new_data + ggml_row_size(new_type, n_per_row) * i03 * nrows;
    const float * imatrix_03 = imatrix ? imatrix + i03 * n_per_row : nullptr;
    new_size += llama_tensor_quantize_internal(new_type, f32_data_03, new_data_03, chunk_size, nrows, n_per_row, imatrix_03, workers, nthread_use);
}

But I may have gotten that wrong. Ultimately the goal was to preserve the same behavior.

@jukofyork
Contributor

jukofyork commented May 5, 2024

I think this should maintain the old "per-expert" scaling behaviour:

struct Stats {
    std::vector<float> values;
    std::vector<int> counts; // +++
    int ncall = 0;
};
        //++e.ncall; // ---
        // NOTE: since we select top-k experts, the number of calls for the expert tensors will be k times larger
        //       using the following line, we can correct for that if needed by replacing the line above with:
        if (idx == t->src[0]->ne[0] - 1) ++e.ncall; // +++
            if (e.values.empty()) {
                e.values.resize(src1->ne[0]*n_as, 0);
                e.counts.resize(src1->ne[0]*n_as, 0); // +++
            }
.
.
.
                for (int j = 0; j < (int)src1->ne[0]; ++j) {
                    e.values[e_start + j] += x[j]*x[j];
                    e.counts[e_start + j]++; // +++
                }
        if (e.values.empty()) {
            e.values.resize(src1->ne[0], 0);
            e.counts.resize(src1->ne[0], 0); // +++
        }
.
.
.
            for (int j = 0; j < (int)src1->ne[0]; ++j) {
                e.values[j] += x[j]*x[j];
                e.counts[j]++; // +++
            }
void IMatrixCollector::save_imatrix(const char * fname) const {
    std::ofstream out(fname, std::ios::binary);
    int n_entries = m_stats.size();
    out.write((const char*)&n_entries, sizeof(n_entries));
    for (auto& p : m_stats) {
        int len = p.first.size();
        out.write((const char*)&len, sizeof(len));
        out.write(p.first.c_str(), len);
        out.write((const char*)&p.second.ncall, sizeof(p.second.ncall));
        int nval = p.second.values.size();
        // +++
        std::vector<float> tmp(nval);
        for (int i = 0; i < nval; i++) {
            tmp[i] = (p.second.values[i] / static_cast<float>(p.second.counts[i])) * static_cast<float>(p.second.ncall);
        }
        out.write((const char*)&nval, sizeof(nval));
        if (nval > 0) out.write((const char*)tmp.data(), nval*sizeof(float));
        // +++
    }
    if (m_params.verbosity > 0) {
        fprintf(stderr, "\n%s: stored collected data after %d chunks in %s\n",__func__,m_last_call,fname);
    }
}

and shouldn't require any breaking changes to the imatrix file format, should work correctly when joining imatrix files of differing sample sizes, and the for (auto& v : e) v /= ncall should work as expected in quantize without any changes needed.

It's O(d) extra storage currently (for simplicity), but could easily be reduced to O(n_experts) later, as all the counts will be the same within each expert.
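
A minimal sketch of what that O(n_experts) variant could look like (hypothetical names, not the actual imatrix.cpp structures): since every column of an expert's slice is incremented together, one counter per expert is enough:

    #include <cstddef>
    #include <vector>

    struct StatsPerExpertCounts {
        std::vector<float> values;  // n_expert * n_per_row accumulated x*x values
        std::vector<int>   counts;  // one entry per expert instead of one per value
        int ncall = 0;
    };

    // Accumulate one row of activations x (length n_per_row) for expert `ex`.
    void accumulate(StatsPerExpertCounts &e, const float *x, int n_per_row, int ex, int n_expert) {
        if (e.values.empty()) {
            e.values.assign((size_t)n_expert * n_per_row, 0.0f);
            e.counts.assign(n_expert, 0);
        }
        const size_t start = (size_t)ex * n_per_row;
        for (int j = 0; j < n_per_row; ++j) e.values[start + j] += x[j] * x[j];
        e.counts[ex]++;  // a single per-expert counter replaces the per-value counts
    }

    int main() {
        StatsPerExpertCounts stats;
        const float row[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        accumulate(stats, row, 4, /*ex=*/1, /*n_expert=*/8);
        return stats.counts[1] == 1 ? 0 : 1;
    }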

I'll re-quant dbrx and mixtral-8x22b and report back (possibly tomorrow now).


It also fixes the weird block-saving behaviour for MoEs:

[1]6.8864,[2]5.5590,[3]4.6385,[4]5.2093,[5]5.6050,[6]4.6732,[7]4.7876,[8]5.3775,[9]5.6677,
save_imatrix: stored collected data after 10 chunks in dbrx:16x12b-instruct-f16.imatrix

@jukofyork
Contributor

jukofyork commented May 5, 2024

The intention was that during quantization, each expert is quantized separately using the fragment of the imatrix that corresponds to that expert:

llama.cpp/llama.cpp

Lines 14894 to 14900 in 628b299

for (int64_t i03 = 0; i03 < tensor->ne[2]; ++i03) {
    const float * f32_data_03 = f32_data + i03 * nelements_matrix;
    void * new_data_03 = (char *)new_data + ggml_row_size(new_type, n_per_row) * i03 * nrows;
    const float * imatrix_03 = imatrix ? imatrix + i03 * n_per_row : nullptr;
    new_size += llama_tensor_quantize_internal(new_type, f32_data_03, new_data_03, chunk_size, nrows, n_per_row, imatrix_03, workers, nthread_use);
}

But I may have gotten that wrong. Ultimately the goal was to preserve the same behavior.

Sorry, missed your post. Looking some more here:

        const float * xbl = x + QK_K*ibl;
        float sumx2 = 0;
        for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
        float sigma2 = 2*sumx2/QK_K;

        for (int ib = 0; ib < QK_K/32; ++ib) {
            const float * xb = xbl + 32*ib;
            if (quant_weights) {
                const float * qw = quant_weights + QK_K*ibl + 32*ib;
                for (int i = 0; i < 32; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
            } else {
                for (int i = 0; i < 32; ++i) weight[i] = xb[i]*xb[i];
            }

then a change of scale shouldn't affect weight[i] so long as it is normalised later on, as it will just scale all the values in the 32-element block by the same constant.
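
A quick standalone check of that statement (toy block size, not the real quantization code): scaling every qw[i] in the block by a constant scales every weight[i] by the same constant, so the ratios within the block are unchanged:

    #include <cmath>
    #include <cstdio>

    int main() {
        const int n = 8;                       // small stand-in for a 32-element block
        float xb[n], qw[n], w1[n], w2[n];
        for (int i = 0; i < n; ++i) { xb[i] = 0.1f * (i + 1); qw[i] = 1.0f + i; }

        float sigma2 = 0.0f;
        for (int i = 0; i < n; ++i) sigma2 += xb[i] * xb[i];
        sigma2 = 2 * sigma2 / n;

        const float c = 16.0f;                 // arbitrary uniform scale applied to qw
        for (int i = 0; i < n; ++i) {
            w1[i] = qw[i]     * std::sqrt(sigma2 + xb[i] * xb[i]);
            w2[i] = qw[i] * c * std::sqrt(sigma2 + xb[i] * xb[i]);
            printf("w2[%d]/w1[%d] = %.3f\n", i, i, w2[i] / w1[i]);   // always prints 16.000
        }
        return 0;
    }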

BUT: I did get a different result earlier by multiplying all e.values[e_start + j] += x[j]*x[j] by 16, which makes me think that somewhere it is important to maintain the absolute scales...

Perhaps as a test it might be worth rounding each of the 16 experts differently and then checking all the complicated slicing loops are really dealing with the experts as expected using asserts up until they get quantized?

@jukofyork
Contributor

jukofyork commented May 5, 2024

This commented-out bit of code didn't work:

        //++e.ncall;
        // NOTE: since we select top-k experts, the number of calls for the expert tensors will be k times larger
        //       using the following line, we can correct for that if needed by replacing the line above with:
        if (idx == t->src[0]->ne[0] - 1) ++e.ncall;

Just got ncall = 0 saved.


I'm still no wiser what t->src[0]->ne[0] is holding and have been staring at this for 30 minutes 🤣

Luckily by printing out some debugging info found this also works:

if (idx == 0) ++e.ncall;

(but probably triggers the if (e.ncall > m_last_call) ... save_imatrix() condition earlier than it should)

I guess t->src[0]->ne[0] should have held the number of top-k experts, but it's not quite this?

@jukofyork
Contributor

It took a lot of digging, but I think I can see where absolute scales matter:

static void quantize_row_q4_0_impl(const float * restrict x, block_q4_0 * restrict y, int64_t n_per_row, const float * quant_weights) {
    static_assert(QK4_0 == 32, "QK4_0 must be 32");
    
    if (!quant_weights) {
        quantize_row_q4_0_reference(x, y, n_per_row);
        return;
    }   
        
    float weight[QK4_0]; 
    int8_t L[QK4_0];
        
    float sum_x2 = 0;
    for (int j = 0; j < n_per_row; ++j) sum_x2 += x[j]*x[j];
    float sigma2 = sum_x2/n_per_row;
        
    const int64_t nb = n_per_row/QK4_0;
    for (int ib = 0; ib < nb; ++ib) {
        const float * xb = x + QK4_0 * ib;
        const float * qw = quant_weights + QK4_0 * ib;
        for (int j = 0; j < QK4_0; ++j) weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]);
        float d = make_qx_quants(QK4_0, 8, xb, L, 1, weight); 
        y[ib].d = GGML_FP32_TO_FP16(d);
        for (int j = 0; j < 16; ++j) {
            y[ib].qs[j] = L[j] | (L[j+16] << 4);
        }
    }    
}       

The important bit: weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]).

static float make_qx_quants(int n, int nmax, const float * restrict x, int8_t * restrict L, int rmse_type,
        const float * restrict qw) {
    float max = 0;
    float amax = 0;
    for (int i = 0; i < n; ++i) {
        float ax = fabsf(x[i]);
        if (ax > amax) { amax = ax; max = x[i]; }
    }   
    if (amax < 1e-30f) { // all zero
        for (int i = 0; i < n; ++i) {
            L[i] = 0;
        }   
        return 0.f;
    }   
    float iscale = -nmax / max;
    if (rmse_type == 0) {
        for (int i = 0; i < n; ++i) {
            int l = nearest_int(iscale * x[i]);
            L[i] = nmax + MAX(-nmax, MIN(nmax-1, l));
        }
        return 1/iscale;
    }
    bool return_early = false;
    if (rmse_type < 0) {
        rmse_type = -rmse_type;
        return_early = true;
    }
    float sumlx = 0; 
    float suml2 = 0;
#ifdef HAVE_BUGGY_APPLE_LINKER
    // use 'volatile' to prevent unroll and work around a bug in Apple ld64 1015.7
    for (volatile int i = 0; i < n; ++i) {
#else
    for (int i = 0; i < n; ++i) {
#endif          
        int l = nearest_int(iscale * x[i]);
        l = MAX(-nmax, MIN(nmax-1, l));
        L[i] = l + nmax;
        float w = qw ? qw[i] : rmse_type == 1 ? x[i] * x[i] : rmse_type == 2 ? 1 : rmse_type == 3 ? fabsf(x[i]) : sqrtf(fabsf(x[i]));
        sumlx += w*x[i]*l;
        suml2 += w*l*l;
    }   
    float scale = sumlx/suml2;
    if (return_early) return suml2 > 0 ? 0.5f*(scale + 1/iscale) : 1/iscale;
    float best = scale * sumlx;
    for (int is = -9; is <= 9; ++is) {
        if (is == 0) {
            continue;
        }
        iscale = -(nmax + 0.1f*is) / max;
        sumlx = suml2 = 0;
        for (int i = 0; i < n; ++i) {
            int l = nearest_int(iscale * x[i]);
            l = MAX(-nmax, MIN(nmax-1, l));
            float w = qw ? qw[i] : rmse_type == 1 ? x[i] * x[i] : rmse_type == 2 ? 1 : rmse_type == 3 ? fabsf(x[i]) : sqrtf(fabsf(x[i]));
            sumlx += w*x[i]*l;
            suml2 += w*l*l;
        }
        if (suml2 > 0 && sumlx*sumlx > best*suml2) {
            for (int i = 0; i < n; ++i) {
                int l = nearest_int(iscale * x[i]);
                L[i] = nmax + MAX(-nmax, MIN(nmax-1, l));
            }
            scale = sumlx/suml2; best = scale*sumlx;
        }
    }
    return scale;
}

So scale = sumlx/suml2 does indeed cancel out an arbitrary scale factor applied to all of qw, but:

best = scale*sumlx = (sumlx/suml2)*sumlx = sumlx^2 / suml2.

and the non-linearity of sumlx^2 means this will affect this loop (I think - this code is hard to follow!).

@jukofyork
Contributor

jukofyork commented May 6, 2024

I added pull request #7099, but it shouldn't just be merged as is - it unnecessarily doubles the memory overhead. Hopefully it can be used as a test to refactor against, etc.

@jukofyork
Contributor

I can confirm this is actually doing something useful: before, Mixtral-8x22b-Instruct was a broken mess that kept stopping mid-sentence all the time, and now it appears to work normally.

Labels
breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility.
Development

Successfully merging this pull request may close these issues.

llama : combine expert tensors into a single tensor