merge upstream #36

Merged · 49 commits · Sep 8, 2024
Commits on Sep 2, 2024

  1. llama : minor style

    ggerganov committed Sep 2, 2024 (c6d4cb4)
  2. build(nix): Package gguf-py (ggerganov#5664)

    * style: format with nixfmt/rfc101-style
    
    * build(nix): Package gguf-py
    
    * build(nix): Refactor to new scope for gguf-py
    
    * build(nix): Exclude gguf-py from devShells
    
    * build(nix): Refactor gguf-py derivation to take in exact deps
    
    * build(nix): Enable pytestCheckHook and pythonImportsCheck for gguf-py
    
    * build(python): Package python scripts with pyproject.toml
    
    * chore: Cleanup
    
    * dev(nix): Break up python/C devShells
    
    * build(python): Relax pytorch version constraint
    
    Nix has an older version
    
    * chore: Move cmake to nativeBuildInputs for devShell
    
    * fmt: Reconcile formatting with rebase
    
    * style: nix fmt
    
    * cleanup: Remove unnecessary __init__.py
    
    * chore: Suggestions from review
    
    - Filter out non-source files from llama-scripts flake derivation
    - Clean up unused closure
    - Remove scripts devShell
    
    * revert: Bad changes
    
    * dev: Simplify devShells, restore the -extra devShell
    
    * build(nix): Add pyyaml for gguf-py
    
    * chore: Remove some unused bindings
    
    * dev: Add tiktoken to -extra devShells
    ditsuke authored Sep 2, 2024 (9c1ba55)
  3. b60074f
  4. server : refactor multitask handling (ggerganov#9274)

    * server : remove multitask from server_task
    
    * refactor completions handler
    
    * fix embeddings
    
    * use res_ok everywhere
    
    * small change for handle_slots_action
    
    * use unordered_set everywhere
    
    * (try) fix test
    
    * no more "mutable" lambda
    
    * Apply suggestions from code review
    
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    
    * use deque
    
    ---------
    
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    ngxson and ggerganov authored Sep 2, 2024 (6e7d133)
  5. f771d06
  6. 048de84
  7. f148516
  8. 48baa61

Commits on Sep 3, 2024

  1. b69a480
  2. llama-bench : add JSONL (NDJSON) output mode (ggerganov#9288)

    * llama-bench : add JSONL (NDJSON) output mode
    
    * llama-bench : update usage docs
    akx authored Sep 3, 2024 (8962422)
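In practice, JSONL (NDJSON) output means one self-contained JSON object per line, so results can be streamed or appended without parsing the whole file. A minimal sketch of the idea, with hypothetical field names rather than llama-bench's actual schema:

```cpp
// Illustrative only: emit one JSON object per line (JSONL / NDJSON).
// Field names and values are made up, not llama-bench's real schema.
#include <cstdio>

struct bench_result {
    const char * model;
    int          n_tokens;
    double       tokens_per_sec;
};

static void print_jsonl(const bench_result & r) {
    // one complete object per line; consumers can process line by line
    printf("{\"model\":\"%s\",\"n_tokens\":%d,\"t_per_s\":%.2f}\n",
           r.model, r.n_tokens, r.tokens_per_sec);
}

int main() {
    print_jsonl({"model-a-q4_0", 512, 38.75});
    print_jsonl({"model-a-q8_0", 512, 25.10});
    return 0;
}
```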
  3. flake.lock: Update (ggerganov#9261)

    Flake lock file updates:
    
    • Updated input 'flake-parts':
        'github:hercules-ci/flake-parts/8471fe90ad337a8074e957b69ca4d0089218391d?narHash=sha256-XOQkdLafnb/p9ij77byFQjDf5m5QYl9b2REiVClC%2Bx4%3D' (2024-08-01)
      → 'github:hercules-ci/flake-parts/af510d4a62d071ea13925ce41c95e3dec816c01d?narHash=sha256-ODYRm8zHfLTH3soTFWE452ydPYz2iTvr9T8ftDMUQ3E%3D' (2024-08-30)
    • Updated input 'nixpkgs':
        'github:NixOS/nixpkgs/c374d94f1536013ca8e92341b540eba4c22f9c62?narHash=sha256-Z/ELQhrSd7bMzTO8r7NZgi9g5emh%2BaRKoCdaAv5fiO0%3D' (2024-08-21)
      → 'github:NixOS/nixpkgs/71e91c409d1e654808b2621f28a327acfdad8dc2?narHash=sha256-GnR7/ibgIH1vhoy8cYdmXE6iyZqKqFxQSVkFgosBh6w%3D' (2024-08-28)
    
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
    ggerganov and github-actions[bot] authored Sep 3, 2024 (7605ae7)

Commits on Sep 4, 2024

  1. 9379d3c
  2. rpc : make RPC servers come first in the device list (ggerganov#9296)

    * rpc : make RPC servers come first in the device list
    
    * rpc : disable options for non-RPC builds
    
    * rpc : rpc_count always zero for non-RPC builds
    rgerganov authored Sep 4, 2024 (82e3b03)
  3. c8671ae
  4. [SYCL] Fix DMMV dequantization (ggerganov#9279)

    Fixed DMMV dequantization for ncols == GGML_SYCL_DMMV_X
    OuadiElfarouki authored Sep 4, 2024 (5910ea9)
  5. ggml : AVX2 support for Q4_0_8_8 (ggerganov#8713)

    * Add AVX2 based implementations for quantize_q8_0_4x8, ggml_gemv_q4_0_8x8_q8_0 and ggml_gemm_q4_0_8x8_q8_0 functions
    
    * Update code to fix issues occurring due to non-alignment of elements to be processed as a multiple of 16 in MSVC
    
    * Update comments and indentation
    
    * Make updates to reduce number of load instructions
    Srihari-mcw authored Sep 4, 2024 (581c305)

Commits on Sep 5, 2024

  1. bdf314f
  2. 4db0478
  3. 1031771
  4. Update build.yml (ggerganov#9184)

    build rpc-server for windows cuda
    awatuna authored Sep 5, 2024 (32b2ec8)

Commits on Sep 6, 2024

  1. ggml-quants : ternary packing for TriLMs and BitNet b1.58 (ggerganov#8151)
    
    * ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b
    
    * ggml-quants : faster 1.625 bpw AVX2 vec_dot
    
    Not using a lookup table anymore makes it match q4_0 speed.
    
    * gguf-py : fix formatting
    
    * llama : remove spaces on empty line
    
    * ggml-quants : subtract 1 when back in epi8
    
    This makes the 1.625 bpw type go faster than q4_0. Still not the fastest.
    
    * ggml-quants : Q2_2 now faster than Q4_K with AVX2
    
    * ggml-quants : cleanup Q1_3 code formatting
    
    * ggml-quants : ARM NEON vec_dot for q2_2 and q1_3
    
    * ggml-quants : use ceiling division when quantizing q1_3
    
    * convert-hf : simplify BitNet pre-quantization
    
    This still results in the exact same tensor weights and scales,
    but it reveals some weirdness in the current algorithm.
    
    * convert-hf : allow converting the weird BitNet 1.3B
    
    Its FFN size is 5460, which is not convenient.
    The offending tensors are kept in F16,
    which makes the final model 5.01 bpw.
    
    * bitnet : replace 1.58b with b1.58, as in the paper
    
    * ggml-quants : fix build failure on Windows
    
    * ggml-quants : attempt to fix Arm 32-bit support
    
    * ggml : add some informative comments in q1_3 vec_dot
    
    * ggml : add TQ1_0 and TQ2_0 ternary quantization types
    
    * ggml : even faster TQ2_0
    
    * ggml : also faster TQ1_0
    
    Same optimization as for TQ2_0 by offsetting the sum instead of the weights.
    This makes TQ1_0 almost as fast as Q8_0 on AVX2.
    
    * ggml : fix build issues in certain environments
    
    * ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0
    
    * ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat
    
    The compiler seems smart enough to use the same instruction
    even when using vget_high_s8 instead.
    
    * ggml : remove q1_3 and q2_2
    
    No more 1.625 bpw and 2.000 bpw,
    now instead using 1.6875 bpw and 2.0625 bpw
    with TQ1_0 and TQ2_0, respectively.
    
    * llama : remove the separate scale tensors of BitNet b1.58
    
    They won't be needed, since the remaining ternary quant types have
    built-in scales.
    
    * ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency
    
    * ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot
    
    Not yet tested on hardware which supports it,
    might not work or might not even compile. But also it might.
    It should make the performance better on recent ARM CPUs.
    
    * ggml-quants : remove comment about possible format change of TQ2_0
    
    Making it slightly more convenient for AVX512
    but less convenient for everything else is not worth the trouble.
    
    * gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0
    
    * ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0
    
    This does not change anything for ternary models,
    since their values should never end up being in halfway cases anyway.
    
    * convert : allow direct conversion to TQ1_0 and TQ2_0
    
    The token embeddings and output tensors are kept in F16
    to allow quantizing them to Q4_K and Q6_K with llama-quantize.
    
    * llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0
    
    Q4_0 is not completely symmetric (so not lossless for ternary models),
    but it should be good enough.
    
    * ggml-quants : allow using ARM dot product instructions for TQ1_0
    
    * ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support
    
    * ggml : remove unused ggml_mul special case
    
    It would otherwise conflict with the more general
    optimization coming with Mamba-2.
    
    * ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators
    
    * test-backend-ops : add TQ1_0 and TQ2_0 comments for later
    
    Not yet adding uncommented, because some backends like SYCL and Metal
    do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT.
    (and Metal also doesn't handle it with GGML_OP_GET_ROWS)
    Support for TQ1_0 and TQ2_0 for other backends than CPU
    will be added in follow-up pull requests.
    compilade authored Sep 6, 2024 (9bc6db2)
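The bits-per-weight figures quoted above follow from simple block arithmetic, assuming 256-weight blocks with an f16 scale: TQ2_0 stores four ternary values per byte (2 bits each), so (256/4 data bytes + 2 scale bytes) × 8 / 256 = 2.0625 bpw; TQ1_0 packs five trits per byte in base 3 (3^5 = 243 ≤ 256), needing 52 data bytes + 2 scale bytes = 54 bytes per block, i.e. 1.6875 bpw. A hedged sketch of the 2-bit variant, illustrative only and not ggml's actual TQ2_0 layout:

```cpp
// Illustrative sketch of 2-bit ternary packing (NOT ggml's exact TQ2_0
// layout). Ternary weights in {-1, 0, +1} are biased to {0, 1, 2} and
// four of them are packed per byte.
#include <array>
#include <cstdint>

constexpr int BLOCK = 256; // assumed block size, matching the bpw arithmetic

std::array<uint8_t, BLOCK/4> pack_ternary_2bit(const int8_t (&w)[BLOCK]) {
    std::array<uint8_t, BLOCK/4> out{};
    for (int i = 0; i < BLOCK; ++i) {
        const uint8_t t = uint8_t(w[i] + 1);       // {-1,0,1} -> {0,1,2}
        out[i / 4] |= uint8_t(t << (2 * (i % 4))); // 4 trits per byte
    }
    return out;
}

int unpack_ternary_2bit(const std::array<uint8_t, BLOCK/4> & q, int i) {
    return int((q[i / 4] >> (2 * (i % 4))) & 3) - 1; // back to {-1,0,1}
}
```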
  2. Improve Vulkan shader build system (ggerganov#9239)

    * Improve Vulkan shader builds system
    
    - Add dependency to vulkan-shaders-gen to rebuild shaders when changing the shader compilation utility.
    - Add option to generate debug info for Vulkan shaders to provide shader source to Vulkan shader profiling tools
    
    * remove not required self dependency
    mtavenrath authored Sep 6, 2024 (8ebe8dd)
  3. 4a1411b
  4. ggml : fix build break for the vulkan-debug (ggerganov#9265)

    - windows build : Ok.
    - linux build : Ok.
    
    Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
    cyzero-kim authored Sep 6, 2024 (409dc4f)
  5. batched-bench : add --output-format jsonl option (ggerganov#9293)

    `--output-format` is modeled after `llama-bench`'s options
    akx authored Sep 6, 2024 (815b1fb)
  6. llama-bench : log benchmark progress (ggerganov#9287)

    * llama-bench : add optional progress messages
    akx authored Sep 6, 2024 (134bc38)
  7. server : simplify state machine for slot (ggerganov#9283)

    * server : simplify state machine for slot
    
    * add SLOT_STATE_DONE_PROMPT
    
    * pop_deferred_task
    
    * add missing notify_one
    
    * fix passkey test
    
    * metrics : add n_busy_slots_per_decode
    
    * fix test step
    
    * add test
    
    * maybe fix AddressSanitizer?
    
    * fix deque ?
    
    * missing lock
    
    * pop_deferred_task: also notify
    
    * Update examples/server/server.cpp
    
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    
    ---------
    
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    ngxson and ggerganov authored Sep 6, 2024 (9b2c24c)
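The concurrency fixes listed above (a missing lock, pop_deferred_task also notifying, switching to a deque) fit a familiar pattern: a queue guarded by a mutex and condition variable where both producers and consumers signal waiters. A minimal sketch of that pattern, with assumed names rather than server.cpp's actual code:

```cpp
// Hedged sketch of a lock-guarded deferred-task queue, following the shape
// suggested by the commit message. Names are illustrative assumptions.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

struct task_queue {
    std::mutex mtx;
    std::condition_variable cv;
    std::deque<std::function<void()>> deferred;

    void push_deferred(std::function<void()> t) {
        {
            std::lock_guard<std::mutex> lock(mtx); // every access holds the lock
            deferred.push_back(std::move(t));
        }
        cv.notify_one();
    }

    bool pop_deferred(std::function<void()> & t) {
        bool popped = false;
        {
            std::lock_guard<std::mutex> lock(mtx);
            if (!deferred.empty()) {
                t = std::move(deferred.front());
                deferred.pop_front();
                popped = true;
            }
        }
        cv.notify_one(); // "pop_deferred_task: also notify"
        return popped;
    }
};
```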

Commits on Sep 7, 2024

  1. 6c89eb0
  2. ggml : fix missing cpu_set_t on emscripten (ggerganov#9336)

    * ggml : fix missing cpu_set_t on emscripten
    
    * better version
    
    * bring back android part
    ngxson authored Sep 7, 2024 (947538a)
  3. llama : refactor sampling v2 (ggerganov#9294)

    - Add `struct llama_sampler` and `struct llama_sampler_i`
    - Add `llama_sampler_` API
    - Add `llama_sampler_chain_` API for chaining multiple samplers
    - Remove `LLAMA_API_INTERNAL`
    - Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`
    ggerganov authored Sep 7, 2024 (df270ef)
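A rough sketch of how the new chained sampler API composes, assuming the llama.h signatures introduced by this refactor (they may have evolved since):

```cpp
// Hedged sketch: build a sampler chain (top-k -> temperature -> dist),
// using the llama_sampler_chain_* API described in this commit.
#include "llama.h"

llama_sampler * make_sampler() {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
    return chain;
}

// Per decode step: llama_token tok = llama_sampler_sample(chain, ctx, -1);
// When done:       llama_sampler_free(chain);
```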
  4. e32d081
  5. common : refactor arg parser (ggerganov#9308)

    * (wip) argparser v3
    
    * migrated
    
    * add test
    
    * handle env
    
    * fix linux build
    
    * add export-docs example
    
    * fix build (2)
    
    * skip build test-arg-parser on windows
    
    * update server docs
    
    * bring back missing --alias
    
    * bring back --n-predict
    
    * clarify test-arg-parser
    
    * small correction
    
    * add comments
    
    * fix args with 2 values
    
    * refine example-specific args
    
    * no more lambda capture
    
    Co-authored-by: slaren <slaren@users.noreply.github.com>
    
    * params.sparams
    
    * optimize more
    
    * export-docs --> gen-docs
    ngxson authored Sep 7, 2024 (1b9ae51)
  6. e536426
  7. llama : sanitize invalid tokens (ggerganov#9357)

    * common : do not add null tokens during warmup
    
    ggml-ci
    
    * llama : check that the input tokens are valid
    
    ggml-ci
    
    * tests : fix batch size of bert model
    
    ggml-ci
    ggerganov authored Sep 7, 2024 (faf69d4)
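The validity check described here amounts to rejecting token ids outside the vocabulary before they reach the model. A hedged sketch of that idea, not the literal llama.cpp implementation:

```cpp
// Illustration of input-token sanitization: any id outside [0, n_vocab)
// is rejected before decoding. Function name and shape are assumptions.
#include <cstdint>
#include <vector>

typedef int32_t llama_token; // stand-in for llama.h's typedef

bool batch_tokens_valid(const std::vector<llama_token> & tokens, int32_t n_vocab) {
    for (const llama_token t : tokens) {
        if (t < 0 || t >= n_vocab) {
            return false; // invalid token: fail before it reaches the graph
        }
    }
    return true;
}
```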
  8. f12295b
  9. a5b5d9a

Commits on Sep 8, 2024

  1. fbb7fcf
  2. ggml : fix cont with transposed tensors when one dimension is 1 (ggml/934)
    
    * ggml_cont: fix issue with transposed tensors when one dimension is 1
    
    When using multiple threads, it is not enough to check that the tensors
    are contiguous for ggml_compute_forward_dup_same_cont to work correctly.
    The tensors' strides also need to match.
    
    Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
    
    * Add ggml_cont tests
    
    Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
    
    * Remove dead code
    
    It isn't possible to reach this code because all these functions are
    invoked by ggml_compute_forward_dup if and only if src0->type != dst->type.
    
    Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
    
    * Make ggml_compute_forward_dup_same_cont work with contiguous tensors
    
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
    
    ---------
    
    Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    smeso and ggerganov committed Sep 8, 2024 (efe6a83)
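The subtlety this commit fixes: when one dimension is 1, a transposed view covers the same bytes in the same order as the original tensor, yet its stride array differs. A fast path keyed only on "both tensors are contiguous" can therefore pair tensors whose layouts disagree once work is split across threads. A simplified illustration of the stride bookkeeping, not the actual ggml code:

```cpp
// Simplified 2-D version of ggml's ne (shape) / nb (byte stride) arrays.
// A (4, 1) tensor and its transposed (1, 4) view address the same 16 bytes
// in the same order, but their stride arrays differ, so a same-layout fast
// path must also check that the strides of src and dst match.
#include <cstdio>

int main() {
    const int esz = 4;                 // element size (e.g. float)

    // contiguous tensor: ne = {4, 1}
    int ne[2] = {4, 1};
    int nb[2] = {esz, esz * ne[0]};    // nb = {4, 16}

    // transposed view: swap ne and nb, as ggml_transpose does
    int ne_t[2] = {ne[1], ne[0]};      // {1, 4}
    int nb_t[2] = {nb[1], nb[0]};      // {16, 4}

    printf("ne   = {%d, %d}, nb   = {%d, %d}\n", ne[0],   ne[1],   nb[0],   nb[1]);
    printf("ne_t = {%d, %d}, nb_t = {%d, %d}\n", ne_t[0], ne_t[1], nb_t[0], nb_t[1]);
    return 0;
}
```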
  3. 51d964a
  4. cann : add Ascend NPU support (whisper/2336)

    * enable Ascend NPU in src/whisper.cpp
    * sync test-backend-ops with llama.cpp
    MengqingCao authored and ggerganov committed Sep 8, 2024 (d2d3200)
  5. cann : fix doxy (ggml/0)

    ggerganov committed Sep 8, 2024 (ba1cf84)
  6. dbbebca
  7. tests: add gradient tests for all backends (ggml/932)

    * tests: add gradient checking to test-backend-ops
    
    * remove old comment
    
    * reorder includes
    
    * adjust SIN/COS parameters
    
    * add documentation, use supports_op if possible
    JohannesGaessler authored and ggerganov committed Sep 8, 2024 (202084d)
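Gradient checking, as added by these tests, compares the backward pass against a numerical derivative. The core idea in one scalar example (test-backend-ops applies the same comparison per tensor element; this is a generic sketch, not the test's code):

```cpp
// Gradient check by central finite difference: f'(x) ~ (f(x+h) - f(x-h)) / 2h.
#include <cmath>
#include <cstdio>

int main() {
    auto f  = [](double x) { return std::sin(x); };
    auto df = [](double x) { return std::cos(x); }; // analytic gradient

    const double x = 0.7, h = 1e-5;
    const double num = (f(x + h) - f(x - h)) / (2 * h); // central difference

    printf("analytic = %.8f\n", df(x));
    printf("numeric  = %.8f\n", num);
    printf("abs err  = %.3e\n", std::fabs(df(x) - num));
    return 0;
}
```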
  8. vulkan: correctly report support for OP_CONT (ggml/946)

    test-backend-ops fails because ggml_cont aborts when invoked with an
    unsupported type.
    
    This commit makes the ggml_cont tests pass.
    
    Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
    smeso authored and ggerganov committed Sep 8, 2024 (9cb9260)
  9. vulkan: add dryrun support to sin and cos ops (ggml/947)

    sin and cos failed test-backend-ops because they
    tried to dereference a context pointer that is null
    on dry runs.
    
    This commit prevents that segfault.
    
    Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
    smeso authored and ggerganov committed Sep 8, 2024 (406c1a3)
  10. 60a3107
  11. sync : ggml

    ggerganov committed Sep 8, 2024 (385decb)
  12. a876861
  13. d11bd3b