Coding style guide (#95)
Co-authored-by: Torsten Scholak <torsten.scholak@googlemail.com>
jlamypoirier and tscholak authored Dec 19, 2024
1 parent f4a0cdd commit 7d9d998
Showing 17 changed files with 192 additions and 45 deletions.
10 changes: 5 additions & 5 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -9,25 +9,25 @@ assignees: ''

# 🧐 Problem Description

Is your feature request related to a specific problem? Please describe it clearly.
For example: "I'm always frustrated when [...]"

# 💡 Proposed Solution

Describe the solution you would like to see.
Be as specific as possible about how it would work or be implemented.

# 🔄 Alternatives Considered

Have you considered any alternative solutions or approaches?
If so, please describe them and explain why they might not be ideal.

# 📈 Potential Benefits

Explain how this feature could benefit Fast-LLM users.
Consider how it might improve performance, usability, scalability, etc.

# 📝 Additional Context

Add any other context or information that could help us understand the feature request better.
If applicable, provide links to relevant references or examples.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -1,6 +1,6 @@
# ✨ Description

Please provide a brief summary of the changes, relevant motivation, and context.
Include any related issue numbers or links to discussions, and explain why this change is necessary.

Closes # <!-- Insert issue number here, if applicable -->
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -19,7 +19,7 @@ jobs:
steps:
- name: Checkout repository
uses: actions/checkout@v4

- uses: actions/setup-python@v5
with:
python-version: '3.12'
4 changes: 2 additions & 2 deletions .github/workflows/docs.yaml
@@ -21,7 +21,7 @@ jobs:
with:
python-version: "3.12"
cache: "pip"
- run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
- uses: actions/cache@v4
with:
key: mkdocs-material-${{ env.cache_id }}
@@ -46,7 +46,7 @@ jobs:
with:
python-version: "3.12"
cache: "pip"
- run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
- uses: actions/cache@v4
with:
key: mkdocs-material-${{ env.cache_id }}
16 changes: 13 additions & 3 deletions .pre-commit-config.yaml
@@ -2,7 +2,7 @@
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.6.0
rev: v5.0.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
@@ -11,7 +11,7 @@ repos:
- --unsafe
- id: check-added-large-files
- repo: https://github.com/asottile/pyupgrade
rev: v3.17.0
rev: v3.19.1
hooks:
- id: pyupgrade
args:
@@ -42,9 +42,19 @@ repos:
name: isort (pyi)
types: [pyi]
- repo: https://github.com/psf/black
rev: 24.8.0
rev: 24.10.0
hooks:
- id: black
args:
- "--config"
- "./pyproject.toml"
- repo: https://github.com/DavidAnson/markdownlint-cli2
rev: v0.16.0
hooks:
- id: markdownlint-cli2
name: markdownlint
files: "docs/"
entry: markdownlint-cli2
args: ["--fix"]
language: node
types: [markdown]
17 changes: 12 additions & 5 deletions docs/.markdownlint.yaml
@@ -20,13 +20,20 @@ MD010:
# MD013/line-length : Line length : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md013.md
MD013: false

# Temporarily disabled because not automatically fixed.
# MD030/list-marker-space : Spaces after list markers : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md030.md
MD030:
MD030: false
# Spaces for single-line unordered list items
ul_single: 3
# ul_single: 3
# Spaces for single-line ordered list items
ol_single: 2
# ol_single: 2
# Spaces for multi-line unordered list items
ul_multi: 3
# ul_multi: 3
# Spaces for multi-line ordered list items
ol_multi: 2
# ol_multi: 2

# Code block style (disable because of interactions with mkdocs note blocks)
MD046: false

# Link and image reference definitions (disable because of interactions with mkdocs footnotes)
MD053: false
2 changes: 1 addition & 1 deletion docs/about-us.md
@@ -27,7 +27,7 @@ At Fast-LLM, we adhere to a set of guiding principles that define our approach:

Fast-LLM is led by the Foundation Models Lab at [ServiceNow Research](https://www.servicenow.com/research/), with development driven by a dedicated group of professionals who bring extensive expertise in AI, machine learning, and distributed systems. While the project direction is guided by the Foundation Models Lab, contributions come from a growing network of researchers, developers, and industry experts worldwide. Here are some of the key members leading the project:

- [**Joel Lamy Poirier**](https://www.servicenow.com/research/author/joel-lamy-poirier.html) - Lead Developer and maintainer, ServiceNow Research: Joel spearheads the core development, ensuring that Fast-LLM delivers on its promise of speed and scalability.
- [**Joel Lamy Poirier**](<https://www.servicenow.com/research/author/joel-lamy-poirier.html>) - Lead Developer and maintainer, ServiceNow Research: Joel spearheads the core development, ensuring that Fast-LLM delivers on its promise of speed and scalability.
- [**Sean Hughes**](https://www.servicenow.com/research/author/sean-hughes.html) - Ecosystem Director, ServiceNow Research: Sean focuses on building partnerships and open scientific collaborations to advance Fast-LLM's capabilities and reach.
- [**Torsten Scholak**](https://www.servicenow.com/research/author/torsten-scholak.html) - Research Lead, ServiceNow Research: Torsten leads our research efforts, driving the scientific innovations that keep Fast-LLM at the forefront of AI training.

2 changes: 1 addition & 1 deletion docs/assets/images/logo.svg
131 changes: 129 additions & 2 deletions docs/developers/style-guide.md
@@ -2,6 +2,133 @@
title: Style Guide
---

!!! warning
This section collects general coding style guidelines used in Fast-LLM.
Following these will ensure a swift reviewing process and will help maintain consistency and readability.
Note that while we try to enforce these principles,
exceptions may be allowed on a case-by-case basis, for example if they noticeably improve readability.

This section is work in progress. Please check back soon for the updated content.
As a general principle, **Fast-LLM prioritizes code readability and maintainability** over conciseness,
coding speed, or individual programmers' preferences.
Most of the style choices below are based on this principle.

## 🎯 Basic Style

Unless otherwise specified, Fast-LLM follows the [PEP 8](https://peps.python.org/pep-0008/) coding style.
This style (and many other conventions) is enforced with automatic formatting through a [pre-commit](https://pre-commit.com/) git hook.

Please make sure these git hooks are installed by running

```bash
pip install pre-commit
pre-commit install
```

!!! note "More on automated formatting"
Fast-LLM's automated formatting includes [Black](https://black.readthedocs.io/en/stable/),
[isort](https://pycqa.github.io/isort/), [autoflake](https://github.com/PyCQA/autoflake), and a few other packages.
See Fast-LLM's [pre-commit configuration](https://github.com/ServiceNow/Fast-LLM/blob/main/.pre-commit-config.yaml) for more details.

## 📚 Naming Conventions

In addition to PEP 8, we use the following naming conventions for python identifiers (classes, variables, methods, modules, etc.),
file names and configuration parameters.
For example:

* Use meaningful, self-descriptive identifier names (ex. `x -> loss`).
Abstract variable names such as `x` are however OK for generic methods where more descriptive names aren't appropriate (ex. `add(x, y)`).
* Please avoid abbreviations, especially domain-specific ones.
This gives everyone a chance to understand the code, regardless of their prior knowledge. Ex. `bs -> batch_size`.
* Try to keep names concise, for example by eliminating redundancies
and avoiding data type qualifiers such as `num` (covered by the type hint).
This is especially important for configuration parameters as the fully qualified names can get very long.
For example, `transformer.num_transformers_heads` can be simplified to `transformer.heads` without sacrificing clarity.

Note that these conventions are especially important on user-facing names which are more difficult to change,
for example configuration parameters and the public interface of core classes and modules.
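As a sketch of these conventions (the function names here are hypothetical, not actual Fast-LLM identifiers):

```python
# Discouraged: abbreviations and data type qualifiers obscure the meaning.
def get_num_tok_per_bs(bs, seq_len):
    return bs * seq_len


# Preferred: self-descriptive names without redundant qualifiers.
def get_tokens_per_batch(batch_size: int, sequence_length: int) -> int:
    return batch_size * sequence_length


# Abstract names such as `x` and `y` remain fine for generic helpers.
def add(x: float, y: float) -> float:
    return x + y
```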

!!! note "Why this matters"
Using explicit, self-explanatory names gives other users a better chance to understand the code,
regardless of their prior knowledge, which facilitates collaboration and maintenance.
Our conventions follow this principle, while attempting to avoid excessively long names.

## 🛬 Imports

We use the following conventions for imports (other than those enforced by isort):

* Import standard library and third party modules by module (ex. `import package.module`, not `from package.module import method`).
In addition to keeping the code consistent, this keeps each identifier's origin explicit so anyone can tell where it came from with just a quick glance at the code.
* Avoid renaming with `as`, except for some (arbitrarily chosen) common ones: `numpy as np`, `triton.language as tl`.
* Import first-party modules through specific identifiers (ex. `from fast_llm.module import method`, not `import fast_llm.module`). This keeps Fast-LLM identifiers to a manageable length and makes it easier to track what is used in a given file.
* Always use absolute imports (ex. no `from .module import method`).
* Include all explicitly imported third-party modules in `setup.cfg`.
Only add new requirements if they provide a substantial benefit,
as we try to keep the requirements to a minimum.
* Prefer file-level imports over imports inside methods, unless they significantly slow down the import process
or concern an optional dependency that should not be absolutely required to import the module (ex. `transformers`).
If an offending import is only required for a type hint, include it in a `if typing.TYPE_CHECKING:` block.
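A sketch of these conventions in a hypothetical module (the first-party import is shown as a comment so the sketch stays self-contained):

```python
import pathlib
import typing

# Standard library and third-party modules are imported by module, so the origin of
# each identifier stays explicit at the point of use (ex. `pathlib.Path`).
# A first-party import would instead name specific identifiers with an absolute path,
# ex. `from fast_llm.utils import Assert`.

if typing.TYPE_CHECKING:
    # Imports needed only for type hints, ex. heavy or optional dependencies.
    import argparse


def config_path(directory: str, name: str) -> pathlib.Path:
    # The qualified call site makes it obvious where `Path` comes from.
    return pathlib.Path(directory) / f"{name}.yaml"
```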

!!! note "Why this matters"
Most python conventions make no clear recommendation concerning imports,
which can easily lead to inconsistent import formats across a repo, and can make it harder to understand.
Our conventions aim to avoid these arbitrary choices by providing an explicit prescription,
which should be good enough nearly everywhere. Our choice is justified as follows:

* For third-party and standard library packages, fully qualified identifiers are typically relatively short,
so it makes sense to keep them.
This also keeps each identifier's origin explicit so anyone can tell where it came from with just a quick glance at the code.
This is especially useful for identifiers with an otherwise ambiguous source (ex. `float32` may come from torch, numpy, triton, etc.; Fast-LLM's configuration scheme has many identifiers in common with `dataclasses`, `omegaconf` and `pydantic`).
* For first-party packages, fully qualified names are generally too long to use in code,
since they include the entire directory structure of the Fast-LLM repository,
so first-party identifiers need to be imported by name.
There should be very little ambiguity, because name clashes are uncommon within Fast-LLM,
and external identifiers are already clearly marked as such.

!!! warning "Configuration modules"
Fast-LLM supports instantiation and validation of configurations with a barebone installation.
Because of this, modules that contain configuration classes (usually named `config.py`)
should not include any top-level third-party imports (except for those installed in the [barebone install](https://github.com/ServiceNow/Fast-LLM/blob/main/setup.cfg)),
and the same applies to configuration initialization and validation methods.
Third-party imports may be included in other methods if needed.
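A minimal sketch of how a hypothetical `config.py` might follow this rule (the class and method names are illustrative, not Fast-LLM's actual API):

```python
import dataclasses
import typing

if typing.TYPE_CHECKING:
    # Needed only for type hints, so a barebone install never imports it at runtime.
    import torch


@dataclasses.dataclass
class OptimizerConfig:
    learning_rate: float = 1e-3

    def validate(self) -> None:
        # Validation relies on the standard library only.
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")

    def get_dtype(self) -> "torch.dtype":
        # Third-party imports are deferred to the methods that actually need them.
        import torch

        return torch.float32
```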

## 🔓 Public and Private Interface

We use the following conventions for class and module interfaces:

* Mark private and protected variables with an underscore `_` prefix.
As is customary in python, we make no distinction between the two and avoid the double-underscore `__` notation.
* Keep public interfaces (methods and variables without underscore prefix) as lean as possible,
i.e. mark everything as private/protected unless there is a clear need to make it public.
We can always add to the public interface later, but removing from it is difficult.
* Use accessors sparingly through the `@property` decorator or equivalent,
usually to define read-only public variables.
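For instance (a generic sketch, not an actual Fast-LLM class):

```python
class Trainer:
    """Public interface: `run` and the read-only `completed_steps`."""

    def __init__(self, steps: int):
        # Protected variables can be renamed or removed later without breaking users.
        self._steps = steps
        self._completed_steps = 0

    @property
    def completed_steps(self) -> int:
        # Accessor defining a read-only public variable.
        return self._completed_steps

    def run(self) -> None:
        while self._completed_steps < self._steps:
            self._run_step()

    def _run_step(self) -> None:
        # Protected helper: not part of the stable interface.
        self._completed_steps += 1
```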

!!! note "Why this matters"
Although good practices of object-oriented programming are generally ignored in python,
Fast-LLM attempts to follow them to an extent, while avoiding unnecessary bloat.
Public interfaces are expected to be stable,
which makes further modifications difficult, as they could break external code.
On the other hand, private interfaces are freely modifiable,
which provides more freedom for fixes, improvements, refactoring, etc.
Therefore, having lean public interfaces is critical for us to keep maintaining and improving Fast-LLM.

## 💡 Type Hints

Fast-LLM uses type hints for several reasons, including code readability, type checking in IDEs,
and type validation for configurations:

* Always use type hints for the public interface of classes and modules.
Type hints for method outputs may be omitted if they can be trivially inferred,
ex. if they return the input, an explicitly typed variable or nothing.
* Prefer using type hints in private interfaces, especially if it improves readability and/or static type checking.
* Prefer newer type hint formats over older ones, ex. `typing.List -> list`, `typing.Union[A, B] -> A | B`.

!!! note "Why this matters"
We use type hints for various reasons. In addition to making the code more understandable,
they are used by IDEs such as VS Code or PyCharm to perform static type checking,
which speeds up development and is essential to keeping the code bug-free.

## 🗑️ Misc

* Please add descriptions and comments as needed, especially for parts that would otherwise be difficult to understand.
* Please favor `pathlib` over `os.path` for file path operations because it offers a cleaner and more modern API.
* We encourage the use of modern python features when beneficial, up to the minimum python version (3.12).
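For example, the same path built both ways (file names are illustrative):

```python
import os.path
import pathlib

# os.path: string manipulation.
legacy = os.path.join("checkpoints", "run_1", "model.bin")

# pathlib: an object-oriented API with the overloaded `/` operator.
modern = pathlib.Path("checkpoints") / "run_1" / "model.bin"

# Both produce the same path, but `pathlib` also offers `.suffix`, `.parent`, etc.
assert str(modern) == legacy
```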
4 changes: 2 additions & 2 deletions docs/help.md
@@ -17,7 +17,7 @@ Let's stay one step ahead of those pesky gotchas. Here's a list of common issues
```bash
RuntimeError: Desync detected for barrier train begin (66830148464 != 133042721120)
```

points to a hashing inconsistency. To fix it, set `PYTHONHASHSEED=0` in your environment variables. This ensures that Python's hash seed is consistent across all processes. If these processes have different hash seeds, they'll generate different hash values, leading to desynchronization, as seen in the error message.
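For example, before launching the training processes (a minimal sketch; adapt to your launcher):

```bash
# Give every worker process the same hash seed to avoid desyncs.
export PYTHONHASHSEED=0
```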

- **`torchrun` Timeout Errors**: If you see timeout errors related to `torchrun` during rendezvous, it could be DNS resolution or a networking issue. Check that all worker nodes are communicating properly with the master node.
@@ -27,7 +27,7 @@ Let's stay one step ahead of those pesky gotchas. Here's a list of common issues
```bash
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=408951, OpType=_ALLGATHER_BASE, … , Timeout(ms)=600000) ran for 600351 milliseconds before timing out
```

appearing across all GPU workers, it usually means one or more hosts failed to complete a NCCL operation, causing others to block. NCCL errors can be frustrating to diagnose since they rarely specify which node or GPU caused the issue. It is difficult to surface which messages and operations are in progress during these crashes. If the issue happens at a specific moment of training like dataset preparation or model export, the issue might be that this specific procedure took too long and timed out other processes (e.g. when preparing large datasets for long training runs, or saving large models on slow storage). In this case, it can help to increase the timeout `distributed_timeout: 3600`.
In some other cases, the best we can do is to restart the training job and hope it doesn't happen again. If the issue persists, it might be because of network congestion or a problematic GPU. If the worker that crashed is consistent across multiple runs, it's likely a hardware issue. If you can't resolve it, open an issue on GitHub, and we'll help you troubleshoot.

4 changes: 2 additions & 2 deletions docs/quick-start.md
@@ -45,8 +45,8 @@ Now, select the compute environment that matches your setup or preferred workflo

If you prefer not to use the prebuilt Docker image or already have an environment you'd like to use (e.g., a custom Docker image, virtual environment, or bare-metal setup), follow these steps to install the necessary software and dependencies:

1. **Ensure Python 3.10**:
Install Python 3.10 (or later) if it's not already available on your system. For a Python virtual environment, run:
1. **Ensure Python 3.12**:
Install Python 3.12 (or later) if it's not already available on your system. For a Python virtual environment, run:

```bash
python3.12 -m venv ./fast-llm-tutorial/venv
4 changes: 2 additions & 2 deletions docs/recipes/data-preparation.md
@@ -35,7 +35,7 @@ Let's create a folder called `hf_dataset` and download the-stack dataset from hu

```bash
mkdir ./prep-stack-tutorial/hf_dataset
while ! huggingface-cli download bigcode/the-stack --revision v1.2 --repo-type dataset --max_workers 64 --local-dir ./prep-stack-tutorial/hf_dataset;
do sleep 1; done
```

@@ -60,7 +60,7 @@ And then download the tokenizer with this Python script:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Base-2407"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained("./prep-stack-tutorial/checkpoints/Mistral-Nemo-Base-2407")
```

2 changes: 1 addition & 1 deletion fast_llm/data/gpt/memmap.py
@@ -7,7 +7,7 @@
from fast_llm.data.gpt.dataset import GPTIndexedDataset
from fast_llm.data.preparator.gpt_memmap.config import MEMMAP_DTYPES, MEMMAP_DTYPES_INV, MEMMAP_INDEX_HEADER
from fast_llm.engine.config_utils.data_type import DataType
from fast_llm.utils import Assert, div, padded_cumsum
from fast_llm.utils import Assert, div


class GPTMemmapDataset(GPTIndexedDataset):
1 change: 0 additions & 1 deletion fast_llm/data/preparator/gpt_memmap/prepare.py
@@ -1,7 +1,6 @@
import json
import multiprocessing
import pathlib
import shutil

import datasets
import numpy as np
4 changes: 3 additions & 1 deletion fast_llm/engine/config_utils/run.py
@@ -248,7 +248,9 @@ def is_main_rank():
return DistributedConfig.default_rank == _MAIN_RANK


def log_main_rank(*message, log_fn: typing.Union[type[BaseException], typing.Callable] = logger.info, join: str = ", "):
def log_main_rank(
*message, log_fn: typing.Union[type[BaseException], typing.Callable] = logger.info, join: str = ", "
):
if is_main_rank():
log(*message, log_fn=log_fn, join=join)
