
Rethinking the RoughPy algebra module #63

inakleinbottle commented Jan 22, 2024

The algebra module of RoughPy is built around the assumption that the algebra types will be provided by an external library like libalgebra-lite (the current default) or libalgebra. However, this is unsatisfactory for a number of reasons:

  1. Polymorphic hierarchies built around templated wrappers and interfaces are very hard to understand. Moreover, adding a new vector/algebra type will always require writing C++ source code for the interface and wrapper types, on top of the library implementation of the vector type itself.
  2. This system does not allow us to build on top of the RoughPy scalar system, which also plugs into the multi-device support and the kernel-based computation it requires. Porting this functionality into the provider libraries (such as libalgebra-lite) would be problematic, given the amount of additional work involved in supporting these features.
  3. Provider libraries represent a duplication of work and, perhaps more importantly, of dependencies. For instance, both libalgebra-lite and RoughPy require the GMP libraries, which means that both the libalgebra-lite build system and the RoughPy build system have to locate the installed version of GMP. Unfortunately, this is rather tricky because we're using vcpkg for dependency management. This is certainly causing problems on some platforms, and will likely continue to do so. Moreover, it is completely unnecessary, since the scalar system in RoughPy is far more flexible than the coefficient system of libalgebra (lite).
  4. Templates are expensive at compile time, which makes building RoughPy in a CI/CD pipeline problematic.
  5. Maintaining two libraries is more work than maintaining one.

I'm planning to shift RoughPy to its own vector/algebra hierarchy built around its scalar system and a new basis/multiplication system designed to be modular, efficient, and (most importantly) simple. This will also allow us to build up from the device kernel system that is built into the scalar system, and will remove practically all templates from the algebra module. Of course, this needs to be properly thought through, so I've opened this issue to keep track of my reasoning and designs, and to invite comments from other contributors/users who might have feelings about this issue.

Vector design

In libalgebra (lite), a vector is a class template defined (essentially) as follows:

template <typename Basis, typename Coefficients, typename VectorType>
class Vector {
    ...
};

This means that vectors with different coefficients and different bases have different types. This is desirable in C++, but it causes a real headache at the Python level. Whether or not it is mathematically sensible, we want (indeed need) to be able to add vectors with different scalar types to one another whenever addition is well-defined, provided their bases are compatible. Constructing the more esoteric scalar types in Python is a rather painful task, especially when the scalars in question are just integers. The scalar system already handles type-agnostic storage and conversion in an easily extensible way, and this should be completely sufficient for a vector type.
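To make the headache concrete, here is a minimal self-contained illustration (all type names invented, not the real libalgebra-lite types) of how distinct instantiations block mixed arithmetic:

struct TensorBasis {};
struct DoubleCoefficients {};
struct RationalCoefficients {};
struct DenseStorage {};

template <typename Basis, typename Coefficients, typename VectorType>
class Vector {};

int main()
{
    Vector<TensorBasis, DoubleCoefficients, DenseStorage> a;
    Vector<TensorBasis, RationalCoefficients, DenseStorage> b;
    // auto c = a + b;  // would not compile: no operator+ between
    //                  // unrelated instantiations, despite equal bases
    (void) a; (void) b;
}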

The second part is the bases. These are actually the most important things imported from the external libraries. However, we already have to create a wrapper class that gives each basis a consistent interface, which at best duplicates effort. Porting the tensor basis and Lie basis into RoughPy makes far more sense than relying on libalgebra-lite purely for the bases.

The final consideration is the vector type, which is actually one of the more complex problems. It should not matter how the internal data of a vector is represented, only that the bases are the same (and the coefficients are compatible). Unfortunately, this is rather tricky with the current system. Using the scalar system, we can handle both sparse and dense representations at once with the KeyScalarArray container, which wraps an array of scalars together with an optional array of basis keys.

On the topic of basis keys, each of the libraries keeps its keys in a different form. I tried to homogenize these in RoughPy, but that solution was ultimately unsatisfactory. Reworking the algebras and bases allows me to rethink the keys at the same time.

The new basic vector type should now look like this:

class Vector {
    KeyScalarArray m_data;
    Basis m_basis;
    
    ...
};

Note that KeyScalarArray carries the scalar type information along with the data. A vector is sparse if m_data.has_keys() is true, and dense otherwise. We'll have to rework KeyScalarArray slightly to accommodate my plans for basis keys, but otherwise the changes are minimal. We may still need to include a Context pointer in the vector: the context is currently used for many things, although with the proposed changes it may no longer be necessary. Contexts formerly collected together the algebra types of the same configuration (e.g. width 5, depth 2, floating-point coefficients from libalgebra-lite). With the new setup, however, there won't be any need for an environment that understands the real type of the algebras (rather than their interface). Contexts will still be useful, just not as essential.
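The representation query then becomes trivial; a minimal sketch, with stubbed stand-ins for the real KeyScalarArray and Basis types:

struct KeyScalarArray {
    bool has_keys() const noexcept { return m_has_keys; }
    bool m_has_keys = false;
};
struct Basis {};

class Vector {
    KeyScalarArray m_data;
    Basis m_basis;

public:
    // The representation is determined entirely by the presence of keys.
    bool is_sparse() const noexcept { return m_data.has_keys(); }
    bool is_dense() const noexcept { return !is_sparse(); }
};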

Key design

This shift to a more generic structure also means a fairly substantial change to the way basis keys are interpreted. All bases will, fundamentally, need to use the same key type. However, this doesn't mean all keys have to be identical. A key can be a pointer-like object with two modes: index mode or pointer mode. In index mode, the content of the key is simply its index in the basis total order (should the basis have one). In pointer mode, the key is actually a pointer to some other data that defines the key, and it is up to the basis to interpret this form. Moreover, using bit-packing we can fit this neatly inside a pointer-sized type by stuffing the mode indicator bit into the low bits of the pointer, which are always zero for sufficiently aligned types. (We use the same technique to pack the scalar storage mode in scalars/packed_scalar_type_pointer.h.) Going one step further, we can pack a degree value alongside the index, using one byte of the 4/8 available (or less, perhaps, depending on the size of a pointer).
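As a concrete sketch of the packing, something like the following might work; the field layout (1 mode bit, 8 degree bits, index in the remaining bits) is an illustrative choice, not a settled design:

#include <cassert>
#include <cstdint>

// Sketch of a packed basis key. Pointer mode relies on pointers being
// at least 2-byte aligned, so the low bit is free for a mode flag.
class PackedKey {
    std::uintptr_t m_bits;

    static constexpr std::uintptr_t mode_bit = 1;

public:
    // Index mode: pack a degree byte alongside the index.
    PackedKey(std::uint64_t index, std::uint8_t degree)
        : m_bits((static_cast<std::uintptr_t>(index) << 9U)
                 | (static_cast<std::uintptr_t>(degree) << 1U))
    {}

    // Pointer mode: store the pointer and set the mode bit.
    explicit PackedKey(const void* key_data)
        : m_bits(reinterpret_cast<std::uintptr_t>(key_data) | mode_bit)
    {
        // Relies on alignment leaving the low bit clear.
        assert((reinterpret_cast<std::uintptr_t>(key_data) & mode_bit) == 0);
    }

    bool is_pointer() const noexcept { return (m_bits & mode_bit) != 0; }

    std::uint64_t index() const noexcept { return m_bits >> 9U; }
    std::uint8_t degree() const noexcept
    {
        return static_cast<std::uint8_t>((m_bits >> 1U) & 0xFFU);
    }

    const void* pointer() const noexcept
    {
        return reinterpret_cast<const void*>(m_bits & ~mode_bit);
    }
};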

The pointer should point to a key interface class that minimally provides methods to retrieve the owning basis, determine the key's type, compare two keys, and compute a hash value:

enum class BasisKeyType {
    Simple = 0,
    Ordered = 1,
    Graded = 2,
    WordLike = 3,
};
class BasisKey {
public:
    virtual Basis basis() const noexcept = 0;
    virtual BasisKeyType type() const noexcept; // default Simple
    virtual bool compare_equal(const BasisKey& other) const noexcept = 0;
    virtual hash_t hash_value() const noexcept = 0;   // if ordered, must agree with index
};

We can build a hierarchy on top of this to implement the different levels of bases: ordered bases need to implement "less" and "index", graded bases need to implement "degree", and word-like bases need to be ordered, graded, and implement "parents". A possible layering is sketched below.
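Here is one such layering, building directly on the BasisKey class above (the class names and signatures are assumptions, not settled designs). Virtual inheritance lets word-like keys be both ordered and graded, and the PackedKey handle from the earlier sketch is assumed as the key value type:

#include <cstddef>
#include <utility>

class OrderedBasisKey : public virtual BasisKey {
public:
    virtual bool less(const BasisKey& other) const noexcept = 0;
    virtual std::size_t index() const noexcept = 0;
};

class GradedBasisKey : public virtual BasisKey {
public:
    virtual int degree() const noexcept = 0;
};

class WordLikeBasisKey : public OrderedBasisKey, public GradedBasisKey {
public:
    // A word decomposes into the pair of parent keys described under
    // "Basis design" below.
    virtual std::pair<PackedKey, PackedKey> parents() const = 0;
};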

Basis design

The basis is responsible for decoding and interpreting BasisKey objects and for providing the means for vectors to structure themselves correctly.
Most of our key types are actually word-like, or tree-like, in the sense that one can extract two parent sub-words that combine to form the full word.
Tensor keys satisfy this by splitting off the first letter from the trailing word, and Hall basis keys satisfy this by taking the two unique Hall words whose bracket is the given word.
(For letters, this pairing is the "initial element" together with the letter itself.)
It is the responsibility of the basis to perform the following actions (a sketch of a corresponding interface follows the list):

  1. Convert a BasisKey into a string for printing.
  2. Determine whether two BasisKeys are equal.
  3. Compute the hash of a BasisKey.
  4. Determine whether a BasisKey is valid for the basis.
  5. When the basis is totally ordered:
    • determine whether one BasisKey precedes another,
    • compute the index of a BasisKey according to the basis order,
    • compute the BasisKey at a given index within the basis order,
    • produce an iterable that enumerates all keys associated with the basis.
  6. When the basis is graded:
    • compute the degree of a given BasisKey,
    • report the constant maximum degree supported,
    • produce iterables that enumerate all keys of a given degree.
  7. When the basis is word-like it should be both graded and totally ordered, and:
    • determine whether a BasisKey represents a letter or a word,
    • convert a BasisKey representing a letter into a letter (integer),
    • convert a BasisKey representing a word into a pair of BasisKeys representing its parents.
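Here is a sketch of what such a Basis interface might look like; names and signatures are assumptions. It reuses the hypothetical PackedKey from the earlier sketch, std::size_t and int stand in for RoughPy's index/degree typedefs, and the eager std::vector returns stand in for lazy iterables. The is_* queries gate the optional capability groups:

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class BasisInterface {
public:
    virtual ~BasisInterface() = default;

    // (1)-(4): required of every basis.
    virtual std::string to_string(PackedKey key) const = 0;
    virtual bool equals(PackedKey k1, PackedKey k2) const = 0;
    virtual std::size_t hash(PackedKey key) const = 0;
    virtual bool has_key(PackedKey key) const noexcept = 0;

    // (5): only meaningful when is_ordered() returns true.
    virtual bool is_ordered() const noexcept { return false; }
    virtual bool less(PackedKey k1, PackedKey k2) const { return false; }
    virtual std::size_t to_index(PackedKey key) const { return 0; }
    virtual PackedKey to_key(std::size_t index) const { return PackedKey(index, 0); }
    virtual std::vector<PackedKey> iterate_keys() const { return {}; }

    // (6): only meaningful when is_graded() returns true.
    virtual bool is_graded() const noexcept { return false; }
    virtual int degree(PackedKey key) const { return 0; }
    virtual int max_degree() const noexcept { return 0; }
    virtual std::vector<PackedKey> keys_of_degree(int deg) const { return {}; }

    // (7): word-like implies ordered and graded.
    virtual bool is_word_like() const noexcept { return false; }
    virtual bool is_letter(PackedKey key) const { return false; }
    virtual int to_letter(PackedKey key) const { return 0; }
    virtual std::pair<PackedKey, PackedKey> parents(PackedKey key) const { return {key, key}; }
};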
inakleinbottle commented
Everything here has to be done with kernels (using the tools from platform/devices). This includes vector arithmetic, all of the different multiplications - free tensor, (half) shuffles, Lie - and the various other operations that are needed. The advantage of this approach is that it is completely agnostic about where the data is stored and where the computations take place; those details are handled by the device runtime and the dispatcher. There are a couple of problems, though. First, while we have kernel implementations of most of these operations, we don't yet have a working implementation of the Lie multiplication. Second, we have working implementations for densely stored vectors, but nothing that will currently work for sparse vectors.

I expect that Lie multiplication won't be too hard to construct as a kernel for the "standard" Hall basis and densely stored vectors, provided we can understand the index pattern of the multiplication.

Sparse computations are far more complicated. We don't know, a priori, which keys will have non-zero coefficients in the result. This is especially problematic for non-standard bases, though no such bases are needed at the moment. For ordered bases, we can do some work in advance to figure out which keys might appear in the result (a sketch follows); the kernel can then populate the result buffer. However, we will have to do one additional pass to filter out the coefficients that end up being zero - unless we drop this restriction on sparse representations.
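The CPU pre-pass for ordered bases might look something like the following sketch, assuming keys are represented by their indices in the basis total order. The MultiplicationRule callback, which enumerates the keys appearing in the product of two basis elements, is hypothetical:

#include <algorithm>
#include <cstddef>
#include <vector>

using KeyIndex = std::size_t;

template <typename MultiplicationRule>
std::vector<KeyIndex> candidate_result_keys(
        const std::vector<KeyIndex>& lhs_keys,
        const std::vector<KeyIndex>& rhs_keys,
        MultiplicationRule&& rule)
{
    std::vector<KeyIndex> candidates;
    for (KeyIndex lk : lhs_keys) {
        for (KeyIndex rk : rhs_keys) {
            for (KeyIndex out : rule(lk, rk)) {
                candidates.push_back(out);
            }
        }
    }
    // Sort and deduplicate to obtain a valid sparse layout; the kernel
    // then writes coefficients into these slots, and a final pass can
    // filter out entries that turn out to be zero.
    std::sort(candidates.begin(), candidates.end());
    candidates.erase(std::unique(candidates.begin(), candidates.end()),
                     candidates.end());
    return candidates;
}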

inakleinbottle commented Feb 14, 2024

The next problem to tackle is kernel design. I think I've got this down now, but there is a lot to think about.

So, on the vector side, most of the operations are either in-place or out-of-place binary operations. These necessarily need separate kernels because the operation mode is different. Moreover, each kernel needs to accommodate certain combinations of sparsity in its input arguments, and the name of the kernel will need to disambiguate between all of these possible cases. On top of this, there are a number of other design considerations:

  1. The main concern will be constructing kernels that operate purely on sparse data. Obviously, the keys that might appear in a purely sparse setting are only known once the computation is complete. This is not a problem in sequential programming, because we can simply write the values into a map (or equivalent) whenever the operation yields a non-zero element. However, it is very tricky (though not necessarily impossible) on a device that does not operate sequentially (i.e. a GPU). One way around this might be to compute the array of possible key combinations on the CPU before delegating the actual calculation to the GPU, but this is obviously not optimal. Any other sparse operation where the output is dense is rather trivial, and we might be able to reduce the total number of kernels by combining some of these with the dense versions.
  2. Related to the first, there needs to be some algorithm for deciding whether the output should be sparse or dense, depending on the input vectors (see the sketch after this list). My current thinking is that we should always promote the result to dense whenever either of the input vectors is dense, though this might not always be reasonable. This has the added benefit of cutting down the combinations of kernel types, since we won't need in-place kernels where the lhs is sparse and the rhs dense - the lhs would be promoted, and dense-dense would apply. There might be better heuristics for deciding when to promote based on "density", but the simple algorithm will work for now.
  3. One perpetual problem we will face is setting the kernel launch parameters for a job. The way I've designed the kernels, the launch parameters are passed to the kernel via an object, and the kernel may then modify them to fit within centrally configured or hardware launch constraints. These constraints might be device specific, so they can only be handled by the kernel. The launch parameters that are passed in should therefore just describe the size of the job and the number of working dimensions; for vector operations, the dimension is always 1, since they are linear kernels. The only tricky part is dealing with sparse vectors, although there are some obvious choices. For instance, if the rhs is sparse but the result is dense (in-place operation), then the kernel parameters should reflect the rhs size rather than the full size, since this is the binding constraint.
  4. Resizing will also be an interesting problem to solve. When both vectors are dense, the max of the two input sizes is appropriate, but things get more complicated for sparse operations. Most sparse operations will yield at most lhs.size() + rhs.size() entries, although this bound is only attained when the key sets do not overlap. This at least gives us something to work with, but in general I would expect at least a small amount of overlap when both sizes are sufficiently large.
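The promotion and sizing rules from items 2 and 4 condense to something like this sketch (the VectorInfo struct is invented for illustration; "size" means the dense dimension for dense vectors and the entry count for sparse ones):

#include <algorithm>
#include <cstddef>

struct VectorInfo {
    std::size_t size;
    bool is_dense;
};

// Item 2: promote the result to dense whenever either operand is dense.
inline bool result_is_dense(const VectorInfo& lhs, const VectorInfo& rhs)
{
    return lhs.is_dense || rhs.is_dense;
}

// Item 4: a dense result needs max(sizes) (assuming any sparse operand
// is promoted first); a sparse result needs at most lhs.size + rhs.size
// entries, attained only when the key sets are disjoint.
inline std::size_t result_capacity(const VectorInfo& lhs, const VectorInfo& rhs)
{
    if (result_is_dense(lhs, rhs)) {
        return std::max(lhs.size, rhs.size);
    }
    return lhs.size + rhs.size;
}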

With all this in mind, I've come up with a rather simple interface to deal with all this. The Vector class needs three main methods:

class Vector {
    ...
    devices::Kernel get_kernel(OperationType kernel_type, string_view kernel_name, string_view suffix = "") const;
    void check_and_resize_for_operands(const Vector& lhs, const Vector& rhs);
    void apply_binary_kernel(string_view kernel_name, const Vector& lhs, const Vector& rhs, optional<scalars::Scalar> multiplier = {});
    ...
};

Here, OperationType is an enumeration describing the type of kernel to be retrieved; currently six variants are needed, covering in-place and out-of-place versions of unary, binary, and ternary operations. The get_kernel method constructs the name of the kernel to be retrieved in the form (for binary kernels at least)

<kernel_name>_<stype_id>_<sparse|dense>_<suffix>

The suffix is a string of letters s or d signifying sparse or dense arguments (and possibly other metadata in the future). The <sparse|dense> component refers to the output sparsity. Note that the output might be dense whilst both arguments are sparse, even though this case won't be supported immediately. A full list of the kernels I think we need is at the end of this post.
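For concreteness, get_kernel's name assembly might look like the following sketch; the stype_id values (e.g. "f64") and the argument names are illustrative:

#include <string>

// Assembles <kernel_name>_<stype_id>_<sparse|dense>_<suffix>.
inline std::string make_kernel_name(
        const std::string& kernel_name,
        const std::string& stype_id,   // scalar type id, e.g. "f64"
        bool result_is_sparse,
        const std::string& suffix)     // e.g. "sd": sparse lhs, dense rhs
{
    std::string name = kernel_name + '_' + stype_id + '_'
            + (result_is_sparse ? "sparse" : "dense");
    if (!suffix.empty()) {
        name += '_';
        name += suffix;
    }
    return name;
}

// e.g. make_kernel_name("add_inplace", "f64", false, "sd")
//      yields "add_inplace_f64_dense_sd"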

I originally coded the OpenCL kernels in C, using macro hell to provide support for the many different scalar types. However, this was rather silly for a number of reasons. First, it is impossible to read. Second, and more importantly, there are much better ways to do this. Ideally, one might use SYCL and pre-compilation (see separated builds in #73) so that C++ templates define the kernels in a sane way. Alternatively, we can just use CMake to generate the files at build time, which requires very little extra work.

Here is a list of the vector kernels I think we need so far (name only, not including sparsity or scalar type variations).

  • copy (unary, out-of-place)
  • uminus (unary, out-of-place)
  • inplace_left_smul (unary, in-place, scalar arg)
  • inplace_right_smul (unary, in-place, scalar arg)
  • set_scalar (unary, in-place, scalar arg)
  • left_smul (unary, out-of-place, scalar arg)
  • right_smul (unary, out-of-place, scalar arg)
  • add_inplace (binary, in-place)
  • sub_inplace (binary, in-place)
  • add_scal_mul (binary, in-place, scalar arg)
  • sub_scal_mul (binary, in-place, scalar arg)
  • add (binary, out-of-place)
  • sub (binary, out-of-place)
  • min (binary, out-of-place, optional)
  • max (binary, out-of-place, optional)
  • equals (binary, out-of-place, returns bool)
  • less (binary, out-of-place, returns bool, optional)
  • greater (binary, out-of-place, returns bool, optional)

We don't need scalar-divide functions, since we can implement these using their scalar-multiply counterparts without too much effort: dividing by a scalar is just multiplying by its reciprocal (at least for field-like coefficients).
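In plain-double terms (a sketch, not the actual kernel code):

#include <cstddef>

inline void right_smul_inplace(double* data, std::size_t n, double s)
{
    for (std::size_t i = 0; i < n; ++i) { data[i] *= s; }
}

// Division reduces to multiplication by the reciprocal, so the
// scalar-multiply kernels cover it.
inline void right_sdiv_inplace(double* data, std::size_t n, double s)
{
    right_smul_inplace(data, n, 1.0 / s);
}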
