Should all the flavors (serial, parallel, distributed) of a given std datatype (list, set, map) be defined in a single module? #22218

Open
bradcray opened this issue May 2, 2023 · 9 comments

Comments

@bradcray
Member

bradcray commented May 2, 2023

Though we don't have more than drafts of them yet, our plan is to support three different versions of each of our standard datatypes: One for serial computing, one for parallel single-locale, and one for distributed multi-locale. This issue asks two questions:

  • Should these various variants be stored in a single module or a module per type? (e.g., List(s) vs. List, ParList, DistList)

  • If the answer to the previous question is "a single module", should the name be singular or plural? (e.g., List vs. Lists)

@bradcray
Member Author

bradcray commented May 2, 2023

For the first question, "a single module" seems pretty clearly preferable to me—both to keep the module name short and sweet, and because a single type per module seems unattractive to me.

For the second, my intuition would be to make it plural (since it would contain multiple lists), but I know sometimes people have different takes on that question. I'm going to tag @mppf specifically, who I think has asked this question at times (maybe even for one of these cases, maybe because at the time it only had one type in it).

@mppf
Member

mppf commented May 3, 2023

I wouldn't expect these to be in a single module for the same reasons that the array distributions are not all in a single module. I would expect the different implementations (serial, parallel, distributed) to be very different. Another concern is that if they are combined, to some degree, compiling something that has e.g. use List will use more time to compile because the compiler has to process all of the stuff about the distributed / parallel versions, even if these were not desired. Additionally, I am imagining that these data structures will have different authors and maintainers, and that is easier to handle when they are in different files.

All that said, this is not a strong preference, and in particular, in the standard library I can see the argument for convenience / needing to choose fewer module names. Perhaps we would choose an in-between solution such as having submodules in different files for the different data structures & re-exporting them.
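As a rough illustration of that in-between option, here is a minimal sketch of a parent module that re-exports its submodules. All names here (`MyList`, `Serial`, `Parallel`, `Distributed`, `serialList`, `parList`, `distList`) are hypothetical placeholders, not existing or planned standard library symbols, and the records are empty stand-ins for real implementations.

```chapel
// Hypothetical layout; none of these modules or types exist in the
// standard library.
module MyList {
  // Each submodule is written inline here to keep the sketch in one file.
  // They could presumably live in separate files instead (for example via
  // `include module Serial;`), which is an assumption about how the
  // "different files" part of the suggestion would be realized.
  module Serial {
    record serialList { type eltType; }
  }

  module Parallel {
    record parList { type eltType; }
  }

  module Distributed {
    record distList { type eltType; }
  }

  // Re-export the submodules' public symbols so that a plain
  // `use MyList;` makes all three types visible.
  public use this.Serial;
  public use this.Parallel;
  public use this.Distributed;
}

module App {
  use MyList;

  proc main() {
    var a: serialList(int);
    var d: distList(int);
    writeln(a.type:string, " and ", d.type:string);
  }
}
```

With this shape, users keep a single module name to remember, while each flavor can still be authored and maintained separately. Whether the re-export avoids the compile-time cost raised above depends on the compiler and is not something this sketch demonstrates.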

I wouldn't change the name of the List module to Lists over this. Even if List contains several list variants, I think it's acceptable to name it List. If we were starting from 0 maybe I'd pick Lists, but it seems to be a wash / arbitrary choice to me.

@bradcray
Member Author

bradcray commented May 3, 2023

I wouldn't expect these to be in a single module... Additionally, I am imagining that these data structures will have different authors and maintainers, and that is easier to handle when they are in different files.

I was imagining an opposite argument to some of yours: That in trying to keep a coherent interface between the three variations on a type, having them in a single module would simplify that sort of maintenance rather than distributing it across files. I also imagine that there will likely be some shared code between at least some of the variations on these types. The compilation time point is concerning, but seems like something we should be able to optimize going forward, at least to an extent.

The comparison to distributions doesn't hold for me because we imagine an arbitrary number of distributions going forward, but only three flavors of these standard types.

Perhaps we would choose an in-between solution such as having submodules in different files for the different data structures & re-exporting them.

I'm open to that as well.

@lydia-duncan
Member

The comparison to distributions doesn't hold for me because we imagine an arbitrary number of distributions going forward, but only three flavors of these standard types.

We were considering support for Cyclic and BlockCyclic-style variants for list

@bradcray
Member Author

bradcray commented May 3, 2023

We were considering support for Cyclic and BlockCyclic-style variants for list

That still seems different to me—even if the standard modules support two flavors of distributed lists (whether as one type or two), that doesn't feel the same as the arbitrary number of distributions I expect arrays will support over time. To me, the better analogy to distributions would be "We don't want to put all collections in a single module because we'll support more and more collection types over time."

@lydia-duncan
Member

I mean, depending on what other distributions we had in mind for arrays, I could see some of them being useful for lists as well. It just seems like it would be on a more case-by-case basis rather than applying to any distribution style.

Put another way, in a bold distributed list future where it feels like a fully blessed and integrated type that we support, I would hope that any new distribution would consider whether it should be applied to both arrays and lists rather than only thinking about arrays.

@lydia-duncan
Member

Prep for an upcoming ad-hoc subteam (anticipated to start during the August 22nd sprint)

1. What is the API being presented?

This issue is about the module organization and naming of Map.chpl, List.chpl and Set.chpl, especially motivated by the anticipated addition of a parallel and distributed version of each.

Map.chpl:

module Map {
  …
  record map { … }
  …
}

List.chpl:

module List {
  …
  record list { … }
  …
}

Set.chpl:

module Set {
  …
  record set { … }
  …
}

There are two questions to resolve:
A. Do we want the distributed and parallel versions to live in the same module as the serial version, or a separate one?
B. If we want them to live in the same module, should the module name be plural? E.g. Maps.chpl, etc.

How is it intended to be used?

The answer to this is really intrinsic to the answer to those questions.

  • Do we expect most users to want a mix of serial, parallel and distributed versions of each collection in their program? Or do we expect they’ll mostly use one style or the other?
    • Including more than the user is likely to need can contribute to:
      • Higher compilation times
      • Potentially more naming conflicts
  • Do we expect to support a variety of different implementations for the type in the future? Or do we expect the serial version we have stabilized today to be the only base serial version?
    • Even if we do expect to have multiple implementations of the serial one, do we expect users to use/import different modules to access the other implementations, or do we expect them to be able to swap between them without adjusting their use/import statements?

How is it being used in Arkouda and CHAMPS?

Map:

  • The module is used 15-16 times in Arkouda
    • in src/CommandMap.chpl
    • in src/FileIO.chpl
    • in src/HashMsg.chpl
    • in src/GenSymIO.chpl
    • in src/RandArray.chpl
    • in src/RegistrationMsg.chpl
    • in src/SegmentedMsg.chpl
    • in src/ParquetMsg.chpl
    • in src/StatsMsg.chpl
    • in src/TimeClassMsg.chpl
    • in src/MultiTypeSymbolTable.chpl
    • in src/Message.chpl
    • in src/HDF5Msg.chpl
    • in src/IndexingMsg.chpl
    • in src/MetricsMsg.chpl
    • in the Map compatibility module (which doesn’t really count on its own)
  • The module is used 6 times in CHAMPS
    • in meshTool/meshDeformation.chpl
    • in meshTool/src/globalSupportVolumeMeshDeformation.chpl
    • in meshTool/src/meshDeformationInputs.chpl
    • in potential/src/potentialMeshIO.chpl
    • in potential/src/potentialSystem.chpl
    • in potential/src/potentialPreprocessing.chpl

List:

  • The module is used 11-12 times in Arkouda
    • in src/Message.chpl
    • in src/BigIntMsg.chpl
    • in src/GenSymIO.chpl
    • in toys/LisExpr.chpl
    • in src/MetricsMsg.chpl
    • in src/CSVMsg.chpl
    • in src/RegistrationMsg.chpl
    • in src/ServerDaemon.chpl
    • in src/IndexingMsg.chpl
    • in src/ParquetMsg.chpl
    • in src/HDF5Msg.chpl
    • in the List compatibility module (which doesn’t really count on its own)
  • The module is used 20 times in CHAMPS
    • in postProcessor/postLink.chpl
    • in postProcessor/post.chpl
    • in EXT_LIBS/src/HDF5api.chpl
    • in EXT_LIBS/src/GCNSapi.chpl
    • in common/src/krylovPreconditioner.chpl
    • in drop/src/dropletModel.chpl
    • in potential/src/potentialMeshIO.chpl
    • in potential/src/spanwiseSections.chpl
    • in potential/src/potentialSystem.chpl
    • in potential/src/viscousCouple.chpl
    • in potential/src/potentialPreprocessing.chpl
    • in stochasticIcing/src/stochasticIcingSystem.chpl
    • in stochasticIcing/src/geometricTools.chpl
    • 2 times in stochasticIcing/src/dropletTrajectoryModel.chpl
    • in preProcessor/prep.chpl
    • in preProcessor/src/facetsWeighting.chpl
    • in preProcessor/src/ov_overlap.chpl
    • in preProcessor/src/holecut/xraysCollection.chpl
    • in preProcessor/src/holecut/xray.chpl

Set:

  • The module is used 10 times in Arkouda
    • in src/Logging.chpl
    • in src/GenSymIO.chpl
    • in src/CommandMap.chpl
    • in src/MultiTypeSymEntry.chpl
    • in src/CSVMsg.chpl
    • in src/RegistrationMsg.chpl
    • 3 times in src/OperatorMsg.chpl
    • in src/HDF5Msg.chpl
  • The module is used 4 times in CHAMPS
    • in postProcessor/post.chpl
    • in geo/src/geoModel.chpl
    • in EXT_LIBS/src/METISapi.chpl
    • in meshTool/src/hyperbolicGrid.chpl

2. What's the history of the feature, if it already exists?


The Set module was added in early August of 2019, and was originally named Sets. When we added the Map module in late August of 2019, we decided to rename Sets to Set to match Map instead of the other way around.

The List module originally contained a simple linked list implementation and was added in 2007. When we made the LinkedList module in March of 2019, we deprecated the List module in favor of it. There was also a Lists module added in May of 2019. When we renamed the Sets module in August of 2019, we also renamed the Lists module to List and removed the deprecated List module.

The Map module was added in late August of 2019 to replace using associative arrays like maps. The discussion at the time it was added led us to rename the Lists and Sets modules and to make sure we didn’t call the module Maps (#13749).

3. What's the precedent in other languages, if they support it?

Other languages either don’t support parallel or distributed versions, or if they do, they support them as separate types living in separate places. Several languages provide these collection types by default, which means they can be thought of as in the same place but not using a plural name to access them (or any module name really). When there are multiple implementations for a type, they typically live in their own location, though there are examples of living in the same module as another, more commonly used type and using that type’s name for the general module name.

a. Python

Python doesn’t handle these collection styles in the same way we do. Dictionaries are provided as a core part of the language.

b. C/C++

C/C++ don’t use the same namespace for headers and type names, so this is a bit of an apples-to-oranges comparison, but the map, list, and set types are provided by <map> (https://cplusplus.com/reference/map/map/), <list> (https://cplusplus.com/reference/list/list/) and <set> (https://cplusplus.com/reference/set/set/) in C++. There are also <forward_list> (https://cplusplus.com/reference/forward_list/forward_list/), <unordered_map> (https://cplusplus.com/reference/unordered_map/unordered_map/), and <unordered_set> (https://cplusplus.com/reference/unordered_set/). Note that the latter two headers each define multiple types (e.g. unordered_set and unordered_multiset) but still use the singular name.

c. Rust

Rust has a collections module which contains submodules for the individual collection types (https://doc.rust-lang.org/std/collections/index.html). There are separate modules when a type has multiple implementations, e.g. btree_map (https://doc.rust-lang.org/std/collections/btree_map/index.html) and hash_map (https://doc.rust-lang.org/std/collections/hash_map/index.html).

Rust also has a crate for handling data parallelism on the collections (https://docs.rs/rayon/latest/rayon/), and a crate specifically for concurrent hash maps (https://docs.rs/chashmap/2.2.2/chashmap/index.html)

d. Swift

There’s a community-contributed Concurrent Collections package that contains a concurrent dictionary (https://github.com/peterprokop/SwiftConcurrentCollections/blob/master/Sources/SwiftConcurrentCollections/ConcurrentDictionary.swift) which is by definition in a different location than the dictionary type Swift normally provides.

e. Julia

I believe Julia provides its collections by default as part of the Base module. There are three different dictionary types defined there (https://docs.julialang.org/en/v1/base/collections/#Base.Dict, https://docs.julialang.org/en/v1/base/collections/#Base.IdDict and https://docs.julialang.org/en/v1/base/collections/#Base.WeakKeyDict), and two set types (https://docs.julialang.org/en/v1/base/collections/#Base.Set and https://docs.julialang.org/en/v1/base/collections/#Base.BitSet)

f. Go

Couldn’t find a set or list type. There are lots of packages for separate concurrent map implementations, and serial maps are provided by default.


4. Are there known Github issues with the feature?

5. Are there features it is related to? What impact would changing or adding this feature have on those other features?

The most natural feature to compare the collection types to is arrays and their distribution strategies. Today, the main crux of the array type is defined in modules/internal/ChapelArray.chpl (with domains being defined in modules/internal/ChapelDomain.chpl, ranges defined in modules/internal/ChapelRange.chpl, and the various distributions defined in modules/dists/ in their own file and a layout in modules/layouts/LayoutCS.chpl). In practice, this means that “parallel” arrays are defined in the same file as “serial” arrays, while any distributed array operation requires a separate use or import statement.

I don’t anticipate any changes we make here impacting our array organization for two reasons:

  • The base array type is defined by the language, in the internal modules. This organization is not user facing so doesn’t impact the organization of other types that are provided as libraries
  • I cannot envision a world where we try to put all the various distributions in a single file. Maybe (and it’s a big maybe), we re-export them all as submodules of another parent module that we introduce in the future. But even then, they probably should live in individual submodules

The other features to compare this to are:

  • modules/standard
    • Heap.chpl
  • modules/packages
    • Collection.chpl (which really defines an interface)
    • ConcurrentMap.chpl
    • DistributedBag.chpl
    • DistributedDeque.chpl
    • LinkedLists.chpl
    • LockFreeQueue.chpl
    • LockFreeStack.chpl
    • OrderedMap.chpl
    • OrderedSet.chpl
    • SortedMap.chpl
    • SortedSet.chpl
    • UnrolledLinkedList.chpl

None of the features in that category are stabilized today. Some are precursors to what we intend for the parallel or distributed version of the collection types (and so might go away). Some will be considered for stabilization in the future, and so will probably be adjusted to be in line with whatever decision we make here when we are ready to stabilize them.

6. How do you propose to solve the problem?


For question A (Do we want the distributed and parallel versions to live in the same module as the serial version, or a separate one?):

A1. Put the distributed and parallel versions in their own module when we add them

  • Pros:
    • Avoids bringing in the (likely extensive) additional symbols required to implement these versions in the case where a user only wants a serial version
    • Consistent with other languages that don’t provide the collections by default
    • Allows adding multiple implementations of the serial version without crowding the module
  • Cons:
    • May make sharing code a little more difficult

A2. Put the distributed and parallel versions in the same module as the serial implementation

  • Pros:
    • Makes sharing code easier
  • Cons:
    • Could get crowded, potentially increasing compile time
    • Not really consistent with other languages

For question B (If we want them to live in the same module, should the module name be plural?)

B1. yes

  • Pros:
    • More clearly implies that there are multiple versions in it
  • Cons:
    • No precedent for this in other languages, even with multiple serial implementations
    • We did this earlier, but moved away from it. Going back would mean additional churn and there were some users that were unhappy with it before (though that was with only a single type in the module)

B2. no

  • Pros:
    • Less impact on users
    • We already moved away from it before
    • There are examples of other languages having multiple versions in the same module/package/whatever without a plural name
  • Cons:
    • Doesn’t clearly indicate that there are multiple versions to choose from in it, users might fail to notice them.

I think they should live in separate modules. If we do put them in the same module, it seems reasonable to me to rename the module to the plural form, though.

@jeremiah-corrado
Contributor

jeremiah-corrado commented Aug 21, 2023

I probably won't be able to attend the meeting, but here is my take on the two questions:

A. Do we want the distributed and parallel versions to live in the same module as the serial version, or a separate one?

I'd prefer to have them in separate modules, primarily based on Michael's point above about compilation time (i.e., use Map would require the compiler to process all the versions of map).

A post 2.0 consideration: I'd prefer to have a more hierarchical module structure where other map types, for example, live in sub-modules within the Map module (ex: import Map.Parallel.parMap). This would keep things more organized IMO while alleviating the compilation time concern and the concern about each of the collection modules growing arbitrarily large due to the addition of more types over time.
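To make the hierarchical idea concrete, here is a minimal sketch in which the alternative type is reachable only through its submodule path, with no re-export from the parent. `MyMap`, `Parallel`, `serialMap`, and `parMap` are hypothetical placeholder names chosen for this illustration; the real `Map` module has no such submodule today.

```chapel
// Hypothetical names throughout; the records are empty stand-ins.
module MyMap {
  // The base (serial) type lives directly in the parent module.
  record serialMap {
    type keyType;
    type valType;
  }

  // The parallel variant is nested, so it is only named through
  // `MyMap.Parallel`.
  module Parallel {
    record parMap {
      type keyType;
      type valType;
    }
  }
}

module App {
  // Only `parMap` is brought into scope; `serialMap` is not visible here.
  import MyMap.Parallel.parMap;

  proc main() {
    var m: parMap(string, int);
    writeln(m.type:string);
  }
}
```

Because the parent does not re-export `Parallel`, code that only wants the serial type never has to name the nested module, which is the organizational benefit described above.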

B. If we want them to live in the same module, should the module name be plural? E.g. Maps.chpl, etc.

I don't think pluralizing is necessary from an aesthetic perspective. I.e., the following is not offensive to me:

import Map.parallelMap;

Doesn’t clearly indicate that there are multiple versions to choose from in it, users might fail to notice them

I think this concern could be alleviated by summarizing the various collection types and providing links to them at the beginning of the module's documentation. The name of the module itself still wouldn't indicate that there are multiple types within but I don't think that particularly matters, as a new Chapel programmer who is looking for a Map collection — parallel, distributed or otherwise — would likely check in the Map module first, and then see that there are multiple collections there.

Generally, I don't think the churn caused by renaming such common modules would be worthwhile in this case.

@lydia-duncan
Member

In our meeting today, we decided that the additional implementations of a particular collection would live in their own module. We checked with Brad offline and he was okay with this decision.

We also polled on whether we would make the module name plural if we did put them in the same module, and decided against doing so.

Meeting notes:

  • Shreyas: it looks like there's a Collections interface, do the 2.0-eligible collections use it?
    • Lydia: no, and I expect it will be adjusted to be compliant with them if we do stabilize it
  • Ben M: strong preference for own modules
  • Shreyas: what impact would adding the type later have?
    • Lydia: any blanket uses of the module that they live in would bring it in, e.g.:
module User {
  use List; // not `use List only list;` or `import List.list;`

  var parallelList: list(int);
  proc foo() {
    // attempts to use `parallelList` would maybe get confused?
  }
}
  • Ben: not having the parallel/distributed implementations stable for 2.0 makes me more inclined to have separate modules
  • Ben: doesn't feel strongly either way
  • Shreyas: stronger preference. Prefers submodules to standalone modules
  • Ben: you can just rename the type on import, e.g. import Map.parallelMap as map;. Prefers that to changing the module name as well, but doesn't feel strongly
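For reference, the narrower alternatives to a blanket `use List;` mentioned in the notes above can be sketched with the existing `List` module as follows. The module name `UserImports`, the alias `seqList`, and the variable names are made up for this example; `pushBack` assumes a Chapel release in which that is the list append method's name.

```chapel
module UserImports {
  // Bring in only the `list` type; other symbols that the List module
  // has (or later gains) are not added to this scope.
  import List.list;

  // Renaming on import gives the same type a local alias.
  import List.list as seqList;

  proc main() {
    var xs: list(int);
    xs.pushBack(10);
    xs.pushBack(20);

    var ys: seqList(string);
    ys.pushBack("a");

    writeln(xs.size, " ", ys.size);
  }
}
```

Code written this way would be unaffected if further types were later added to the same module, which is the scenario Shreyas asked about.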

A1. votes: (Lydia, Shreyas in favor; Ben M strongly in favor)
A2. votes: (Lydia, Ben: support with concerns)

B1. votes: (Ben: strongly opposed)
B2. votes: (Shreyas, Ben: strongly in favor; Lydia in favor)
