Should all the flavors (serial, parallel, distributed) of a given std datatype (list, set, map) be defined in a single module? #22218

bradcray · 2023-05-02T21:31:30Z

Though we don't have more than drafts of them yet, our plan is to support three different versions of each of our standard datatypes: One for serial computing, one for parallel single-locale, and one for distributed multi-locale. This issue asks two questions:

Should these various variants be stored in a single module or a module per type? (e.g., List(s) vs. List, ParList, DistList)
If the answer to the previous question is "a single module", should the name be singular or plural? (e.g., List vs. Lists)

bradcray · 2023-05-02T21:33:35Z

For the first question, "a single module" seems pretty clearly preferable to me—both to keep the module name short and sweet, and because a single type per module seems unattractive to me.

For the second, my intuition would be to make it plural (since it would contain multiple lists), but I know sometimes people have different takes on that question. I'm going to tag @mppf specifically, who I think has asked this question at times (maybe even for one of these cases, maybe because at the time it only had one type in it).

mppf · 2023-05-03T14:34:36Z

I wouldn't expect these to be in a single module for the same reasons that the array distributions are not all in a single module. I would expect the different implementations (serial, parallel, distributed) to be very different. Another concern is that if they are combined, to some degree, compiling something that has e.g. use List will use more time to compile because the compiler has to process all of the stuff about the distributed / parallel versions, even if these were not desired. Additionally, I am imagining that these data structures will have different authors and maintainers, and that is easier to handle when they are in different files.

All that said, this is not a strong preference, and in particular, in the standard library I can see the argument for convenience / needing to choose fewer module names. Perhaps we would choose an in-between solution such as having submodules in different files for the different data structures & re-exporting them.
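As a rough illustration of the in-between option, here is a minimal single-file sketch. All names (Lists, Serial, Parallel, serialList, parList) are hypothetical and are not proposed library names. It assumes the submodules are re-exported with `public use`.

```chapel
// Hypothetical names throughout; this is a sketch, not a proposed API.
module Lists {
  // In a multi-file layout, each submodule could instead be declared as
  // `include module Serial;` with its body stored in Lists/Serial.chpl.
  module Serial {
    record serialList { var size: int; }
  }
  module Parallel {
    record parList { var size: int; }
  }

  // Re-export both submodules so that `use Lists;` sees both types.
  public use Serial, Parallel;
}

module Main {
  use Lists;

  proc main() {
    var a: serialList;   // found through the re-exported Serial submodule
    var b: parList;      // found through the re-exported Parallel submodule
    writeln(a.size, " ", b.size);
  }
}
```

A user who wants only one variant could still write `import Lists.Serial.serialList;` against the same layout.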

I wouldn't change the name of the List module to Lists over this. Even if List contains several list variants, I think it's acceptable to name it List. If we were starting from 0 maybe I'd pick Lists, but it seems to be a wash / arbitrary choice to me.

bradcray · 2023-05-03T17:56:53Z

> I wouldn't expect these to be in a single module... Additionally, I am imagining that these data structures will have different authors and maintainers, and that is easier to handle when they are in different files.

I was imagining an opposite argument to some of yours: That in trying to keep a coherent interface between the three variations on a type, having them in a single module would simplify that sort of maintenance rather than distributing it across files. I also imagine that there will likely be some shared code between at least some of the variations on these types. The compilation time point is concerning, but seems like something we should be able to optimize going forward, at least to an extent.

The comparison to distributions doesn't hold for me because we imagine an arbitrary number of distributions going forward, but only three flavors of these standard types.

> Perhaps we would choose an in-between solution such as having submodules in different files for the different data structures & re-exporting them.

I'm open to that as well.

lydia-duncan · 2023-05-03T18:20:30Z

> The comparison to distributions doesn't hold for me because we imagine an arbitrary number of distributions going forward, but only three flavors of these standard types.

We were considering support for Cyclic and BlockCyclic-style variants for list.

bradcray · 2023-05-03T18:24:55Z

> We were considering support for Cyclic and BlockCyclic-style variants for list

That still seems different to me—even if the standard modules support two flavors of distributed lists (whether as one type or two), that doesn't feel the same as the arbitrary number of distributions I expect arrays will support over time. To me, the better analogy to distributions would be "We don't want to put all collections in a single module because we'll support more and more collection types over time."

lydia-duncan · 2023-05-03T19:23:31Z

I mean, depending on what other distributions we had in mind for arrays, I could see some of them being useful for lists as well. It just seems like it would be on a more case-by-case basis rather than applying to any distribution style.

Put another way, in a bold distributed list future where it feels like a fully blessed and integrated type that we support, I would hope that any new distribution would consider whether it should be applied to both arrays and lists rather than only thinking about arrays.

lydia-duncan · 2023-08-18T17:04:53Z

Prep for an upcoming ad-hoc subteam (anticipated to start during the August 22nd sprint)

1. What is the API being presented?

This issue is about the module organization and naming of Map.chpl, List.chpl and Set.chpl, especially motivated by the anticipated addition of a parallel and distributed version of each.

Map.chpl:

module Map {
  …
  record map { … }
  …
}

List.chpl:

module List {
  …
  record list { … }
  …
}

Set.chpl:

module Set {
  …
  record set { … }
  …
}

There are two questions to resolve:
A. Do we want the distributed and parallel versions to live in the same module as the serial version, or a separate one?
B. If we want them to live in the same module, should the module name be plural? E.g. Maps.chpl, etc.

How is it intended to be used?

The answer to this is really intrinsic to the answer to those questions.

Do we expect most users to want a mix of serial, parallel and distributed versions of each collection in their program? Or do we expect they’ll mostly use one style or the other?
- Including more than the user is likely to need can contribute to:
  - Higher compilation times
  - Potentially more naming conflicts

Do we expect to support a variety of different implementations for the type in the future? Or do we expect the serial version we have stabilized today to be the only base serial version?
- Even if we do expect to have multiple implementations of the serial one, do we expect users to use/import different modules to access the other implementations, or do we expect them to be able to swap between them without adjusting their use/import statements?

How is it being used in Arkouda and CHAMPS?

Map:

The module is used 15-16 times in Arkouda
- in src/CommandMap.chpl
- in src/FileIO.chpl
- in src/HashMsg.chpl
- in src/GenSymIO.chpl
- in src/RandArray.chpl
- in src/RegistrationMsg.chpl
- in src/SegmentedMsg.chpl
- in src/ParquetMsg.chpl
- in src/StatsMsg.chpl
- in src/TimeClassMsg.chpl
- in src/MultiTypeSymbolTable.chpl
- in src/Message.chpl
- in src/HDF5Msg.chpl
- in src/IndexingMsg.chpl
- in src/MetricsMsg.chpl
- in the Map compatibility module (which doesn’t really count on its own)
The module is used 6 times in CHAMPS
- in meshTool/meshDeformation.chpl
- in meshTool/src/globalSupportVolumeMeshDeformation.chpl
- in meshTool/src/meshDeformationInputs.chpl
- in potential/src/potentialMeshIO.chpl
- in potential/src/potentialSystem.chpl
- in potential/src/potentialPreprocessing.chpl

List:

The module is used 11-12 times in Arkouda
- in src/Message.chpl
- in src/BigIntMsg.chpl
- in src/GenSymIO.chpl
- in toys/LisExpr.chpl
- in src/MetricsMsg.chpl
- in src/CSVMsg.chpl
- in src/RegistrationMsg.chpl
- in src/ServerDaemon.chpl
- in src/IndexingMsg.chpl
- in src/ParquetMsg.chpl
- in src/HDF5Msg.chpl
- in the List compatibility module (which doesn’t really count on its own)
The module is used 20 times in CHAMPS
- in postProcessor/postLink.chpl
- in postProcessor/post.chpl
- in EXT_LIBS/src/HDF5api.chpl
- in EXT_LIBS/src/GCNSapi.chpl
- in common/src/krylovPreconditioner.chpl
- in drop/src/dropletModel.chpl
- in potential/src/potentialMeshIO.chpl
- in potential/src/spanwiseSections.chpl
- in potential/src/potentialSystem.chpl
- in potential/src/viscousCouple.chpl
- in potential/src/potentialPreprocessing.chpl
- in stochasticIcing/src/stochasticIcingSystem.chpl
- in stochasticIcing/src/geometricTools.chpl
- 2 times in stochasticIcing/src/dropletTrajectoryModel.chpl
- in preProcessor/prep.chpl
- in preProcessor/src/facetsWeighting.chpl
- in preProcessor/src/ov_overlap.chpl
- in preProcessor/src/holecut/xraysCollection.chpl
- in preProcessor/src/holecut/xray.chpl

Set:

The module is used 10 times in Arkouda
- in src/Logging.chpl
- in src/GenSymIO.chpl
- in src/CommandMap.chpl
- in src/MultiTypeSymEntry.chpl
- in src/CSVMsg.chpl
- in src/RegistrationMsg.chpl
- 3 times in src/OperatorMsg.chpl
- in src/HDF5Msg.chpl
The module is used 4 times in CHAMPS
- in postProcessor/post.chpl
- in geo/src/geoModel.chpl
- in EXT_LIBS/src/METISapi.chpl
- in meshTool/src/hyperbolicGrid.chpl

2. What's the history of the feature, if it already exists? 

The Set module was added in early August of 2019, and was originally named Sets. When we added the Map module in late August of 2019, we decided to rename Sets to Set to match Map instead of the other way around.

The List module originally contained a simple linked list implementation and was added in 2007. When we made the LinkedList module in March of 2019, we deprecated the List module in favor of it. There was also a Lists module added in May of 2019. When we renamed the Sets module in August of 2019, we also renamed the Lists module to List and removed the deprecated List module.

The Map module was added in late August of 2019 to replace using associative arrays like maps. The discussion at the time it was added led to us renaming the Lists and Sets modules and ensuring that we didn’t call the module Maps (#13749).

3. What's the precedent in other languages, if they support it?

Other languages either don’t support parallel or distributed versions, or if they do, they support them as separate types living in separate places. Several languages provide these collection types by default, which means they can be thought of as in the same place but not using a plural name to access them (or any module name really). When there are multiple implementations for a type, they typically live in their own location, though there are examples of living in the same module as another, more commonly used type and using that type’s name for the general module name.

a. Python

Python doesn’t handle these collection styles in the same way we do. Dictionaries are provided as a core part of the language.

b. C/C++

C/C++ don’t use the same namespace for headers and type names. So it’s a little apples-to-oranges of a comparison, but the map, list, and set types all are provided by <map> (https://cplusplus.com/reference/map/map/), <list> (https://cplusplus.com/reference/list/list/) and <set> (https://cplusplus.com/reference/set/set/) in C++. There are also <forward_list> (https://cplusplus.com/reference/forward_list/forward_list/), <unordered_map> (https://cplusplus.com/reference/unordered_map/unordered_map/), and <unordered_set> (https://cplusplus.com/reference/unordered_set/). Note that the latter two headers define multiple types (unordered_set and unordered_multiset) but still use the singular name.

c. Rust

Rust has a collections module which contains submodules for the individual collection types (https://doc.rust-lang.org/std/collections/index.html). There are separate modules when a type has multiple implementations, e.g. btree_map (https://doc.rust-lang.org/std/collections/btree_map/index.html) and hash_map (https://doc.rust-lang.org/std/collections/hash_map/index.html).

Rust also has a crate for handling data parallelism on the collections (https://docs.rs/rayon/latest/rayon/), and a crate specifically for concurrent hash maps (https://docs.rs/chashmap/2.2.2/chashmap/index.html)

d. Swift

There’s a community-contributed Concurrent Collections package that contains a concurrent dictionary (https://github.com/peterprokop/SwiftConcurrentCollections/blob/master/Sources/SwiftConcurrentCollections/ConcurrentDictionary.swift) which is by definition in a different location than the dictionary type Swift normally provides.

e. Julia

I believe Julia provides its collections by default as part of the Base module. There are three different dictionary types defined there (https://docs.julialang.org/en/v1/base/collections/#Base.Dict, https://docs.julialang.org/en/v1/base/collections/#Base.IdDict and https://docs.julialang.org/en/v1/base/collections/#Base.WeakKeyDict), and two set types (https://docs.julialang.org/en/v1/base/collections/#Base.Set and https://docs.julialang.org/en/v1/base/collections/#Base.BitSet)

f. Go

Couldn’t find a set or list type. There are lots of packages for separate concurrent map implementations, and serial maps are provided by default.

4. Are there known GitHub issues with the feature?

Map:
- Only with the type itself
  - Convenience initializers for map and set #22776 asks for some additional convenience initializers
    - Also applies to set
  - Internal error in cullOverReferences.cpp adding record-wrapped class to a map #21193 tracks a compiler bug exposed by using maps
  - Nil dereference for maps of domains #20167 tracks an issue with storing domains in a map
  - Error trying to read map from file #20004 tracks an issue with reading a map from a file
  - [Patterns] should it be possible to specify the locality of a distributed map's keys? How? #18964 is a design discussion about handling locality in a distributed map
  - [Patterns] should we support combining distributed maps? What should that look like if we do? #18963 is a design discussion about combining distributed maps
  - Internal error using first class functions in map #18736 tracks a compiler bug exposed by using maps with FCFs
  - Using List and Map of user classes in a Record fails to compile #18005 tracks a compiler bug exposed by using maps
    - Also applies to list
  - Update Map documentation to note that modifying a map while iterating is illegal #16588 requests an update to the documentation
  - Map of class problems when map.table not default initialized #15960 tracks an implementation issue that was worked around
  - Add procedure merge(m: Map) to the Map module. #15313 requests adding a merge method
  - What should map.these() yield? #14718 talks about the behavior for the these method
  - Should (some) parallel-safe collection types use 'ref' task intents by default? #18876 asks whether there should be a way for types like map to opt in to using ref task intents
    - Also applies to list
  - Cannot use class type in Map value type #14423 tracks some confusion a user encountered
  - Collections 2.0: Collections & Distributed Data Structures Overhaul #8435 mentions a desire for a unidirectional and bidirectional map (though this was opened before the Map module was added, so it would be good to check if that has been fulfilled - I’m not familiar off hand with exactly what that means in this context)
  - [Patterns] Should distributed maps support parallel updates to values? #18962 asks about supporting parallel updates
List:
- Only with the type itself
  - Class containing two lists with owned elements fails to compile #22261 tracks a bug with storing lists of classes in a class
  - [feature request] Reverse method on list #19997 requests a “reverse” method on the type
  - [Patterns] How should elements be added to a distributed list? #18967 is a design discussion about how to add elements to a distributed list
  - Should (some) parallel-safe collection types use 'ref' task intents by default? #18876 also applies to list
  - Documentation improvements for the List module #18111 tracks some proposed documentation improvements
  - Complete list addition methods' overloads with respect to scalar and non-scalar arguments #18102 proposes some additions to the type for completeness (though it would benefit from a recent look at the state of things)
  - Using List and Map of user classes in a Record fails to compile #18005 also applies to list
  - Internal error: list of lists passed to proc #17319 tracks a compiler error with a list of lists
  - const checking error with list.first and list.last #17259 tracks a const checking error with list.first and list.last
  - should string.join accept a list? #17252 is a feature request for string.join
  - should list be assignable from a list of different but compatible type? #16715 is a feature request about allowing assignments from lists of different type
  - should we be able to assign to a list from array or range? #16714 is a feature request about assigning to a list from an array or range
  - Given a collection C, should we support isC[Type|Value] queries? #16171 is a feature request for isListType, etc.
  - Unified interface for List and so called Vector #15913 talks about the interface for both list and the vector type
  - Support exporting Chapel lists as Python lists #15423 tracks a desire to convert lists naturally to Python’s list type when interoperating.
  - Add procedure + overload for List and LinkedList #15373 requests + for lists and linked lists
  - Problems with lists of tuples of non-nil classes #15182 tracks a bug with lists of tuples of non-nil classes
  - New 'Lists' module is 5x slower than dynamic array, 2x slower than 'Array.push_back' with '--local' #13652 calls out a performance issue with the module
  - Bulk Transfer for 'list' data type #13583 talks about bulk moving a list across locales
  - Should we deprecate list.sort? #18100 talks about if we should continue to support list.sort (looks like this slipped through the 2.0 cracks, Jeremiah has volunteered to help deprecate that)
Set:
- Convenience initializers for map and set #22776 also applies to set
- Promoted add operation on set errors with "Racy promotion of scalar method receiver" #21256 tracks an error message about add called with an array
- Should the 'set' type support a method to get the only element in the set? #21166 requests a method for getting an element out of the set if it’s the only element in the set
- can't create a set of sets #19156 tracks a bug with creating a set of sets
- [Patterns] What should the result be if two distributed sets are equal except values are on different locales? #18965 is a design discussion about equality between distributed sets
- Should we have both operators and named functions/methods for set operations? #18654 requests additional methods
- Implement a method similar to the Python set.pop() #18652 requests an additional method
- Implement a new remove method for set that returns the item removed #18649 requests an additional method
- Zippering of two sets can fail even when they are the same length #15824 tracks a bug with zippering two sets

5. Are there features it is related to? What impact would changing or adding this feature have on those other features?

The most natural feature to compare the collection types to is arrays and their distribution strategies. Today, the main crux of the array type is defined in modules/internal/ChapelArray.chpl (with domains being defined in modules/internal/ChapelDomain.chpl, ranges defined in modules/internal/ChapelRange.chpl, and the various distributions defined in modules/dists/ in their own file and a layout in modules/layouts/LayoutCS.chpl). In practice, this means that “parallel” arrays are defined in the same file as “serial” arrays, while any distributed array operation requires a separate use or import statement.

I don’t anticipate any changes we make here impacting our array organization for two reasons:

- The base array type is defined by the language, in the internal modules. This organization is not user facing, so it doesn’t impact the organization of other types that are provided as libraries.
- I cannot envision a world where we try to put all the various distributions in a single file. Maybe (and it’s a big maybe), we re-export them all as submodules of another parent module that we introduce in the future. But even then, they probably should live in individual submodules.

The other features to compare this to are:

modules/standard
- Heap.chpl
modules/packages
- Collection.chpl (which really defines an interface)
- ConcurrentMap.chpl
- DistributedBag.chpl
- DistributedDeque.chpl
- LinkedLists.chpl
- LockFreeQueue.chpl
- LockFreeStack.chpl
- OrderedMap.chpl
- OrderedSet.chpl
- SortedMap.chpl
- SortedSet.chpl
- UnrolledLinkedList.chpl

None of that category of features have been stabilized today. Some are precursors to what we intend for the parallel or distributed version of the collection types (and so might go away). Some will be considered for stabilization in the future and so will probably be adjusted to be in line with whatever decision we make today when we are ready to stabilize them.

6. How do you propose to solve the problem? 

For question A (Do we want the distributed and parallel versions to live in the same module as the serial version, or a separate one?):

A1. Put the distributed and parallel versions in their own module when we add them

Pros:
- Avoids bringing in the (likely extensive) additional symbols required to implement these versions in the case where a user only wants a serial version
- Consistent with other languages that don’t provide the collections by default
- Allows adding multiple implementations of the serial version without crowding the module
Cons:
- May make sharing code a little more difficult

A2. Put the distributed and parallel versions in the same module as the serial implementation

Pros:
- Makes sharing code easier
Cons:
- Could get crowded, potentially increasing compile time
- Not really consistent with other languages

For question B (If we want them to live in the same module, should the module name be plural?)

B1. yes

Pros:
- More clearly implies that there are multiple versions in it
Cons:
- No precedent for this in other languages, even with multiple serial implementations
- We did this earlier, but moved away from it. Going back would mean additional churn and there were some users that were unhappy with it before (though that was with only a single type in the module)

B2. no

Pros:
- Less impact on users
- We already moved away from it before
- There are examples of other languages having multiple versions in the same module/package/whatever without a plural name
Cons:
- Doesn’t clearly indicate that there are multiple versions to choose from in it, users might fail to notice them.

I think they should live in separate modules. If we do put them in the same module, it seems reasonable to me to rename the module to the plural form, though.

jeremiah-corrado · 2023-08-21T15:18:40Z

I probably won't be able to attend the meeting, but here is my take on the two questions:

A. Do we want the distributed and parallel versions to live in the same module as the serial version, or a separate one?

I'd prefer to have them in separate modules, primarily based on Michael's point above about compilation time (i.e., use Map would require the compiler to process all the versions of map).

A post 2.0 consideration: I'd prefer to have a more hierarchical module structure where other map types, for example, live in sub-modules within the Map module (ex: import Map.Parallel.parMap). This would keep things more organized IMO while alleviating the compilation time concern and the concern about each of the collection modules growing arbitrarily large due to the addition of more types over time.
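A minimal sketch of what such a hierarchy could look like from the user's side. The names MyMap, serialMap, Parallel, and parMap are hypothetical stand-ins, chosen so the sketch does not collide with the real Map module. The records are placeholders, not real map implementations.

```chapel
// Hypothetical layout; every name here is illustrative only.
module MyMap {
  // The serial type lives directly in the parent module.
  record serialMap { var numEntries: int; }

  // Other variants live in submodules and are not re-exported.
  module Parallel {
    record parMap { var numEntries: int; }
  }
}

module Main {
  // Only the requested symbol is made visible in this scope.
  import MyMap.Parallel.parMap;

  proc main() {
    var m: parMap;
    writeln(m.numEntries);
  }
}
```

Because nothing is re-exported here, code that only imports `MyMap.serialMap` never names the Parallel submodule. Whether that also avoids the compile-time cost depends on how the compiler processes unused submodules, which this sketch does not address.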

B. If we want them to live in the same module, should the module name be plural? E.g. Maps.chpl, etc.

I don't think pluralizing is necessary from an aesthetic perspective. I.e., the following is not offensive to me:

import Map.parallelMap;

> Doesn’t clearly indicate that there are multiple versions to choose from in it, users might fail to notice them

I think this concern could be alleviated by summarizing the various collection types and providing links to them at the beginning of the module's documentation. The name of the module itself still wouldn't indicate that there are multiple types within but I don't think that particularly matters, as a new Chapel programmer who is looking for a Map collection — parallel, distributed or otherwise — would likely check in the Map module first, and then see that there are multiple collections there.

Generally, I don't think the churn caused by renaming such common modules would be worthwhile in this case.

lydia-duncan · 2023-08-22T21:58:00Z

In our meeting today, we decided that the additional implementations of a particular collection would live in their own module. We checked with Brad offline and he was okay with this decision.

We also polled on whether we would make the module name plural if the variants did share a module, and decided against doing so.

Meeting notes:

Shreyas: it looks like there's a Collections interface, do the 2.0-eligible collections use it?
- Lydia: no, and I expect it will be adjusted to be compliant with them if we do stabilize it
Ben M: strong preference for own modules
Shreyas: what impact would adding the type later have?
- Lydia: any blanket uses of the module that they live in would bring it in, e.g.:

module User {
  use List; // not `use List only list;` or `import List.list;`

  var parallelList: list(int); // element type added only so the declaration is concrete
  proc foo() {
    // attempts to use `parallelList` would maybe get confused?
  }
}
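For contrast, a minimal sketch of the limited form mentioned in the comment above, which brings in only the `list` symbol so that types added to the module later would not enter the user's scope. The module name User2 and the variable names are illustrative. `pushBack` is assumed to be the method name in the Chapel release being used (older releases called it `append`).

```chapel
module User2 {
  // Only `list` is made visible; any symbol added to List in a later
  // release would not be brought into this scope by this statement.
  use List only list;

  proc main() {
    var xs: list(int);
    xs.pushBack(1);
    xs.pushBack(2);
    writeln(xs.size);
  }
}
```

`import List.list;` has a similar effect on visibility and could be used in place of the limited `use`.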

Ben: not having the parallel/distributed implementations stable for 2.0 makes me more inclined to have separate modules
Ben: doesn't feel strongly either way
Shreyas: stronger preference. Prefers submodules to standalone modules
Ben: you can just rename the type. Would like to be able to write import Map.parallelMap as map; and prefers that to changing the module name as well, but doesn't feel strongly

A1. votes: (Lydia, Shreyas in favor; Ben M strongly in favor)
A2. votes: (Lydia, Ben: support with concerns)

B1. votes: (Ben: strongly opposed)
B2. votes: (Shreyas, Ben: strongly in favor; Lydia in favor)

bradcray added type: Design area: Libraries / Modules type: Stabilization labels May 2, 2023

lydia-duncan removed the type: Stabilization label Aug 28, 2023
