Data segmentation API framework #1539

Open
na-- opened this issue Jul 8, 2020 · 10 comments
Labels
enhancement, feature, evaluation needed (proposal needs to be validated or tested before fully implementing it in k6)

Comments

@na-- (Member) commented Jul 8, 2020

We now have support for partitioning work (i.e. VUs and iterations) between multiple k6 instances, via the executionSegment and executionSegmentSequence options, originally described in #997 and subsequently evolved in #1007. In the end, we had to implement striping even in the initial version (thus, the need for executionSegmentSequence 😞), because some executors like ramping-vus and the arrival-rate ones needed it for optimal performance.
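For reference, a minimal sketch of how those two options look in a script today, assuming two instances splitting the work 50/50 (the same values can also be passed via the --execution-segment and --execution-segment-sequence CLI flags):

export const options = {
  // this instance runs the first half of the test's VUs and iterations...
  executionSegment: '0:1/2',
  // ...and the full sequence lets every instance stripe its work consistently
  executionSegmentSequence: '0,1/2,1',
  scenarios: {
    load: { executor: 'constant-vus', vus: 10, duration: '1m' },
  },
};

export default function () {
  // regular test logic here
}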

The good news from that extra effort though, is that we now have most of the things we need to tackle data segmentation/partitioning between multiple k6 instances, without any external runtime scheduling between them. We need to refactor and improve some things in the initial implementation, like #1499, #1427, and #1386, but the rough building blocks are already here... 🎉 I'm making this issue as a place to discuss this effort, so that I can close #997, given that most of it is done.

Of course, we don't need to start implementing this right away. We "just" have to figure out how its JS API and options should look... 😅 This will allow us to start implementing things like streaming data support (#592), shared read-only memory (#532), a CSV API (#1021), a JSONPath API (#992), and an XML parsing API with XPath support. I don't think binary data handling (#1020) is going to be affected by this, but it probably deserves some thought as well.

If we have a clear idea of how the data segmentation should work, we can start implementing the issues above without the complicated data segmentation parts in their first versions, knowing we'd be able to add them at a later point, hopefully without having to completely refactor everything again. Currently, I think we can split the process like this:

  1. Figure out how data segmentation should look (this issue)
  2. Start implementing initial versions, without segmentation, of a streaming/shared data, CSV API, JSONPath, etc.
  3. Somewhere in the middle of 2, implement a simple JS API that basically provides segmented iterators, i.e. a new k6 JS API or APIs that provide iterators for which iter.next() returns the next item in a segmented and/or striped fashion (see the sketch after this list). This, combined with making sure that the new APIs from point 2 are compatible with these iterators, will immediately allow users to have data segmentation, albeit with a little bit of JS work and some minor loss of performance.
  4. Make sure that in the final version, everything is composable. For example, in my ideal UX scenario, it should be possible and natural/easy for users to make a segmented CSV reader on top of a shared/streaming data source that just works, while also being able to use any one of these 3 things individually 😅
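To illustrate point 3, a rough sketch of how such segmented iterators might be used from a script; segmentedIndexes is a hypothetical name used only for illustration, not an existing k6 API:

import { SharedArray } from 'k6/data';
// hypothetical future API, named here only for illustration
import { segmentedIndexes } from 'k6/data';

const users = new SharedArray('users', function () {
  return JSON.parse(open('./users.json'));
});

// each k6 instance would get a non-overlapping, striped stream of indexes,
// derived from its execution segment and the segment sequence
const it = segmentedIndexes(users.length);

export default function () {
  const user = users[it.next().value];
  // ... use `user` for this iteration's requests
}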
na-- added the enhancement, feature and evaluation needed labels Jul 8, 2020
na-- added this to the v0.28.0 milestone Jul 8, 2020
@na-- (Member Author) commented Jul 15, 2020

https://community.k6.io/t/how-to-distribute-vus-across-different-scenarios-with-k6/49/11 is another very common use case we have to take into account when designing this API. Again, having the above interfaces composable would be key, since then we should be easily able to make a clean helper function that solves the following use case:

We have N sets of credentials for our webapp/service/etc., and we want to spin up N VUs, each VU consistently using one of these sets of credentials to make its requests.
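A sketch of the clean helper this would boil down to, assuming a test-wide unique VU index like the one that was later exposed as exec.vu.idInTest in the k6/execution module (./credentials.json is just a stand-in for the real data source):

import exec from 'k6/execution';
import { SharedArray } from 'k6/data';

const credentials = new SharedArray('credentials', function () {
  return JSON.parse(open('./credentials.json')); // the N sets of credentials
});

export default function () {
  // idInTest is 1-based and unique across all instances, so every VU
  // consistently reuses "its own" set of credentials
  const creds = credentials[(exec.vu.idInTest - 1) % credentials.length];
  // ... log in and make requests with creds
}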

na-- modified the milestones: v0.28.0, v0.30.0 Sep 9, 2020
This was referenced Oct 9, 2020
na-- modified the milestones: v0.30.0, v0.31.0 Nov 30, 2020
@na-- (Member Author) commented Nov 30, 2020

Here's a use case that we should take into account when we implement this: https://community.k6.io/t/unique-test-data-per-vu-without-reserving-data-upfront/1136/5

Basically, something like a {executor: "shared-iterations", iterations: X, vus: Y} scenario will probably be enough for it, if we had an iterator that can tell us which iteration out of the X configured ones we are currently on. This should be fairly easy to do, and it might not even be the purview of this issue (generic segmented iterators), but rather of #1320.
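Sketching that idea, assuming the scenario-wide iteration number is exposed to scripts (roughly what #1320 later provided as exec.scenario.iterationInTest):

import exec from 'k6/execution';
import { SharedArray } from 'k6/data';

export const options = {
  scenarios: {
    consume: { executor: 'shared-iterations', iterations: 100, vus: 10 },
  },
};

const rows = new SharedArray('rows', function () {
  return JSON.parse(open('./data.json')); // needs at least `iterations` entries
});

export default function () {
  // every one of the X configured iterations gets its own row, across all VUs
  const row = rows[exec.scenario.iterationInTest];
  // ... use row
}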

@robingustafsson (Member) commented Apr 22, 2021

Data segmentation proposal

We should strive to implement it by separating the concerns of loading, parsing and accessing data.

The flow of data is as follows, composed of different parts that can be swapped out to meet the required properties:

  1. Load data from "Source" (file, string, HTTP response in setup() function, etc.)
  2. Parse data into a SharedTable (a SharedArray where columns can be referred to by name and access to rows is proxied through a method to make sure row selection rules are followed)
  3. Access data in table according to desired "Consumption Pattern"

Constraints:

  • Needs to be compatible with all three execution modes of k6: local, cloud and clustered (Clustering and distributed execution #140)
  • Good developer experience, which IMO means no extra configuration or external programs but rather relying on existing information like execution segments to manage data and segmentation between nodes
  • Support the following use cases:
    • Allow (pseudo-)sequential access to data rows
    • Allow random access to data rows
    • Allow data rows to be used once and once only
    • Allow sticky data rows: the same data row, pulled randomly from the set, should be returned when requested again by the same VU

k6 responsibilities:

  • Providing an API to load a local file into memory (the open() API)
  • Providing an API to parse a few common textual file formats like CSV, JSON and XML
  • Providing an API to share data/memory across VUs (the existing SharedArray and proposed SharedTable APIs)
  • Providing APIs to access shared tabular data while obeying certain properties such as random access, unique access and sticky-per-VU, all while letting the user be ignorant to how the test is being executed (local, cloud or clustered)
  • Figure out, based on execution segments, which segment of the data to use and when (if necessary)
  • Provide user with API to bail/abort a VU if it runs out of data rows

User responsibilities:

  • Making sure there's enough data in the data source to run the full length of the test
  • Making sure the data source can fit entirely into memory at once (the use case of huge files and streaming APIs is a separate topic IMO)

SharedTable

A structure with a tabular format, N rows and M columns:

Username Password ...
Data Data ...
Data Data ...
... ... ...

An API as follows is proposed based on Consumption Patterns described below:

let table = new SharedTable("some name", function() {
        // Load data from "Source"
        ...
        return {
            columns: ['Username', 'Password'],
            rows: [...]
        }
    }, {
        rowSelection: 'sequential'|'random'|'unique',
        updateRow: 'each-iteration'|'each-access'|'once',
        whenOutOfRows: 'abort-vu'|'continue-in-cycle'|'use-last-row',
        segment: true|false|Number
    });

export default function() {
    let row = table.selectRow();
    console.log(row[0], row['Username']);

    // Accessing a *column value* according to specified data Consumption Pattern could also be handled
    // by proxying the access through the index operator, but I think this might be too much magic and could
    // introduce unnecessary cognitive burden on user trying to understand someone else's script, or?
    console.log(table[0], table['Username']);
}

A SharedTable could also be constructed from a SharedArray or other Array-like object:

let arrayLikeObject = ...;
let table = SharedTable.from(arrayLikeObject, ['Column 1', 'Column 2', ...]);

Consumption Pattern

Different testing use cases call for differences in the desired consumption of data when parameterizing actions in a test. The user should have control of some parameters for how data should be consumed when requested by a VU. These parameters are heavily influenced by the options available in LoadRunner:

  1. Row selection pattern: Controls how to select data rows during the test when multiple VUs are running (and each VU often running multiple iterations). There are three options:
  • Sequential: data rows are consumed sequentially from the Source
  • Random: data rows are consumed randomly from the Source
  • Unique: data rows are selected uniquely from the Source
  2. Update row: Controls when a VU should update the data row it's consuming data from. Again there are three options:
  • Each iteration: update the data row selected by the VU at the start of each iteration
  • Each access: update the data row selected by the VU on each access to the data row
  • Once: update the data row only once
  3. When out of rows

For some combinations of 1) and 2), namely "Unique+Each iteration" and "Unique+Each access", there's a third parameter: when the VU runs out of unique data rows, what should happen? There are three options (again :)):

  • Abort VU: when VU runs out of data rows (after last row has been accessed) it is aborted and stops executing any further iterations
  • Continue in cyclical selection pattern: VU will continue executing iterations but will recycle data rows in a cyclical pattern
  • Continue with last row: continue executing VU but each data row access would now always return the last row before we ran out of rows
  4. Segmentation

Sometimes, when data files are really big, it can make sense to segment the data when running tests spanning multiple load gen machines. There are three options (what's up with the 3 options? 🙂):

  • true: segment data proportionally according to execution segments
  • false: don't segment data, all data is available on all load gen machines
  • Number: user decides how many rows each VU should have access to (I think deciding per VU would make most sense from a user perspective, rather than per load gen machine which is something the user might be completely abstracted away from, say in k6 Cloud)

The combination of these parameters defines what data rows (from the SharedTable) are to be selected by each VU for each iteration and each access/row selection (within an iteration):

  • Sequential+Each iteration: each VU starts from the top of the SharedTable, the first row, and iterates over rows top to bottom selecting a new row at the start of each iteration.

  • Sequential+Each access: each VU starts from the top of the SharedTable, the first row, and iterates over rows top to bottom selecting a new row every time the VU requests data.

  • Sequential+Once: each VU always uses data from the first data row. No other data rows would be used.

  • Random+Each iteration: each VU selects a random row for each iteration; each call to table.selectRow() returns the same row for the duration of a full VU iteration.

  • Random+Each access: each VU selects a random row each time a call to table.selectRow() is made.

  • Random+Once: each VU selects a random row once in the first call to table.selectRow() and any subsequent call to table.selectRow() selects the same row for all VU iterations.

  • Unique+Each iteration: each VU selects a unique (previously unused) row for each iteration; each call to table.selectRow() returns the same row for the duration of a full VU iteration. When out of rows, the whenOutOfRows option tells us what to do.

  • Unique+Each access: each VU selects a unique (previously unused) row each time a call to table.selectRow() is made. When out of rows, the whenOutOfRows option tells us what to do.

  • Unique+Once: each VU selects a unique (previously unused) row once in the first call to table.selectRow() and any subsequent call to table.selectRow() selects the same row for all VU iterations.

@na-- (Member Author) commented May 19, 2021

We should strive to implement it by separating the concerns of loading, parsing and accessing data.

The flow of data is as follows, composed of different parts that can be swapped out to meet the required properties:

  1. Load data from "Source" (file, string, HTTP response in setup() function, etc.)
  2. Parse data into a SharedTable (a SharedArray where columns can be referred to by name and access to rows is proxied through a method to make sure row selection rules are followed)
  3. Access data in table according to desired "Consumption Pattern"

I agree with this, but I'm also very confused... 😅 The proposed SharedTable seems to heavily mix all three of these concerns - it has data loading, processing and access patterns all in the same object... 😕 At the same time, it doesn't address streaming data or allow simple data structures (e.g. a plain JS array) to be segmented.

Moreover, in terms of the first 2 parts, it seems like it's duplicating SharedArray's already existing functionality without adding anything extra. Despite its name, SharedArray only requires the top-level data structure to be an array; its actual array elements can be anything JSON supports. This code currently works:

import { SharedArray } from 'k6/data';

const data = new SharedArray('some name', function () {
  return [
    ['we can have', 'arrays here'],
    { 'but': 'this is an object' },
    'we can have anything, as long as',
    'the top level is an array and elements are JSON-encodable',
    42
  ];
});

export default () => {
  console.log(data[1]['but']);
  console.log(JSON.stringify(data, null, 4));
}

So, if we want a clean JS API with separation of concerns (i.e. a composable API instead of a mega-object that does everything), and if we want to support these use cases:

  • Allow (pseudo-)sequential access to data rows
  • Allow random access to data rows
  • Allow data rows to be used once and once only
  • Allow sticky data rows: the same data row, pulled randomly from the set, should be returned when requested again by the same VU

It seems to me that the only missing piece in k6 currently is some sort of iterators or generators to facilitate these data access patterns? We already have the data storage (SharedArray or plain JS arrays), and iterators/generators should work on streaming data structures as well as on static ones. And if we have these base building blocks, we can then compose them into higher-level and more user-friendly data structures like your SharedTable in various ways, purely in JS.

Something from the "k6 responsibilities" section also seems wrong to me:

  • Provide user with API to bail/abort a VU if it runs out of data rows

"Aborting a VU" is far from a simple thing - it doesn't really make sense in all executor types. It seems reasonably simple to conceptualize and maybe even implement in the shared-iterations, per-vu-iterations and constant-vus executors. But what about ramping-vus - if a VU has been "bailed" from because it ran out of data, but we later also ramp-down "below" it, it gets returned back to the global VU pool. If we then ramp-up and use it again, is it still aborted? How would that work?

Aborting VUs in the arrival-rate executors makes even less sense, since VUs are not the main thing there, the iteration pacing and iteration numbers are. VUs are simply workers, the substrate iterations are executed on at the specified arrival rate. Aborting a VU doesn't make any sense, it will just reduce the pool of workers, it won't stop the iteration pacing... 😕

In general, tying data segmentation too closely to VUs isn't the best solution. It should be possible to do, when that makes sense, but always doing it will just repeat some of the same problems that relying on __VU currently has, just one level down.

  1. Update row: Controls when VU should update the data row its consuming data from. Again there are three options:
  • Each iteration: update the data row selected by VU at the start of each iteration
  • Each access: update the data row selected by VU on each access to the data row
  • Once: update the data row only once

Do we really need this? Maybe in some high-level wrapper, but "once" and "each iteration" can be boiled down to "each access", where we "access" the iterator only once, or only at the start of the iteration, and then cache the result.
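For illustration, a small sketch of how both could be expressed in user code on top of a plain "each access" iterator; iter stands for whatever (segmented) iterator k6 ends up exposing:

// "once": access the iterator a single time, in the init context,
// and reuse the cached value in every iteration
const onceRow = iter.next().value;

export default function () {
  // "each iteration": access the iterator once at the start of the
  // iteration and reuse the cached value for the rest of it
  const iterationRow = iter.next().value;

  // every use below sees the same row for this iteration
  // ... make requests with iterationRow (or onceRow)
}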

When out of rows

I mentioned above why "Abort VU" is not always possible, while "Continue in cyclical selection pattern" and "Continue with last row" can probably be implemented as simple JS wrappers around the generic (segmented or not) iterators (e.g. modulo division and by just caching the last value and returning it when we run out).
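As a sketch, both fallbacks are a few lines of JS around such a generic iterator (the helper names here are made up for the example):

// "continue in cyclical selection pattern": wrap around a fixed-size
// array with modulo division
function cyclicRow(data, index) {
  return data[index % data.length];
}

// "continue with last row": cache the last value and keep returning it
// once the underlying (segmented/unique) iterator is exhausted
function withLastRowFallback(iter) {
  let last;
  return {
    next() {
      const res = iter.next();
      if (!res.done) {
        last = res.value;
        return res;
      }
      return { value: last, done: false };
    },
  };
}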

Segmentation

...

  • Number: user decides how many rows each VU should have access to (I think deciding per VU would make most sense from a user perspective, rather than per load gen machine which is something the user might be completely abstracted away from, say in k6 Cloud)

Again, same problems with tying this too closely with VUs - it doesn't really work well with arrival-rate or ramping-vus executors.

The combination of these parameters defines what data rows (from the SharedTable) are to be selected by each VU for each iteration and each access/row selection (within an iteration):

  • Sequential+Each iteration: each VU starts from the top of the SharedTable, the first row, and iterates over rows top to bottom selecting a new row at the start of each iteration.
  • Sequential+Each access: each VU starts from the top of the SharedTable, the first row, and iterates over rows top to bottom selecting a new row every time the VU requests data.
  • Sequential+Once: each VU always uses data from the first data row. No other data rows would be used.
  • Random+Each iteration: each VU selects a random row for each iteration; each call to table.selectRow() returns the same row for the duration of a full VU iteration.
  • Random+Each access: each VU selects a random row each time a call to table.selectRow() is made.
  • Random+Once: each VU selects a random row once in the first call to table.selectRow() and any subsequent call to table.selectRow() selects the same row for all VU iterations.
  • Unique+Each iteration: each VU selects a unique (previously unused) row for each iteration; each call to table.selectRow() returns the same row for the duration of a full VU iteration. When out of rows, the whenOutOfRows option tells us what to do.
  • Unique+Each access: each VU selects a unique (previously unused) row each time a call to table.selectRow() is made. When out of rows, the whenOutOfRows option tells us what to do.
  • Unique+Once: each VU selects a unique (previously unused) row once in the first call to table.selectRow() and any subsequent call to table.selectRow() selects the same row for all VU iterations.

Besides the problems of tying these use cases too closely to VUs I already mentioned above, my other problem is that not all of these combinations make sense. For example, in what situation would someone use Sequential+Once?

And again, baking both Each iteration and Once into the k6 core code, when they can be easily achieved through Each access and a variable, seems a bit unnecessary. I am doubtful they will even be very useful in a high-level wrapper API, i.e. they will obscure details and add more confusion than they will bring usability. But even if they are very useful, we probably should not implement them in the low-level k6 API, which the initial MVP version of this feature definitely should be, but in a JS wrapper.

So far I've only disagreed with the SharedTable proposal above, I'll later write up a proposal for a potential MVP version of this in a separate comment here.

@na-- (Member Author) commented May 19, 2021

To get back to basics, let's start with an example.

Say that we have a list of 5 elements: data = [E0, E1, E2, E3, E4]. This can be a simple JS array, or a SharedArray, or a CSV file, or anything else with integer indexes - it doesn't really matter, we'll just use the fact that we have 5 elements, indexed from 0 to 4.

Say that we also have 2 separate "actors", A and B, that want to consume elements of this data array. These "actors" can also be pretty much anything, for example:

  1. different VUs in a constant-vus scenario
  2. different iterations in a shared-iterations scenario (VUs might not be the leading consideration here, but rather we could consider each iteration number as the thing that determines the "actor" ID)
  3. iterations of an arrival-rate executor in different k6 instances - similarly, the iteration number is the leading identifier here
  4. even different parts of the same iteration in a VU that require different data (e.g. for an http.batch() request) can be considered different actors 🤷‍♂️

So yeah, I'll use "actors" instead of VUs. Based on the user's specific use case, these "actors" might need new data elements once per VU or multiple times per iteration, or anything in between, and we need to be flexible enough to support all of these use cases. The updateRow concept suggested above seems reasonable at first glance, but I think the better way to do this is to expose as much information about the execution to users as reasonably possible (i.e. #1320) and otherwise make it possible to manually control these "need to get a new element" decisions with plain JS code.

The other consideration is the pattern in which A and B get new elements from the data array, i.e. the rowSelection suggested above:

  1. The simplest one is random. I'd say that k6 already fully supports that pattern with Math.random() and simple helpers like randomIntBetween() and randomItem().

  2. sequential is a bit more complex:

    • for executors like constant-vus, ramping-vus and per-vu-iterations, where each VU can save its state (i.e. data iterator) in a local variable between iterations, it's easy and already supported.
    • for executors like shared-iterations, constant-arrival-rate and ramping-arrival-rate, it's a bit more complicated, since we don't have a way to save or pass the iterator state between different VUs, but once we have the execution information API (Improve execution information in scripts #1320 / Core changes for k6/x/execution JS module #1863) and k6 can answer the question "what is the current iteration number in the whole scenario", we can use that for the counter and also satisfy this use case without any other k6 changes.

    And, of course, when we have a fixed-size array of data, we can easily wrap around with modulo division and reuse elements, if we need to (see the sketch after this list).

  3. unique is the biggest missing piece right now. In some situations it might be achieved by simply reusing the information from the upcoming k6/execution JS API I linked to above (e.g. using the scenario iteration number as an array index), but not always (e.g. to segment things based on the VUs in an executor).
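A sketch of the sequential case for the cross-VU executors, assuming the scenario-wide iteration counter from the execution information API, with modulo division for the wrap-around mentioned in point 2:

import exec from 'k6/execution';
import { SharedArray } from 'k6/data';

export const options = {
  scenarios: {
    rate: { executor: 'constant-arrival-rate', rate: 10, duration: '1m', preAllocatedVUs: 20 },
  },
};

const data = new SharedArray('data', function () {
  return JSON.parse(open('./data.json'));
});

export default function () {
  // sequential across the whole scenario, regardless of which VU runs the
  // iteration; wraps around when we run out of elements
  const element = data[exec.scenario.iterationInTest % data.length];
  // ... use element
}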

To explore the unique data selection pattern, say that we want to segment the 5 data elements [E0, E1, E2, E3, E4] as equally as possible between the two "actors" A and B, with reuse. What are all of the possible ways to do that? Turns out, a lot:

  1. chunked, sticky (no actor ever uses the same element as another actor):
    i Element Actor
    0 E0 A
    1 E1 A
    2 E2 A
    3 E3 B
    4 E4 B
    5 E0 A
    6 E1 A
    7 E2 A
    8 E3 B
    9 E4 B
    10 E0 A
    ... ... ...
  2. chunked, non sticky:
    i Element Actor
    0 E0 A
    1 E1 A
    2 E2 A
    3 E3 B
    4 E4 B
    5 E0 B
    6 E1 B
    7 E2 B
    8 E3 A
    9 E4 A
    10 E0 A
    ... ... ...
  3. striped (interleaved), sticky:
    i Element Actor
    0 E0 A
    1 E1 B
    2 E2 A
    3 E3 B
    4 E4 A
    5 E0 A
    6 E1 B
    7 E2 A
    8 E3 B
    9 E4 A
    10 E0 A
    ... ... ...
  4. striped, non sticky:
    i Element Actor
    0 E0 A
    1 E1 B
    2 E2 A
    3 E3 B
    4 E4 A
    5 E0 B
    6 E1 A
    7 E2 B
    8 E3 A
    9 E4 B
    10 E0 A
    ... ... ...

My suggestion for an MVP version of this feature is to:

  1. Expose some of the execution segment and segment sequence logic to the JS code. For now, maybe simply exposing the current k6 ES and ESS in a read-only way.
  2. Add JS iterators/generators that can use them (see the sketch after this list). For example, a striped non-sticky iterator for actor A (e.g. instance 1) will return values [0, 2, 4, 6, 8, 10, ...], while the same iterator for actor B (e.g. instance 2) will return values [1, 3, 5, 7, 9, 11, ...]. These iterators will be thread-safe (so usable cross-VUs and in pretty much any executor type), so they should be able to satisfy a wide range of requirements.
  3. (potentially future step) Create a JS wrapper around 2. and SharedArray that does some of the things @robingustafsson proposed with SharedTable
  4. (definitely future step) Maybe allow the creation of custom sub-segments (e.g. based on VUs in an executor)
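A purely illustrative sketch of point 2 for the simplest case of equal segments; nothing here is a real k6 API, and an actual implementation would derive the stripes from the execution segment sequence and be safe to share across VUs:

// hypothetical helper: `actor` is this instance's position in the segment
// sequence, `actors` is the total number of (equal) segments
function stripedIndexes(actor, actors) {
  let i = 0;
  return {
    next() {
      const value = actor + i * actors;
      i++;
      return { value, done: false };
    },
  };
}

const itA = stripedIndexes(0, 2); // actor A: 0, 2, 4, 6, 8, 10, ...
const itB = stripedIndexes(1, 2); // actor B: 1, 3, 5, 7, 9, 11, ...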

I'm probably missing something important here, but I think we should start as simple as possible (and iterators are pretty simple) and try to satisfy as many use cases as possible with as little code as possible, even if that simple code requires a lot of JS boilerplate initially. We can always simplify the boilerplate later with JS or Go wrappers, as long as the foundation is sound.

@robingustafsson (Member) commented:

I agree with this, but I'm also very confused... 😅 The proposed SharedTable seems to heavily mix all three of these concerns - it has data loading, processing and access patterns all in the same object... 😕 At the same time doesn't address streaming data...

I purposefully didn't want to touch the topics of loading, processing or streaming of data. I see those as separate topics. I wanted to focus on how to control how data is consumed during a test, as that is a frequent topic when we talk to users and customers. That the data is loaded and parsed naively, i.e. not streamed, memory-mapped or whatever, is fine initially IMO.

...or allow for simple data structures (e.g. a plain JS array) to be segmented.

The proposal does certainly allow for simple data structures to be consumed according to the specified patterns - not by segmenting the data per se between load gen nodes, but by controlling the actual access pattern, which is the more interesting of the two from a user perspective, I'd say.

Moreover, in terms of the first 2 parts, it seems like it's duplicating SharedArray's already existing functionality without adding anything extra.

Yes, it's on purpose very close to SharedArray and it might even make sense to make them one and the same. The biggest difference is the enforcement of column names and the addition of options, specifically options around how the data should be consumed (and restricting the consumption via the selectRow() API and not allowing indexing).

It seems to me that the only missing piece from k6 currently are some sort of iterators or generators to facilitate these data access patterns?

My proposal actually started out based on following the JS Iterable and Iterator protocols [1], but what I could come up with always felt like the wrong abstraction level (too low-level) for end-users, hence the almighty SharedTable 🙂

Something from the "k6 responsibilities" section also seems wrong to me:

Provide user with API to bail/abort a VU if it runs out of data rows
"Aborting a VU" is far from a simple thing - it doesn't really make sense in all executor types...In general, tying data segmentation too closely to VUs isn't the best solution. It should be possible to do, when that makes sense, but always doing it will just repeat some of the same problems relying on __VU currently has, just one level down.

This is a really good point. I do think giving the user control over how to handle the "we're out of data" situation is important, but you're completely right that we shouldn't tie it to a VU per se. It should probably be thought of more as "should more iterations be run in this k6 process when we run out of data?", and whether that would involve spawning more VUs or continuing with existing VUs is irrelevant.

  1. Update row: Controls when VU should update the data row its consuming data from. Again there are three options: Each iteration, Each access and Once
    Do we really need this? Maybe in some high-level wrapper, but "once" and "each iteration" can be boiled down to "each access" where we "access" the iterator only once, or only at the start of the iteration, and then we cache the result.

From a user perspective it's needed, yes, but at the first abstraction level of whatever API we agree on, maybe not (we can probably implement it as you point out).

Whatever the MVP of this feature will be, the important thing IMO is that the API is at an abstraction level that's useful to users, so I think having an API on a similar abstraction level as proposed with SharedTable is needed. If it's done in pure JS on top of a more bare-bones iterators foundation, that's not so important from a user perspective in the short term.

As I struggled to come up with an iterators-based proposal, I'd love to see what we can come up with that would allow us to build a SharedTable-like API as a higher-level abstraction. I had something as follows at one point (with the ramping-vus executor in mind, as that is by far the most common one), but again I think a higher-level API would be more helpful to most users:

Sequential+Each iteration

let data = new SharedArray("some name", function() {
    ... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let iter = data[Symbol.iterator](); // Extending SharedArray with Iterable and/or Iterator protocol support

export default function() {
    let row = iter.next().value; // Each iteration a new row will be used
    ...
}

Sequential+Each access

let data = new SharedArray("some name", function() {
    ... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let iter = data[Symbol.iterator]();

export default function() {
    // Call `iter.next()` each time a new row is needed
    let row = iter.next().value;
    ...
    row = iter.next().value;
    ...
}

Sequential+Once

let data = new SharedArray("some name", function() {
    ... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let row = data[0]; // First value (and same for last value or whatever index)

export default function() {
    // Use `row` throughout test, it will be the same every time
}

Random+Each iteration

import { randomItem } from "https://jslib.k6.io/k6-utils/1.0.0/index.js";

let data = new SharedArray("some name", function() {
    ... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});

export default function() {
    let row = randomItem(data); // Each iteration a new random row will be used
    ...
}

Random+Each access

import { randomItem } from "https://jslib.k6.io/k6-utils/1.0.0/index.js";

let data = new SharedArray("some name", function() {
    ... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});

export default function() {
    // Call `randomItem(data)` each time a new random row is needed
    let row = randomItem(data);
    ...
    row = randomItem(data);
    ...
}

Random+Once

import { randomItem } from "https://jslib.k6.io/k6-utils/1.0.0/index.js";

let data = new SharedArray("some name", function() {
    ... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});

let row = randomItem(data); // A random row, picked once per VU in the init context

export default function() {
    // Use `row` throughout test, it will be the same every time
}

Unique+Each iteration

let data = new SharedArray("some name", function() {
    ... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let iter = UniqueIterator.from(data); // Implementing the Iterable and Iterator protocols [1]

export default function() {
    let obj = iter.next(); // Each iteration a new unique row will be used
    if (obj.done) {
        // Handle accordingly, eg. `abortFurtherIterations()`, `abortTest()` and similar future APIs.
    }
    let row = obj.value;
    ...
}

Unique+Each access

let data = new SharedArray("some name", function() {
    ... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let iter = UniqueIterator.from(data); // Implementing the Iterable and Iterator protocols [1]

export default function() {
    // Call `iter.next()` each time a new row is needed
    let row = iter.next().value;
    ...
    row = iter.next().value;
    ...
    
}

Unique+Once

let data = new SharedArray("some name", function() {
    ... // Data is segmented here according to scope (test, scenario, VU or iteration) using execution segments and info?
});
let iter = UniqueIterator.from(data); // Implementing the Iterable and Iterator protocols [1]

let row = iter.next().value; // Unique across VUs

export default function() {
    let row = iter.next().value; // Unique across iterations
}

With this kind of API we still need to figure out how to segment data according to different scopes like test, scenario, VU or iteration, so that the iterators only need to concern themselves with iterating over the given slice of data according to their specific access pattern. I suppose as long as each k6 process (i.e. load gen) in a test has all the data, it can then segment it according to the user's desired scoping (test, scenario, VU or iteration) by using execution segments and the new k6/execution module, as you say. Something like this:

import exec from 'k6/execution';

let data = new SharedArray("some name", function() {
    ... // All data is always loaded here
});

export default function() {
    const scenarioStats = exec.getScenarioStats();
    let iter = UniqueIterator.from(
        // Would calculate the appropriate segment of data to use based on execution segment for the current load gen + scenario and scenario iteration offset.
        DataSegment.from(data, {
            scope: 'scenario',
            offset: scenarioStats.iteration
        })
    );
    let obj = iter.next(); // Would give a scenario-level unique row per iteration (if only one row is consumed per iteration)
    if (obj.done) {
        ... // Handle when-out-of-rows case
    }
}

...but this feels very low-level. A tangent, but how do the k6/execution APIs work across load gens and load zones in the cloud? What is the correct value for getScenarioStats().iteration - is it the process-local scenario iteration or the test-wide scenario iteration? It looks like the former from quickly skimming #1863, but I wasn't sure before looking at the source.

[1] - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Iteration_protocols

@imiric (Contributor) commented May 25, 2021

I won't comment on the on-topic discussion here as it will take me a few hours to dig into and properly respond 😅, but to address the last question by @robingustafsson: there are several counters introduced in #1863, one of which is the iterationGlobal value that takes into account the configured execution segment and returns the iteration number for a scenario across instances. The instance-local scenario iteration is returned as the iteration value from getScenarioStats(). (This needs a lot of documentation to clearly explain to users what each one means.)
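In script terms, a sketch of how that distinction might look, assuming both counters are returned by getScenarioStats() as described in #1863 (the eventual public API may differ):

import exec from 'k6/execution';

export default function () {
  const stats = exec.getScenarioStats();
  // iteration number of this scenario, local to the current k6 instance
  console.log(stats.iteration);
  // iteration number of this scenario across all instances, derived from
  // the configured execution segment
  console.log(stats.iterationGlobal);
}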

@robingustafsson (Member) commented:

@imiric Ah, great, and based on process-local information (i.e. execution segments) by the looks of it 👍

@codebien (Collaborator) commented:

I mostly collected a list of use cases to better understand the demand for the various parts of the feature. I also tried to draft a basic proposal that I think is far from perfect, but it has the advantage of putting on the table a lot of limits that we can consider as a checklist for any solution we pick.

Use Cases

Use Case | Summary | Scope / Counters supported
https://community.k6.io/t/how-to-distribute-vus-across-different-scenarios-with-k6/49/11 | One item per VU, but with a local per Scenario counter. | Per VU?
https://community.k6.io/t/unique-test-data-per-vu-without-reserving-data-upfront/1136/3 | One unique item per VU. Not cyclical. | Per VU
https://community.k6.io/t/data-parameterisation-with-unique-index/2128 | Random? and unique item per Iteration. Not cyclical. | Per Iteration
https://community.k6.io/t/when-parameterizing-data-how-do-i-not-use-the-same-data-more-than-once-in-a-test/42 | One unique item per Iteration (or one item per VU). | Per Iteration/VU
https://community.k6.io/t/unique-test-data-per-vu-without-reserving-data-upfront/1136/5 | One item per VU, where items < VUs. Cyclical. | Per VU
grafana/k6-operator#64 (chunks) | Chunks of items per VU. | Per VU
https://community.k6.io/t/how-to-load-json-from-a-file-per-vu-iteration/2762 | Reading a file without enough memory available for loading it in one shot. (Streaming API?) | -
https://community.k6.io/t/shared-state-or-unique-sequential-vu-index-per-scenario/1156 | One item per VU, but with a local per Scenario counter. | Per VU?
https://community.k6.io/t/share-data-between-two-scenarios-in-k6/1482 | One item per iteration, across scenarios. | Per Test
https://community.k6.io/t/when-parameterizing-data-how-do-i-not-use-the-same-data-more-than-once-in-a-test/42/17 | One item per iteration, across scenarios. | Per Test
https://community.k6.io/t/how-to-load-a-csv-file/251/3 | One item per VU. | Per VU
https://community.k6.io/t/how-to-query-unique-request-by-graphql/2794 | One unique item per VU. | Per VU

According to the previous table, the most requested feature is an incremental index per VU or Iteration. Strictly sequential, random and Once access have lower demand. This is to be expected, considering the work done for the Execution API and the old context variables __VU and __ITER.

Per-VU/Iteration access with a sequence that isn't strictly incremented by one is already supported by the system with the introduction of the Execution API. This means we could already cover the most requested cases by designing the consumption and/or iterators API based on the values returned by scenario.iterationInTest and vu.idInTest, or their equivalents in the Go code.
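A minimal sketch of what designing on top of those two counters could look like in practice, using the k6/execution properties named above (./data.json is just a placeholder source):

import exec from 'k6/execution';
import { SharedArray } from 'k6/data';

const data = new SharedArray('data', function () {
  return JSON.parse(open('./data.json'));
});

export default function () {
  // per-Iteration scope: unique across instances, though not strictly
  // sequential in time
  const perIteration = data[exec.scenario.iterationInTest % data.length];

  // per-VU scope: each VU keeps coming back to "its" item
  const perVU = data[(exec.vu.idInTest - 1) % data.length];

  // ... use whichever scope fits the use case
}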

My knowledge of the ecosystem is, at the moment, mostly based on the forum and the repo's issues, so I might have a partial and/or wrong view of our requirements.

Basic Proposal

For the first iteration, the proposal is to write Table and UniqueIterator types to allow easier access to SharedArray-like data structures and any Iterables, combined with the Execution API. It would support an eventual move to a segmented index without breaking changes.

import { Table } from './table.js'
import { UniqueIterator } from 'k6/data'

// Iterator should be optional and if it isn't provided,
// the Iterable's Iterator (builtin or custom) will be used.
var iter = new UniqueIterator()
var table = new Table(
  ['first name', 'last name', 'age'],
  [
    ['Joe0', 'Doe0', '26'],
    ['Joe1', 'Doe1', '27'],
    ['Joe2', 'Doe2', '28']
  ],
iter)

// If init context then per VU
//let user = table.next()

export default function() {
  // if not init context then per Iteration
  let user = table.next()
  console.log(`Name: ${user['first name']}, Surname: ${user['last name']}`)
}

Table API

The Table API should be responsible for resolving rows from the Source by getting an index from the Iterator; it would then map each row into an Object/Map, associating its values with the corresponding headers.
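A rough sketch of that responsibility, assuming the Iterator yields row indexes into the Source (everything here is illustrative, not a settled API):

// illustrative only: resolve an index from the iterator, pick the raw row
// from the source, and map its values onto the configured headers
class Table {
  constructor(columns, rows, iter) {
    this.columns = columns;
    this.rows = rows;
    this.iter = iter;
  }

  next() {
    const idx = this.iter.next().value;
    const raw = this.rows[idx % this.rows.length];
    const mapped = {};
    this.columns.forEach((col, i) => { mapped[col] = raw[i]; });
    return mapped;
  }
}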

Global and scoped Iterator

Code outside of it is called "init code", and is run only once per VU.

The idea was inspired by this doc sentence: the concept is to pick the right counter based on the scope of the caller. If the next function is invoked from the init context, a per-VU index value should be returned; if the call comes from the iteration's function, a per-iteration value should be used instead.

It has the downside of returning uncorrelated values if the next function is invoked from different scopes in the same test, and it would also require making the init context aware of VUIDGlobal.

The previous idea doesn't cover an eventual per-Test iterator. I think we could use a global default instance of UniqueIterator for that (similar to the concept of HTTP's DefaultClient in Go).

Out of rows

I think the API should default to cyclical selection and either fire a callback or expose a boolean API for checking whether the iterator is out of rows. That way, the user is free to apply any logic they want when the out-of-rows state is hit.

export default function() {
  // true if the latest item has been returned
  if (table.isOutOfRows()) {
    console.log('Out of rows')
    return; // or exec.test.abort() if it's unexpected
  } else {
    let user = table.next()
    console.log(`Name: ${user['name']}, Surname: ${user['surname']}`)
  }
}

Once

Done this way, the init context couldn't be used for the Once case, so a workaround like the following would have to be applied:

var firstRow // or last

export default function() {
  // fetch the row only on the first iteration, then keep reusing it
  if (firstRow === undefined) {
    firstRow = table.next()
  }
  console.log(`Name: ${firstRow['first name']}, Surname: ${firstRow['last name']}`)
}

Streaming API

I haven't really covered this yet, but the feeling is that the counter-based solution could be used to achieve incremental reading of a stream. Ideally, the API should read the stream until it has enough data to return a value for the requested index.

Counters' limits

Each Access

Each Access is not supported by global counters because they can't be incremented on demand; they follow the test's life-cycle, so if the next() function is called multiple times from the same context it always returns the same index value.

Chunks

As noted in the previous comment by @na--, and mostly for the same reason as Each Access, the counters don't support sequential chunks of data distributed across the Actors.

VUIDGlobal

VUIDGlobal is not accessible from the init context (maybe we could make it available?). It is, of course, accessible from the iteration's function, so if we can't access the counter from the init context, the iteration's function has to be the alternative for using global counters.

Reset Index

Resetting the index to start over is not supported with global counters (it could be fixed for local execution, but I don't think it can be for distributed cases).

Test Scope

We don't have a counter that goes across scenarios, so maintaining the sequence across scenarios wouldn't be supported. Fixing it would require an additional counter.

Perfect Sequence

However, while every instance will get non-overlapping index values in cloud/distributed tests, they might iterate over them at different speeds, so the values won't be sequential across them.

As reported in the docs, the global counters cannot guarantee a perfect sequence, which creates unexpected holes.

Open questions

  • What do we mean by Unique? Is it that an item is never used at the same time by multiple Actors, or that within the entire scope it is used only once?
  • Do we have more concrete use-cases covering random, once, each access?
  • Do some of the previous concepts collide again with the "too closely with VUs" anti-pattern?
  • Should the Table API return a row as an Object or a Map? I think it depends on whether we want Map's get method and whether we prefer to have a direct Iterable.
  • Is there (and does it make sense to have) a difference between Iteration and Scenario scopes in this context?

na-- modified the milestones: v0.37.0, v0.38.0 Mar 2, 2022
@mstoykov (Collaborator) commented:

Global and scoped Iterator

I did not understand what the idea is here, sorry. Can you expand on it, possibly with a script sample and some comments?

Out of rows

is the idea here that if that check isn't used and next() is called it will:

  1. cycle and start from the beginning, and the check will then:
    1. return false again
    2. keep returning true though
  2. return null/throw exception

Once

I am even more confused by this example

Test Scope

We don't have a counter that goes across scenarios, maintain the sequence across scenarios then it wouldn't be supported. Fixing it would require an additional counter.

This in fact is not possible (or at least will require a lot of synchronization between k6 instances) if you have a VU-based scenario - ramping-vus/constant-vus can do different numbers of iterations each time they're run, and they will likely do different numbers on separate instances. This is unlike all the others, which are iteration-based (except the externally-controlled one, which we just ignore ;)) - for them we can calculate how many iterations they should make. This still doesn't mean that they will make all of those iterations: arrival-rate can drop iterations because there's no free VU to take them, and all the others can run out of time. And on top of that, there is nothing stopping an iteration from throwing an exception halfway through - before actually accessing or doing anything with "its" data.

Perfect Sequence

However, while every instance will get non-overlapping index values in
cloud/distributed tests, they might iterate over them at different speeds, so the values won't be sequential across them.

As reported in the docs, the global counters could not respect a perfect sequence creating unexpected holes.

As mentioned above, this has even more problems. In practice it will require that each "getting" of an item is synchronized. But I will also argue that this case is really... not possible in the strictest sense of the word. The problems are that:

  1. just because you got a value doesn't mean you will actually use it - exceptions are still possible
  2. just because one VU on one instance got a value before another VU (on another instance, for example), it doesn't mean it will use it before the other one does.

So, at least for me, this use case is very badly defined, and if we are okay with the two types of holes it creates, I guess we are fine with using global IDs as well :). If not, this basically requires a multi-instance database of some sort that gives you the next item(s) - arguably something that can be done by running redis (or something else) and making requests to it. It could eventually be better integrated with k6, but it's probably better to start with some JS script helpers and a project showcasing how to use them.
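A sketch of that last option in plain script code - an entirely hypothetical external coordination service that atomically hands out the next unused row over HTTP:

import http from 'k6/http';

export default function () {
  // hypothetical endpoint: each call reserves and returns the next unused row
  const res = http.post('http://data-coordinator.local/next-row');
  if (res.status === 204) {
    // out of rows - skip this iteration's data-driven work
    return;
  }
  const row = res.json();
  // ... use row in the actual requests
}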

Next steps (IMO):

All in all, I think we should really just make how-to guides for all the cases currently supported, possibly building some helper functions around them to test out the APIs. Some of the "workarounds" for unsupported cases should probably also have full examples so they can be better evaluated.

Some of those are already in comments; we just need to add them to the documentation IMO.

Some prior code by me that definitely needs more work but can be used as an idea.

sniku changed the title from Data segmentation API framework to Data segmentation API framework - docs only Apr 13, 2022
na-- modified the milestones: v0.38.0, v0.39.0 Apr 26, 2022
codebien modified the milestones: v0.39.0, v0.40.0 Jun 15, 2022
na-- changed the title from Data segmentation API framework - docs only to Data segmentation API framework Aug 12, 2022
na-- modified the milestones: v0.40.0, v0.42.0 Aug 19, 2022
codebien removed their assignment Oct 20, 2022
mstoykov modified the milestones: v0.42.0, TBD Nov 9, 2022
codebien removed this from the TBD milestone Sep 27, 2023