ALTO: Integrate CERN data model with IETF ALTO format #37

Open
jacobdunefsky opened this issue Apr 1, 2022 · 3 comments
@jacobdunefsky

Get access to CERN data and write a script to transform it into a format that can be processed by our hypothetical ML model.

@giralt assigned giralt and jacobdunefsky and unassigned giralt on Apr 1, 2022
@jacobdunefsky changed the title from "ALTO: Integrate CERN data" to "ALTO: Integrate CERN data model with IETF ALTO format" on Apr 14, 2022
@jacobdunefsky

After meeting with Mario, we have a better understanding of the application-level data format used by Rucio with CERN. The next step is to design a JSON schema that both:

  1. captures the same information as the current Rucio schema
  2. adheres to the same format/structure as the rest of the IETF ALTO standard

@jacobdunefsky

The attached files provide a view of my current thoughts re: a new data model. The file "rucio-non-alto.json" is an example of Rucio's current data format; the file "alto-rucio.json" represents the same data under the proposed new format. The idea is that the latter file would be what is returned by an ALTO server.

The new format is based on RFC 8189. The main new feature beyond that is the params-tuple ordered dict. params-tuple specifies the dimensions of the multidimensional array in which the datapoints will be returned. For instance, in the example, consider

"cost-metric": "queued",
"params-tuple": [
	{"stage": ["Production Input", "Production Output", "total"]},
	{"unit": ["bytes", "files"]}
]

If we pass this value of params-tuple, then the data will be returned as a multidimensional array, where the first dimension represents the stage that we are looking at (e.g. the amount queued at stage Production Input), and the second dimension represents the unit of the data. Thus, an example response (3 stages by 2 units) might be

[ [4597740396, null], [null, 1], [4597740396, 1] ]
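
To make the indexing concrete, here is a minimal Python sketch of how a client might look up one datapoint by label. The helper name `lookup` is hypothetical and not part of any ALTO library; it assumes each params-tuple entry is a single-key object mapping a dimension name to a list of labels, as in the example above.

```python
def lookup(params_tuple, values, **labels):
    """Index a nested response array by dimension labels (hypothetical helper)."""
    node = values
    for entry in params_tuple:
        # Each params-tuple entry is assumed to hold exactly one key.
        (name, choices), = entry.items()
        # The position of the label in the list is the index along this dimension.
        node = node[choices.index(labels[name])]
    return node


params_tuple = [
    {"stage": ["Production Input", "Production Output", "total"]},
    {"unit": ["bytes", "files"]},
]
response = [[4597740396, None], [None, 1], [4597740396, 1]]

print(lookup(params_tuple, response, stage="total", unit="bytes"))             # 4597740396
print(lookup(params_tuple, response, stage="Production Input", unit="files"))  # None
```

The order of the entries in params-tuple is what fixes the nesting order of the response, which is why it has to be an ordered structure.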

Additionally, note that some of the dimensions of params-tuple can have cardinality 1. This is useful for specifying constraints on the data. As an example, consider

"cost-metric": "throughput",
"params-tuple": [
	{"percentile": 95},
	{"unit": "mbps"},
	{"measurement-interval": ["1h", "1d", "1w"]}
]

Note that the first two dimensions are associated with scalars rather than arrays. As a result, the output is constrained to the unit "mbps" and to measurements based on the 95th percentile of the data, but no new dimension is added to the output array:

[ 1.9, 0.5, 10.12 ]
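
As a rough sketch of this rule, the following Python separates the array-valued entries (which contribute a dimension) from the scalar-valued ones (which act only as constraints). The function name `split_params` is made up for illustration, and the sketch assumes one key per params-tuple entry.

```python
def split_params(params_tuple):
    """Return (output shape, scalar constraints) for a params-tuple (hypothetical helper)."""
    shape, constraints = [], {}
    for entry in params_tuple:
        (name, value), = entry.items()
        if isinstance(value, list):
            # A list of labels adds one dimension of that length.
            shape.append(len(value))
        else:
            # A scalar constrains the data without adding a dimension.
            constraints[name] = value
    return tuple(shape), constraints


throughput = [
    {"percentile": 95},
    {"unit": "mbps"},
    {"measurement-interval": ["1h", "1d", "1w"]},
]
queued = [
    {"stage": ["Production Input", "Production Output", "total"]},
    {"unit": ["bytes", "files"]},
]

print(split_params(throughput))  # ((3,), {'percentile': 95, 'unit': 'mbps'})
print(split_params(queued))      # ((3, 2), {})
```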

I hope that this format makes sense and is logical. The next step would be to figure out how to elegantly include timestamp information.

@jacobdunefsky

Apologies; I uploaded an outdated alto-rucio.json that doesn't correspond to rucio-non-alto.json. The correct file is as follows (GitHub won't seem to let me upload it):

{
	"meta" : {
		"multi-cost-types" : [
			{
				"cost-mode": "numerical",
				"cost-metric": "queued",

				"params-tuple": [
					{"stage": ["Production Input", "Production Output", "total"]},
					{"unit": ["bytes", "files"]}
				]
			},
			{
				"cost-mode": "numerical",
				"cost-metric": "throughput",

				"params-tuple": [
					{"percentile": 95},
					{"unit": "mbps"},
					{"measurement-interval": ["1h", "1d", "1w"]}
				]
			},
			{
				"cost-mode": "numerical",
				"cost-metric": "closeness"
			}
		]
	},

	"ipv4:192.0.2.2": {
		"ipv4:192.0.2.89": [
			[ [4597740396, null], [null, 1], [4597740396, 1] ],
			[ 1.9, 0.5, 10.12 ],
			2
		],
		"ipv4:192.0.2.43": [
			[ [null, 2], [1192314951, 1], [1192314951, 3] ],
			[ 0.4, null, 8.2 ],
			3
		]
	}
}
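
For illustration, a consumer of this file could pair each per-endpoint value with its cost metric roughly as follows. This is only a sketch under two assumptions: that the values for each source/destination pair appear in the same order as the entries of multi-cost-types, and that every top-level key other than "meta" is a source endpoint, as in the example. The function name `costs_by_metric` is hypothetical.

```python
import json

# Compact copy of the example above, embedded so the sketch is self-contained.
DOC = json.loads("""
{
  "meta": {"multi-cost-types": [
    {"cost-mode": "numerical", "cost-metric": "queued",
     "params-tuple": [{"stage": ["Production Input", "Production Output", "total"]},
                      {"unit": ["bytes", "files"]}]},
    {"cost-mode": "numerical", "cost-metric": "throughput",
     "params-tuple": [{"percentile": 95}, {"unit": "mbps"},
                      {"measurement-interval": ["1h", "1d", "1w"]}]},
    {"cost-mode": "numerical", "cost-metric": "closeness"}
  ]},
  "ipv4:192.0.2.2": {
    "ipv4:192.0.2.89": [[[4597740396, null], [null, 1], [4597740396, 1]],
                        [1.9, 0.5, 10.12], 2],
    "ipv4:192.0.2.43": [[[null, 2], [1192314951, 1], [1192314951, 3]],
                        [0.4, null, 8.2], 3]
  }
}
""")


def costs_by_metric(doc):
    """Map (source, destination) to {cost-metric: value} (hypothetical helper)."""
    metrics = [ct["cost-metric"] for ct in doc["meta"]["multi-cost-types"]]
    result = {}
    for src, destinations in doc.items():
        if src == "meta":
            continue  # assumed: all other top-level keys are source endpoints
        for dst, values in destinations.items():
            # Assumed: values are listed in multi-cost-types order.
            result[(src, dst)] = dict(zip(metrics, values))
    return result


costs = costs_by_metric(DOC)
pair = ("ipv4:192.0.2.2", "ipv4:192.0.2.43")
print(costs[pair]["closeness"])   # 3
print(costs[pair]["throughput"])  # [0.4, None, 8.2]
```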
