Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

De-duplicate environment CSV generation code so that we can have CI/CD information generation in one place #11134

Open
Tracked by #10718
tt-rkim opened this issue Aug 6, 2024 · 5 comments
Assignees
Labels
data-collection collab ticket with data science team infra-ci infrastructure and/or CI changes P2

Comments

@tt-rkim
Copy link
Collaborator

tt-rkim commented Aug 6, 2024

We currently generate the environment.csv for a benchmark run in the same job as the actual benchmark run. However, it uses duplicate code because we were trying to get up and running.

We should work to de-duplicating this, which includes:

  • producing it one place
  • uploading it one place, which may involve moving artifacts around

cc: @TT-billteng @skhorasganiTT @uaydonat

@skhorasganiTT
Copy link
Contributor

@tt-rkim Which duplicate code are you referring to?

@tt-rkim
Copy link
Collaborator Author

tt-rkim commented Aug 7, 2024

Because there was such a rush from the teams to get this up and running and there were oddities with getting this information within the running job, I made the decision to make some duplicate code. We generate a separate ci/cd environment CSV in a separate post-processing job as well. So that's the duplication.

It's not code you wrote, but we believe models team should own everything generated within the demos pipeline, as the original proposal for this information came from model's team.

@skhorasganiTT
Copy link
Contributor

It was my understanding that the generation of the run and measurement CSVs would be owned by the models team while the generation of the environment CSV (which is intentionally separate due to its information being specific to the CI environment) would be owned by the infra team. @uaydonat is this correct?
For the duplicate code, the csv saving function used for run/measurements here can be re-used, but besides that the duplication seems CI specific.

@tt-rkim
Copy link
Collaborator Author

tt-rkim commented Aug 7, 2024

Sounds good, but from now on anything to do with benchmarking schemas or extra data generation for benchmarking is to be owned by models team, including any additions to CI information.

@tt-rkim tt-rkim added the data-collection collab ticket with data science team label Aug 21, 2024
@tt-rkim
Copy link
Collaborator Author

tt-rkim commented Aug 21, 2024

@FrancesssZ has confirmed that we need to upload the files at the same time for the processor to not see NULLs.

We may need to move the upload to the produce data workflow in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-collection collab ticket with data science team infra-ci infrastructure and/or CI changes P2
Projects
None yet
Development

No branches or pull requests

2 participants