anaeletrica edited this page Nov 22, 2024 · 23 revisions

noWorkflow

Copyright (c) 2016 Universidade Federal Fluminense (UFF). Copyright (c) 2016 Polytechnic Institute of New York University. All rights reserved.


What is noWorkflow?

noWorkflow is a tool that automatically traces the provenance of a Python script without requiring changes to the original code. It lets users create and analyze a detailed history of how data was produced and transformed, which supports transparency and reliability in scientific experiments and data processes. Developed in Python, noWorkflow collects provenance using software engineering techniques such as abstract syntax tree (AST) analysis, reflection, and profiling, without requiring a version control system or any other external environment.
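To give a feel for the techniques named above, here is a minimal sketch using only the Python standard library. It is an illustration, not noWorkflow's implementation: the helper names `find_definitions_and_calls` and `trace_activations` and the tiny `SOURCE` script are hypothetical. The first helper inspects the AST statically; the second uses a profiler hook to record which functions actually run.

```python
import ast
import sys

# Hypothetical script used only for this illustration.
SOURCE = '''
def load(path):
    return [1, 2, 3]

def total(values):
    return sum(values)

result = total(load("data1.dat"))
'''


def find_definitions_and_calls(source):
    """Statically list defined functions and called names using the AST."""
    tree = ast.parse(source)
    definitions = [
        node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
    ]
    calls = [
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    ]
    return sorted(definitions), sorted(calls)


def trace_activations(source):
    """Run the script and record Python-level function calls via a profiler hook."""
    activated = []

    def profiler(frame, event, arg):
        code = frame.f_code
        # "call" fires for Python functions; skip the module-level frame itself.
        if (
            event == "call"
            and code.co_filename == "<script>"
            and code.co_name != "<module>"
        ):
            activated.append(code.co_name)

    compiled = compile(source, "<script>", "exec")
    sys.setprofile(profiler)
    try:
        exec(compiled, {})
    finally:
        sys.setprofile(None)  # always remove the hook
    return activated


if __name__ == "__main__":
    print(find_definitions_and_calls(SOURCE))
    print(trace_activations(SOURCE))
```

The static pass sees every call written in the code (including the built-in `sum`), while the dynamic pass sees only the Python functions that were actually executed, which is why combining both kinds of information is useful for provenance.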

Team

The main noWorkflow team is composed of researchers from Universidade Federal Fluminense (UFF), in Brazil, and New York University (NYU), in the USA.

  • João Felipe Pimentel (UFF) (main developer)
  • Juliana Freire (NYU)
  • Leonardo Murta (UFF)
  • Vanessa Braganholo (UFF)
  • Arthur Paiva (UFF)

Collaborators

  • David Koop (University of Massachusetts Dartmouth)
  • Fernando Chirigati (NYU)
  • Paolo Missier (Newcastle University)
  • Vynicius Pontes (UFF)
  • Henrique Linhares (UFF)
  • Eduardo Jandre (UFF)
  • Jessé Lima (Summer of Reproducibility)
  • Joshua Daniel Talahatu (Google Summer of Code)
  • Ana Carvalho (UFF)

History

The project started in 2013, when Leonardo Murta and Vanessa Braganholo were visiting professors at New York University (NYU) with Juliana Freire. At that time, David Koop and Fernando Chirigati also joined the project. They published the initial paper about noWorkflow at IPAW 2014. After returning to their home university, Universidade Federal Fluminense (UFF), Leonardo and Vanessa invited João Felipe Pimentel to join the project in 2014 for his PhD. João, Juliana, Leonardo, and Vanessa integrated noWorkflow with IPython and published a paper about it at TaPP 2015. They also worked on provenance versioning and fine-grained provenance collection and published papers at IPAW 2016. During the same period, David, João, Leonardo, and Vanessa worked with the YesWorkflow team on an integration between noWorkflow and YesWorkflow and published a demo at IPAW 2016. Research and development on noWorkflow continues and is currently under the responsibility of João Felipe, in the context of his PhD thesis.

Contribution Timeline

Publications

Why use noWorkflow?

noWorkflow identifies dependencies, parameters, and dataflows, helping to keep a detailed history of script executions. It speeds up the comparison of results across different versions, clears the way for collaboration and reproducibility, and makes experiments easier to understand and share.

Who uses noWorkflow?

Research scientists, data scientists, and professionals who work with Python scripts and need to keep track of processes and data.

Where does noWorkflow apply?

noWorkflow applies to scientific research, data analysis projects, and academic environments, such as research labs running complex experiments, where reproducibility is essential and rigorous documentation is required.

When to use noWorkflow?

Use noWorkflow when you need to capture, analyze, and document data provenance in complex experiments and workflows. It meets the specific needs of PROV, Prolog, and dataflow users.

How to install?

Follow the link to Quick Install

How to use?

The following steps are a guide to the main uses of noWorkflow. The tool comes with a demonstration project. To enter its directory, run:

$ cd ~/noworkflow/capture/demo_1

The demo script, called simulation.py, comes with the input data files data1.dat and data2.dat. Run it with:

$ now run -v simulation.py data1.dat data2.dat

The -v option turns the verbose mode on, so that noWorkflow gives you feedback on the steps taken by the tool. The output, in this case, is similar to what follows.

$ now run -v simulation.py data1.dat data2.dat
[now] removing noWorkflow boilerplate
[now] setting up local provenance store
[now] using content engine noworkflow.now.persistence.content.plain_engine.PlainEngine
[now] collecting deployment provenance
[now]   registering environment attributes
[now] collection definition and execution provenance
[now]   executing the script
[now] the execution of trial 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb finished successfully

Each new run produces a different trial that will be stored with a universally unique identifier in the relational database.
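If you save the verbose output, the trial identifier can be pulled out of the final log line and checked with the standard `uuid` module. This is a minimal sketch that assumes the message format shown in the sample output above; the helper name `extract_trial_id` is hypothetical, and the log wording may differ between noWorkflow versions.

```python
import re
import uuid

# Last line of the sample verbose output shown above.
LOG_LINE = (
    "[now] the execution of trial "
    "91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb finished successfully"
)

# Canonical 8-4-4-4-12 hexadecimal UUID text form, preceded by the word "trial".
TRIAL_PATTERN = re.compile(r"trial ([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})")


def extract_trial_id(line):
    """Return the trial identifier in a log line as a uuid.UUID, or None."""
    match = TRIAL_PATTERN.search(line)
    if match is None:
        return None
    return uuid.UUID(match.group(1))


if __name__ == "__main__":
    trial_id = extract_trial_id(LOG_LINE)
    print(trial_id, trial_id.version)
```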

Notice that a .noworkflow folder has been created in the "demo_1" directory. It stores the captured script, modules, and input data.
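To see what was stored, you can list the folder's contents. The sketch below makes no assumption about the internal layout of .noworkflow; it simply walks the directory and reports each file with its size. The helper name `summarize_store` is hypothetical.

```python
from pathlib import Path


def summarize_store(project_dir, store_name=".noworkflow"):
    """Return sorted (relative path, size in bytes) pairs for files in the store."""
    store = Path(project_dir) / store_name
    if not store.is_dir():
        return []  # no provenance store in this directory
    return sorted(
        (path.relative_to(store).as_posix(), path.stat().st_size)
        for path in store.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    for name, size in summarize_store("."):
        print(f"{size:>10}  {name}")
```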

Visualization Tool

noWorkflow provides a visualization tool that allows interactive analysis. Running the vis command with the -b option gives access to exporting dataflows, Prolog, and PROV. It also restores trials and files and allows the visualization of function activations. Try running:

$ now vis -b

The browser opens, and you can see a numbered circle at the top right of the window. Left-click the numbered circle to see the function activation and trial information. On the left, you can see the function activation and the command icons that change its display. The main ones are font size, function activation edge size, and download as .SVG file. The right-side menu contains links to the versions of the script, modules, and files used. Now close both the "Trial 1.1.1" and "1.1.1 - Information" windows. Right-click the numbered circle to see the available options.

Let's start with:

Export Dataflow

A window with the main options for generating the dataflow is shown. Note that the default "visualization depth" is zero, which captures all function calls and yields fine-grained provenance. This representation exposes system details, making it easier to track data dependencies, debug issues, and understand data transformations across different processing levels.


Use the scroll wheel to fit the dataflow to the window. The dataflow graph represents the processing flow of a program through the following elements:

1 --> Dark Blue Rectangles (Blue boxes):

  • Represent functions or processing modules;
  • These are the main processing nodes that execute specific operations.

2 --> Light Blue Rectangles (Light blue boxes):

  • Represent variables, parameters, or intermediate data;
  • These are data that flow between different functions.

3 --> White/Colorless Rectangles:

  • Represent input/output files;
  • These are external resources that the program reads or writes.

4 --> Arrows/Edges:

  • Solid arrows: indicate direct data flow between components;
  • Dotted arrows: may indicate secondary dependencies or conditional flows;
  • The direction of the arrows shows how data moves from one component to another.
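The legend above can be mirrored in a small, hand-written DOT graph. The following sketch is only an illustration of the node and edge styles described; it is not the DOT that noWorkflow generates. The function name `to_dot`, the color names, and the node `run_simulation` are assumptions chosen for the example.

```python
def to_dot(functions, values, files, edges):
    """Build DOT text using the legend's conventions.

    functions: dark blue boxes; values: light blue boxes; files: plain boxes.
    edges: (source, target, dotted) triples; dotted marks a secondary dependency.
    """
    lines = ["digraph dataflow {", "  node [shape=box];"]
    for name in functions:
        lines.append(
            f'  "{name}" [style=filled, fillcolor="steelblue", fontcolor="white"];'
        )
    for name in values:
        lines.append(f'  "{name}" [style=filled, fillcolor="lightblue"];')
    for name in files:
        lines.append(f'  "{name}";')
    for source, target, dotted in edges:
        style = " [style=dotted]" if dotted else ""
        lines.append(f'  "{source}" -> "{target}"{style};')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical nodes: only the file names come from the demo.
    print(
        to_dot(
            functions=["run_simulation"],
            values=["data"],
            files=["data1.dat", "output.png"],
            edges=[
                ("data1.dat", "run_simulation", False),
                ("run_simulation", "data", False),
                ("data", "output.png", True),
            ],
        )
    )
```

The printed text can be saved to a .dot file and rendered with Graphviz.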

The dataflow can be exported as an .SVG or .DOT file. This is a common type of visualization in program and system analysis, showing how data flows through different processing stages and the dependencies between system components. However, the high level of detail can make the graph complex, making it harder to grasp the system's overall purpose and main workflows. To get a coarse-grained dataflow, change the depth as follows:

Close the "Dataflow trial 1.1.1" window. Right-click the numbered circle and choose the "export dataflow" option again. Set "visualization depth" to 2, where 0 means unlimited depth and 1 is the shallowest level.
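One way to picture the depth setting is as a cut-off applied to the tree of function calls. The sketch below is an interpretation of that idea under the convention stated above (0 means no limit, 1 keeps only the top level); it is not noWorkflow's algorithm, and the call tree and the name `limit_depth` are hypothetical.

```python
def limit_depth(node, depth, level=1):
    """Return a copy of a (name, children) call tree cut at the given depth.

    depth == 0 keeps everything; depth == 1 keeps only the root.
    """
    name, children = node
    if depth != 0 and level >= depth:
        return (name, [])  # collapse everything below this level
    return (name, [limit_depth(child, depth, level + 1) for child in children])


# Hypothetical call tree: main calls simulate (which calls step) and plot.
CALL_TREE = ("main", [("simulate", [("step", [])]), ("plot", [])])

if __name__ == "__main__":
    print(limit_depth(CALL_TREE, 2))
```

With a depth of 2, `step` disappears from the view while `simulate` and `plot` remain, which is the coarse-grained effect described above.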

Moving forward, let's try the function:

Export Prolog

Close the "Dataflow trial 1.1.1" window. Right-click the numbered circle and choose the "export prolog" option. Then click "Generate prolog". The contents of the Prolog export are displayed on the left.

1 --> You can drag the highlighted point to scale the window to the prolog content display.

2 --> You can write the desired query.

3 --> The available query forms are listed under “FACT DEFINITION”. You can browse them by scrolling.

Suppose we want to trace the files accessed during execution to ensure data integrity and check that the files were changed as expected. We can enter the following query and then click "Execute Query":

access(TrialId, Id, Name, Mode, ContentHashBefore, ContentHashAfter, Checkpoint, ActivationId)

The answer shows that the Id of the access is f12, which appears to be a unique identifier for this access. The file was accessed in w+b mode, which means it was opened for reading and writing in binary format. The values of ContentHashBefore and ContentHashAfter are equal: bee9e9be97f712a78c828de259f8c772a081e09f. This suggests that the contents of the output.png file did not change during the access. Since w+ mode truncates the file when it is opened, equal hashes mean that the content written was identical to the content that was there before. The Checkpoint is 31.613632399999997, which can be useful for correlating the access with other operations or identifying where in the process the access took place. The ActivationId is 8448, which associates this access with a specific function activation during the execution.
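The before/after comparison can be reproduced by hand. The sketch below hashes a file, rewrites it in w+b mode, and hashes it again. It assumes SHA-1 only because the hash shown above has 40 hexadecimal digits, which matches SHA-1's output length; how noWorkflow actually computes its content hashes is not confirmed here, and the helper names are hypothetical.

```python
import hashlib
import os
import tempfile


def content_hash(path):
    """Return the SHA-1 hex digest of a file's bytes (assumed hash function)."""
    with open(path, "rb") as handle:
        return hashlib.sha1(handle.read()).hexdigest()


def rewrite_and_compare(path, new_content):
    """Hash the file, rewrite it in w+b mode, hash it again, and compare."""
    before = content_hash(path)
    with open(path, "w+b") as handle:  # w+b truncates the file on open
        handle.write(new_content)
    after = content_hash(path)
    return before, after, before == after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "output.png")
        with open(path, "wb") as handle:
            handle.write(b"same bytes")
        # Writing identical bytes leaves the hash unchanged.
        print(rewrite_and_compare(path, b"same bytes"))
        # Writing different bytes changes it.
        print(rewrite_and_compare(path, b"other bytes"))
```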

Now, we can move on to testing the option:

Export Prov

Close the "Prolog trial 1.1.1" window. Right-click the numbered circle and choose the "export prov" option.


PROV is a W3C ontology used to represent semantics and relationships between elements, especially in the context of workflows.


It describes how entities (data, files), activities (function executions, operations), and agents (users or systems) interact throughout an experiment or script execution.
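As a rough illustration of these three kinds of elements, the sketch below prints PROV-N-style statements for one script run. It is hand-written and is not the output of the "export prov" option; the `ex:` identifiers and the function name `prov_statements` are hypothetical.

```python
def prov_statements(activity, agent, used, generated):
    """Return PROV-N-style lines relating one activity to its entities and agent."""
    lines = [f"entity({entity})" for entity in used + generated]
    lines.append(f"activity({activity})")
    lines.append(f"agent({agent})")
    # Inputs read by the activity.
    lines.extend(f"used({activity}, {entity})" for entity in used)
    # Outputs produced by the activity.
    lines.extend(f"wasGeneratedBy({entity}, {activity})" for entity in generated)
    # Who or what was responsible for the activity.
    lines.append(f"wasAssociatedWith({activity}, {agent})")
    return lines


if __name__ == "__main__":
    for line in prov_statements(
        activity="ex:run",
        agent="ex:user",
        used=["ex:data1.dat", "ex:data2.dat"],
        generated=["ex:output.png"],
    ):
        print(line)
```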

Let's go on with:

Restore File

Update in progress.