Semantic profiler and report generation module integration #824

Open
wants to merge 6 commits into
base: dev
10 changes: 10 additions & 0 deletions transforms/code/code_profiler/README.md
@@ -61,3 +61,13 @@ The high-level system design is as follows:
For each new target language, the offline phase is utilized to create deterministic rules by harnessing the capabilities of LLMs and working with exemplar code samples from the target language. In this process, Workflow W1 facilitates the creation of rules around syntactic structures based on exemplar code samples, while Workflow W2 is used to establish semantic dimensions for profiling. Subsequently, we derive rules that connect syntactic constructs to the predefined semantic concepts. These rules are then stored in a rule database, ready to be employed during the online phase.

In the online phase, the system dynamically generates profiling outputs for any incoming code snippets. This is achieved by extracting concepts from the snippets using the rules in the database and storing these extractions in a tabular format. The structured tabular format allows for generating additional concept columns, which are then utilized to create comprehensive profiling reports.

The following runtimes are available:
* [python](python/README.md) - provides the base Python-based transformation
implementation and the Python runtime.
* [ray](ray/README.md) - runs the base Python transformation
in a Ray runtime.

To run the Python code profiler, refer to the notebook at `transforms/code/code_profiler/notebook_example/code-profiler.ipynb`.


5 changes: 3 additions & 2 deletions transforms/code/code_profiler/input/data_profiler_params.json
@@ -1,5 +1,6 @@
 {
   "input": "multi-package.parquet",
-  "contents": "Contents",
-  "language": "Language"
+  "dynamic_schema_mapping": "True",
+  "contents": "contents",
+  "language": "language"
 }
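The diff above lowercases the column names and adds a string-valued flag. As a rough illustration of why the casing matters, here is a minimal sketch that loads the new parameters and checks them against a table's column names. The helper names `load_params` and `missing_columns` are hypothetical and not part of the transform, and the source does not say what `dynamic_schema_mapping` actually does, so the sketch only converts the flag to a boolean.

```python
import json

# The parameter file as it reads after this change.
PARAMS_JSON = """
{
  "input": "multi-package.parquet",
  "dynamic_schema_mapping": "True",
  "contents": "contents",
  "language": "language"
}
"""


def load_params(text):
    """Parse the params JSON (hypothetical helper, not the transform's API)."""
    params = json.loads(text)
    # The flag is stored as the string "True", so convert it explicitly.
    flag = str(params.get("dynamic_schema_mapping", "False"))
    params["dynamic_schema_mapping"] = flag.strip().lower() == "true"
    return params


def missing_columns(params, available_columns):
    """Return the configured column names that the table does not contain."""
    wanted = [params["contents"], params["language"]]
    return [name for name in wanted if name not in available_columns]


params = load_params(PARAMS_JSON)
# Column lookups are case-sensitive here, so the old capitalized
# names would not match a table with lowercase columns.
print(missing_columns(params, ["contents", "language"]))
print(missing_columns(params, ["Contents", "Language"]))
```

A check like this, run before the transform starts, would surface a column-name mismatch as a clear message rather than a failure deep inside the pipeline.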
Binary file modified transforms/code/code_profiler/input/multi-package.parquet
Binary file not shown.
1,130 changes: 96 additions & 1,034 deletions transforms/code/code_profiler/notebook_example/code-profiler.ipynb

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions transforms/code/code_profiler/python/Makefile
Collaborator

@pankajskku I tried to run `make run-local-python-sample` on macOS and got the error below. I could not find anything in README.md to guide me. Could you update README.md with the additional configuration needed to run the sample code?

Bindings bindings_dir: /Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/tree-sitter-bindings-a2ed8cfe-8fa8-49fd-9ffa-f78a8b10c08c
Bindings path: /Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/tree-sitter-bindings-a2ed8cfe-8fa8-49fd-9ffa-f78a8b10c08c/x86_64
14:41:42 ERROR - Exception creating transform  dlopen(/Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/tree-sitter-bindings-a2ed8cfe-8fa8-49fd-9ffa-f78a8b10c08c/x86_64/c_sharp-bindings.so, 0x0006): tried: '/Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/tree-sitter-bindings-a2ed8cfe-8fa8-49fd-9ffa-f78a8b10c08c/x86_64/c_sharp-bindings.so' (not a mach-o file), '/System/Volumes/Preboot/Cryptexes/OS/Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/tree-sitter-bindings-a2ed8cfe-8fa8-49fd-9ffa-f78a8b10c08c/x86_64/c_sharp-bindings.so' (no such file), '/Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/tree-sitter-bindings-a2ed8cfe-8fa8-49fd-9ffa-f78a8b10c08c/x86_64/c_sharp-bindings.so' (not a mach-o file)
Traceback (most recent call last):
  File "/Users/touma/data-prep-kit-code-profiler/data-processing-lib/python/src/data_processing/runtime/pure_python/transform_file_processor.py", line 51, in __init__
    self.transform = transform_class(self.transform_params)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/code_profiler_transform.py", line 92, in __init__
    CSHARP_LANGUAGE = Language(os.path.join(bindings_path, 'c_sharp-bindings.so'), 'c_sharp')
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/venv/lib/python3.11/site-packages/tree_sitter/__init__.py", line 132, in __init__
    self.lib = cdll.LoadLibrary(fspath(path_or_ptr))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.10/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ctypes/__init__.py", line 454, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.10/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: dlopen(/Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/tree-sitter-bindings-a2ed8cfe-8fa8-49fd-9ffa-f78a8b10c08c/x86_64/c_sharp-bindings.so, 0x0006): tried: '/Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/tree-sitter-bindings-a2ed8cfe-8fa8-49fd-9ffa-f78a8b10c08c/x86_64/c_sharp-bindings.so' (not a mach-o file), '/System/Volumes/Preboot/Cryptexes/OS/Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/tree-sitter-bindings-a2ed8cfe-8fa8-49fd-9ffa-f78a8b10c08c/x86_64/c_sharp-bindings.so' (no such file), '/Users/touma/data-prep-kit-code-profiler/transforms/code/code_profiler/python/src/tree-sitter-bindings-a2ed8cfe-8fa8-49fd-9ffa-f78a8b10c08c/x86_64/c_sharp-bindings.so' (not a mach-o file)
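
The "not a mach-o file" message means the dynamic loader found the file but did not recognize its binary format. A minimal sketch for checking this before calling `Language(...)` is shown below. It only inspects the first four bytes of the file; the function name `binary_format` is hypothetical, and the magic numbers are the standard ELF and Mach-O ones rather than anything specific to this project.

```python
import struct

# Standard magic numbers found at the start of shared-library files.
ELF_MAGIC = b"\x7fELF"
MACHO_MAGICS = {
    0xFEEDFACE,  # 32-bit Mach-O
    0xFEEDFACF,  # 64-bit Mach-O
    0xCEFAEDFE,  # 32-bit Mach-O, byte-swapped
    0xCFFAEDFE,  # 64-bit Mach-O, byte-swapped
    0xCAFEBABE,  # universal (fat) binary
    0xBEBAFECA,  # universal binary, byte-swapped
}


def binary_format(path):
    """Return 'elf', 'mach-o', or 'unknown' from the file's first four bytes."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == ELF_MAGIC:
        return "elf"
    if len(head) == 4 and struct.unpack(">I", head)[0] in MACHO_MAGICS:
        return "mach-o"
    return "unknown"
```

Calling `binary_format` on `c_sharp-bindings.so` and getting `'elf'` on a Mac would confirm that the selected bindings were built for Linux rather than macOS.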

Member Author

@pankajskku Nov 27, 2024

Hello Maroun,

I have updated the README to guide the user on how to run the transform on their host.
The code profiler can be run on mach-arm64 and x86_64 host architectures. Please change the RUNTIME_HOST_ARCH in the Makefile depending on your host architecture.

#values possible mach-arm64, x86_64
export RUNTIME_HOST_ARCH=x86_64

Because these are .so bindings, you may need to allow your Mac to load them in the security settings. You generally get a pop-up under the Security tab, as shown here. If not, I would recommend using the x86_64 arch.

image.

Collaborator

@pankajskku is there a reason that every run of the code creates a new folder src/tree-sitter-bindings-*, each 162 MB in size? Why does the folder name include a UUID? Also, in addition to src/tree-sitter-bindings-, I have copies of the same files in python python3.11/site-packages/tree-sitter-bindings-. We may need a call to discuss how these files are used and delivered. Please give some thought to how we can simplify things, maybe even by delivering the files as part of the pip install of the transform. I will schedule a call for Monday morning if that is OK with you. Thanks.

Member Author

@pankajskku Nov 28, 2024

Hi Maroun,

The tree-sitter bindings convert source code to an abstract syntax tree. Each language has its own bindings. The bindings are cloned from a public repo (https://github.com/pankajskku/tree-sitter-bindings/tree/main) and deleted after the program exits. However, the failure case wasn't handled properly, so the cloned folder wasn't deleted. In the updated PR, I have added a check that handles the exception and cleans up the bindings .so files. I couldn't find python3.11/site-packages/tree-sitter-bindings in my venv. We can also discuss this on Monday. Thanks.
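
The cleanup-on-failure behavior described above can be sketched with a `try`/`finally` block. This is an illustration under stated assumptions, not the PR's actual code: `with_temp_bindings`, `fetch`, and `use` are hypothetical names, with `fetch` standing in for whatever clones the bindings and `use` for whatever loads them.

```python
import shutil
import uuid
from pathlib import Path


def with_temp_bindings(base_dir, fetch, use):
    """Create a uniquely named bindings folder, run `use`, and always delete it.

    `fetch(bindings_dir)` populates the folder and `use(bindings_dir)` consumes
    it; both are hypothetical callables supplied by the caller.
    """
    bindings_dir = Path(base_dir) / f"tree-sitter-bindings-{uuid.uuid4()}"
    bindings_dir.mkdir(parents=True)
    try:
        fetch(bindings_dir)
        return use(bindings_dir)
    finally:
        # Runs on success and on any exception, so no folder is left behind.
        shutil.rmtree(bindings_dir, ignore_errors=True)
```

Because the removal sits in `finally`, a failure such as the `dlopen` error in this thread would still propagate to the caller, but the folder would not accumulate across runs.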

@@ -35,7 +35,7 @@ setup:: .transforms.setup
 set-versions:
 	$(MAKE) TRANSFORM_PYTHON_VERSION=$(CODE_PROFILER_PYTHON_VERSION) TOML_VERSION=$(CODE_PROFILER_PYTHON_VERSION) .transforms.set-versions

-build-dist:: .defaults.build-dist
+build-dist:: .defaults.build-dist

 publish-dist:: .defaults.publish-dist

@@ -51,5 +51,5 @@ run-local-sample: .transforms.run-local-sample

 run-local-python-sample:
 	$(MAKE) RUN_FILE=code_profiler_local_python.py \
-	RUN_ARGS="--content 'Contents' --language 'Language'" \
+	RUN_ARGS="--content 'contents' --language 'language'" \
 	.transforms.run-local-python-sample
11 changes: 11 additions & 0 deletions transforms/code/code_profiler/python/README.md
@@ -17,6 +17,17 @@ the options provided by
the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).

### Running the samples

The code profiler can be run on the mach-arm64 and x86_64 host architectures.
Depending on your host architecture, change `RUNTIME_HOST_ARCH` in the Makefile.
```
# values possible mach-arm64, x86_64
export RUNTIME_HOST_ARCH=x86_64
```
If you are on a Mac, you may need to allow macOS to load the .so files in the security settings. A pop-up generally appears under the Security tab while the transform is running.

![alt text](image.png)
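
As a rough aid for choosing the value, the sketch below maps the running machine to one of the two `RUNTIME_HOST_ARCH` values listed above. The function name `detect_runtime_host_arch` and the mapping itself are assumptions, not part of the transform. In particular, nothing here confirms that the x86_64 bindings load on an Intel Mac, so treat the result as a starting point.

```python
import platform


def detect_runtime_host_arch(system=None, machine=None):
    """Suggest a RUNTIME_HOST_ARCH value for this host (hypothetical helper)."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    # Apple Silicon Macs report Darwin with an arm64 machine type.
    if system == "darwin" and machine in ("arm64", "aarch64"):
        return "mach-arm64"
    # Both spellings are used for 64-bit Intel/AMD hosts.
    if machine in ("x86_64", "amd64"):
        return "x86_64"
    raise ValueError(f"no known bindings for {system}/{machine}")


print(f"export RUNTIME_HOST_ARCH={detect_runtime_host_arch('Darwin', 'arm64')}")
```

Calling the function without arguments uses the values reported by the current host.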

To run the samples, use the following `make` targets

* `run-local-sample` - runs src/code_profiler_local.py
1 change: 1 addition & 0 deletions transforms/code/code_profiler/python/pyproject.toml
@@ -7,6 +7,7 @@ license = {text = "Apache-2.0"}
readme = {file = "README.md", content-type = "text/markdown"}
authors = [
{ name = "Pankaj Thorat", email = "pankaj.thorat@ibm.com" },
{ name = "Aishwariya Chakraborty", email = "aishwariya.chakraborty1@ibm.com" },
]

dynamic = ["dependencies"]
3 changes: 2 additions & 1 deletion transforms/code/code_profiler/python/requirements.txt
@@ -95,4 +95,5 @@ tzdata==2024.1
 urllib3==2.2.2
 uuid
 wcwidth==0.2.13
-wrapt==1.16.0
+wrapt==1.16.0
+plotly==5.15.0
33 changes: 31 additions & 2 deletions transforms/code/code_profiler/python/src/UAST_parser.py
@@ -228,8 +228,9 @@ def _add_user_defined(self, node):
         return

     # Traversing through the AST to create nodes recursively.
-    def _dfs(self, AST_node, parent) :
-        if (AST_node.type in self.rules) :
+    def _dfs(self, AST_node, parent):
+
+        if (AST_node.type in self.rules):
             ast_snippet = AST_node.text.decode("utf8")
             node_type = self.rules[AST_node.type]["uast_node_type"]
             exec_string = self.rules[AST_node.type]["extractor"]
@@ -269,3 +270,31 @@ def _extract(self, ast_snippet, node_type, exec_string):
return self.grammar[node_type]["keyword"] + " " + self.extracted
except Exception as e:
print(e)

def uast_read(jsonstring):
    """
    Reads an input json string into UAST class object
    """
    uast = UAST()
    if jsonstring is not None and jsonstring != 'null':
        uast.load_from_json_string(jsonstring)
        return uast
    return None

def extract_ccr(uast):
    """
    Calculates the code to comment ratio given an UAST object as input
    """
    if uast is not None:
        total_comment_loc = 0
        for node_idx in uast.nodes:
            node = uast.get_node(node_idx)
            if node.node_type == 'uast_comment':
                total_comment_loc += node.metadata.get("loc_original_code", 0)
            elif node.node_type == 'uast_root':
                loc_snippet = node.metadata.get("loc_snippet", 0)
        if total_comment_loc > 0:
            return loc_snippet / total_comment_loc
        else:
            return None
    return None
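
To make the ratio concrete, here is a small standalone sketch of the same calculation. It is not the PR's code: it takes plain `(node_type, metadata)` pairs instead of a `UAST` object, and the name `code_to_comment_ratio` is hypothetical. It also initializes `loc_snippet` up front, which is one possible way to avoid an unbound variable when comments are present but no `uast_root` node is seen.

```python
def code_to_comment_ratio(nodes):
    """Return snippet LOC divided by total comment LOC, or None without comments.

    `nodes` is an iterable of (node_type, metadata) pairs, a simplified
    stand-in for the nodes of a UAST object.
    """
    total_comment_loc = 0
    loc_snippet = 0  # default in case no 'uast_root' node is present
    for node_type, metadata in nodes:
        if node_type == "uast_comment":
            total_comment_loc += metadata.get("loc_original_code", 0)
        elif node_type == "uast_root":
            loc_snippet = metadata.get("loc_snippet", 0)
    if total_comment_loc > 0:
        return loc_snippet / total_comment_loc
    return None


sample = [
    ("uast_root", {"loc_snippet": 20}),
    ("uast_comment", {"loc_original_code": 2}),
    ("uast_comment", {"loc_original_code": 3}),
]
print(code_to_comment_ratio(sample))  # 20 / (2 + 3) = 4.0
```

A snippet with no comment lines yields `None` rather than a division by zero, matching the behavior of the function in the diff.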
@@ -24,8 +24,8 @@
 local_conf = {
     "input_folder": input_folder,
     "output_folder": output_folder,
-    "contents": "Contents",
-    "language": "Language"
+    "contents": "contents",
+    "language": "language"
 }
 params = {
     # Data access. Only required parameters are specified