Merge pull request #3 from FSoft-AI4Code/dev/extended
Dev/extended
minhna1112 authored Jul 12, 2023
2 parents b6a5973 + 99e6575 commit 026a807
Showing 34 changed files with 604 additions and 84 deletions.
10 changes: 9 additions & 1 deletion HISTORY.md
@@ -70,10 +70,18 @@ Release data: Dec 12, 2022

Version 0.0.6
=============
Release data: Jan 9, 2022
Release data: Jan 9, 2023

* Add tree sitter utils (in codetext.parser)
* Replace all `match_from_span` with `get_node_text`
* Replace all `traverse_type` with `get_node_by_kind`
* Fix `CppParser.get_function_metadata` missing `param_type` and `param_identifier`
* Update return metadata from all parsers

Version 0.0.7
=============
Release data: Jul 5, 2023

* Update all class extractor format (using dict instead of list)
* Fix missing identifier, parameter in C, C#, Java parser
* Implement CLI
137 changes: 108 additions & 29 deletions README.md
@@ -1,79 +1,152 @@
<div align="center">

<p align="center">
<img src="https://avatars.githubusercontent.com/u/115590550?s=200&v=4" width="220px" alt="logo">
<img src="./asset/img/codetext_logo.png" width="220px" alt="logo">
</p>

**CodeText-parser**
______________________________________________________________________


<!-- Badge start -->
| Branch | Build | Unittest | Linting | Release | License |
|-------- |------- |---------- |--------- |--------- |--------- |
| main | | [![Unittest](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml/badge.svg)](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml) | | [![release](https://img.shields.io/pypi/v/codetext)](https://pypi.org/project/codetext/) [![pyversion](https://img.shields.io/pypi/pyversions/codetext)](https://pypi.org/project/codetext/)| [![license](https://img.shields.io/github/license/AI4Code-Research/CodeText-parser)](https://github.com/AI4Code-Research/CodeText-parser/blob/main/LICENSES.txt) |
| Branch | Build | Unittest | Release | License |
|-------- |------- |---------- |--------- |--------- |
| main | | [![Unittest](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml/badge.svg)](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml) | [![release](https://img.shields.io/pypi/v/codetext)](https://pypi.org/project/codetext/) [![pyversion](https://img.shields.io/pypi/pyversions/codetext)](https://pypi.org/project/codetext/)| [![license](https://img.shields.io/github/license/AI4Code-Research/CodeText-parser)](https://github.com/AI4Code-Research/CodeText-parser/blob/main/LICENSES.txt) |
<!-- Badge end -->
</div>

______________________________________________________________________

**Code-Text data toolkit** contains multilingual programming language parsers for the extract from raw source code into multiple levels of pair data (code-text) (e.g., function-level, class-level, inline-level).
**Code-Text parser** is a custom [tree-sitter](https://github.com/tree-sitter) grammar parser that extracts classes and functions from raw source code. We support 10 common programming languages:
- Python
- Java
- JavaScript
- PHP
- Ruby
- Rust
- C
- C++
- C#
- Go

# Installation
Setup environment and install dependencies and setup by using `install_env.sh`
```bash
bash -i ./install_env.sh
```
then activate conda environment named "code-text-env"
The **codetext** package requires Python 3.7 or above and tree-sitter. Set up the environment and install the dependencies manually from source:
```bash
conda activate code-text-env
git clone https://github.com/FSoft-AI4Code/CodeText-parser.git; cd CodeText-parser
pip install -r requirements.txt
pip install -e .
```

*Setup for using parser*
Or install the package from PyPI:
```bash
pip install codetext
```

# Getting started

## Build your language
Auto build tree-sitter into `<language>.so` located in `/tree-sitter/`
## `codetext` CLI Usage
```bash
codetext [options] [PATH or FILE] ...
```

For example, to extract from every Python file in the `src/` folder:
```bash
codetext src/ --language Python
```

To store the extracted classes and functions, pass the `--json` flag together with `--output_file` and a path to the destination file:
```bash
codetext src/ --language Python --output_file ./python_report.json --json
```
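
A saved report can be read back with the standard `json` module. The sketch below assumes only what the CLI source in this commit shows: the report is a single JSON object keyed by the analyzed file path. The shape of each per-file value is not documented here, so the `classes` and `functions` keys in the sample entry are hypothetical placeholders, as is the helper name `list_analyzed_files`.

```python
import json
import os
import tempfile


def list_analyzed_files(report_path):
    """Return the sorted file paths recorded in a codetext JSON report."""
    with open(report_path) as report_file:
        report = json.load(report_file)
    return sorted(report)


# Stand-in report for demonstration; the per-file payload is hypothetical.
sample_report = {
    "src/b.py": {"classes": [], "functions": []},
    "src/a.py": {"classes": [], "functions": []},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    report_path = os.path.join(tmp_dir, "python_report.json")
    with open(report_path, "w") as report_file:
        json.dump(sample_report, report_file, sort_keys=True, indent=4)
    analyzed = list_analyzed_files(report_path)

print(analyzed)
```

Only the top-level keys are relied on, so the helper keeps working if the per-file payload changes between releases.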

**Options**

```bash
positional arguments:
  paths                 list of the filename/paths.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -l LANGUAGE, --language LANGUAGE
                        Target the programming languages you want to analyze.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output file (e.g report.json).
  --json                Generate json output as a transform of the default
                        output
  --verbose             Print progress bar
```
**Example**
```
File circle_linkedlist.py analyzed:
==================================================
Number of class : 1
Number of function : 2
--------------------------------------------------
Class summary:
+-----+---------+-------------+
| # | Class | Arguments |
+=====+=========+=============+
| 0 | Node | |
+-----+---------+-------------+
Class analyse: Node
+-----+---------------+-------------+--------+---------------+
| # | Method name | Paramters | Type | Return type |
+=====+===============+=============+========+===============+
| 0 | __init__ | self | | |
| | | data | | |
+-----+---------------+-------------+--------+---------------+
Function analyse:
+-----+-----------------+-------------+--------+---------------+
| # | Function name | Paramters | Type | Return type |
+=====+=================+=============+========+===============+
| 0 | push | head_ref | | Node |
| | | data | Any | Node |
| 1 | countNodes | head | Node | |
+-----+-----------------+-------------+--------+---------------+
```
## Using `codetext` as Python module
### Build your language
`codetext` needs a tree-sitter language file (i.e. a `.so` file) to work properly. You can compile a language manually ([see more](https://github.com/tree-sitter/py-tree-sitter#usage)) or build it automatically with our pre-defined function (the `<language>.so` file will be saved in a folder named `/tree-sitter/`):
```python
from codetext.utils import build_language
language = 'rust'
build_language(language)

# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so
```
## Language Parser
We supported 10 programming languages, namely `Python`, `Java`, `JavaScript`, `Golang`, `Ruby`, `PHP`, `C#`, `C++`, `C` and `Rust`.
### Using Language Parser
Each supported programming language corresponds to a custom `language_parser` (e.g. Python uses [`PythonParser()`](src/codetext/parser/python_parser.py#L11)). A `language_parser` takes raw source code as input and uses breadth-first search to traverse all syntax nodes. Classes, methods, and stand-alone functions are then collected:
Setup
```python
from codetext.utils import parse_code
raw_code = """
/**
* Sum of 2 number
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
return a + b;
}
"""
# Auto parse code into tree-sitter.Tree
root = parse_code(raw_code, 'cpp')
root_node = root.root_node
```
Get all function nodes inside a specific node, use:
Get all function nodes inside a specific node:
```python
from codetext.utils.parser import CppParser
@@ -105,3 +178,9 @@ class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)
```
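
The breadth-first traversal mentioned under "Using Language Parser" can be sketched roughly as follows. This is not the library's implementation: `FakeNode` is a hypothetical stand-in that mimics only the `type` and `children` attributes of a syntax node, and `collect_by_kind` is an illustrative name.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeNode:
    """Hypothetical stand-in for a syntax node (only `type` and `children`)."""
    type: str
    children: List["FakeNode"] = field(default_factory=list)


def collect_by_kind(root, kinds):
    """Walk the tree breadth-first and return nodes whose type is in `kinds`."""
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.type in kinds:
            found.append(node)
        # Children are queued behind the current level, giving BFS order.
        queue.extend(node.children)
    return found


tree = FakeNode("translation_unit", [
    FakeNode("comment"),
    FakeNode("function_definition", [FakeNode("compound_statement")]),
    FakeNode("class_specifier", [FakeNode("function_definition")]),
])
matches = collect_by_kind(tree, {"function_definition"})
print(len(matches))
```

A queue-based walk visits shallow nodes first, so top-level functions are reported before methods nested inside classes.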

# Limitations
`codetext` depends heavily on tree-sitter syntax:
- Since we use tree-sitter grammars to extract the desired nodes, such as functions, classes, function names (identifiers), or class argument lists, `codetext` is vulnerable to tree-sitter updates and future syntax changes.

- While we try our best to capture every possibility, there are still plenty of cases we do not cover. We welcome community contributions to this project.
Binary file added asset/img/codetext_logo.png
Binary file added asset/img/codetext_logo_line.png
6 changes: 5 additions & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "codetext"
version = "0.0.5"
version = "0.0.7"
authors = [
{ name="Dung Manh Nguyen", email="dungnm.workspace@gmail.com" },
]
@@ -21,8 +21,12 @@ dependencies = [
"Levenshtein>=0.20",
"langdetect>=1.0.0",
"bs4>=0.0.1",
"tabulate>=0.9.0"
]

[project.urls]
"Homepage" = "https://github.com/AI4Code-Research/CodeText-data"
"Bug Tracker" = "https://github.com/AI4Code-Research/CodeText-data/issues"

[project.scripts]
codetext = "codetext.__main__:main"
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,6 +1,6 @@
# for preprocessing
tree-sitter
# docstring-parser
tabulate
Levenshtein
langdetect
bs4
Empty file modified src/codetext/__init__.py
100755 → 100644
Empty file.
93 changes: 93 additions & 0 deletions src/codetext/__main__.py
@@ -0,0 +1,93 @@
import os
import argparse
import pkg_resources

import json
from .codetext_cli import parse_file, print_result, PL_MATCHING


def get_args():
    parser = argparse.ArgumentParser(description=f"codetext parser {20*'='}")

    parser.add_argument('paths', nargs='*', default=['.'],
                        help='list of the filename/paths.')
    parser.add_argument("--version", action="version",
                        version=pkg_resources.get_distribution("codetext").version)
    parser.add_argument("-l", "--language",
                        help='''Target the programming languages you want to
                        analyze.''')
    parser.add_argument("-o", "--output_file",
                        help='''Output file (e.g report.json).
                        ''',
                        type=str)
    parser.add_argument("--json",
                        help='''Generate json output as a transform of the
                        default output''',
                        action="store_true")
    parser.add_argument("--verbose",
                        help='''Print progress bar''',
                        action="store_true")

    return parser.parse_args()


def main():
    opt = get_args()

    # check args: --json needs a destination, --language must be supported
    if opt.json and not opt.output_file:
        raise ValueError("Missing --output_file")
    if opt.language and opt.language not in PL_MATCHING:
        raise ValueError(
            "{language} not supported. Currently support {sp_language}"
            .format(language=opt.language,
                    sp_language=list(PL_MATCHING.keys())))

    # check path: collect files from every given path, not only the last one
    files = []
    for path in opt.paths:
        assert os.path.exists(path), "paths is not valid"

        if os.path.isdir(path):
            files.extend(os.path.join(path, f) for f in os.listdir(path)
                         if os.path.isfile(os.path.join(path, f)))
        elif os.path.isfile(path):
            files.append(path)

    if opt.language:
        for file in files[:]:
            filename, file_extension = os.path.splitext(file)
            if file_extension not in PL_MATCHING[opt.language]:
                files.remove(file)

    output_metadata = {}
    for file in files:
        filename, file_extension = os.path.splitext(file)

        if opt.language is None:
            for lang, ext_list in PL_MATCHING.items():
                if file_extension in ext_list:
                    language = lang
                    break
            else:
                continue  # no supported extension matched: skip this file
        else:
            language = opt.language

        output = parse_file(file, language=language)
        print_result(
            output,
            file_name=str(filename).split(os.sep)[-1]+file_extension
        )
        output_metadata[file] = output

    if opt.json:
        save_path = opt.output_file
        with open(save_path, 'w') as output_file:
            json.dump(output_metadata, output_file, sort_keys=True, indent=4)
        print(50*'=')
        print("Save report to {path}".format(path=save_path))


if __name__ == '__main__':
    main()
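
When `--language` is omitted, `main()` infers the language of each file from its extension via `PL_MATCHING`. A minimal sketch of that lookup in isolation is shown below. The mapping is a hypothetical example (the real `PL_MATCHING` lives in `codetext.codetext_cli` and its contents are not shown in this diff), and `detect_language` is an illustrative name, not a function of the package.

```python
import os

# Hypothetical mapping for illustration only; not the package's real table.
EXAMPLE_PL_MATCHING = {
    "Python": [".py"],
    "Java": [".java"],
    "C++": [".cpp", ".hpp"],
}


def detect_language(path, pl_matching=EXAMPLE_PL_MATCHING):
    """Return the first language whose extension list contains the file's
    extension, or None when the extension is not recognized."""
    _, extension = os.path.splitext(path)
    for language, extensions in pl_matching.items():
        if extension in extensions:
            return language
    return None


print(detect_language("src/circle_linkedlist.py"))
print(detect_language("README.md"))
```

Returning `None` for unknown extensions lets the caller skip such files explicitly instead of reusing a language left over from a previous file.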
Empty file modified src/codetext/clean/__init__.py
100755 → 100644
Empty file.
Empty file modified src/codetext/clean/noise_removal.py
100755 → 100644
Empty file.