CodeClarity - Code Embeddings Made Easy

by Documatic. Sign up for free to get a more efficient codebase in 5 minutes.

About CodeClarity

This repository contains CodeClarity, a lightweight app for creating contextual embeddings of source code in a format optimized for code search and code understanding. It is part of a larger application that offers a free way to explore the capabilities of Documatic's code search tools.

Installation

We recommend Python 3.7 or higher, PyTorch 1.6.0 or higher, and transformers v4.6.0 or higher. The code does not work with Python 2.

Install with pip

Install codeclarity with pip for nightly build versions (NOTE: this package is under constant development and the pip package is subject to regular change):

pip install codeclarity

Install from sources

Alternatively, you can also clone the latest version from the repository and install it directly from the source code:

pip install -e .

PyTorch with CUDA

If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA version. Follow PyTorch - Get Started for further details on how to install PyTorch.
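After installing, a quick check along these lines can confirm whether PyTorch sees a CUDA device. This is a minimal sketch: `describe_torch_device` is a hypothetical helper written for this README, not part of codeclarity, and it only relies on `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch_device() -> str:
    """Report which device PyTorch would use, without failing if torch is missing."""
    # find_spec returns None when the package cannot be imported
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"
    import torch

    # is_available() is True only when a CUDA-enabled build finds a usable GPU
    return "cuda" if torch.cuda.is_available() else "cpu"


print("PyTorch device:", describe_torch_device())
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch build most likely does not match the local CUDA version.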

Getting Started

First download a pretrained code model.

from codeclarity import CodeEmbedder

model = CodeEmbedder(base_model = "microsoft/unixcoder-base")

Then provide some code snippets to the model. These can be full functions that could be parsed by an Abstract Syntax Tree, or small snippets.

# Two code snippets and one natural-language query
code_snippits = [
    "def read_csvs(dir): return [pd.read_csv(fp) for fp in os.listdir(dir)]",
    "def set_pytorch_device(): return torch.device('cuda') if torch.cuda.is_available() else 'cpu'",
    "read file from disk into pandas dataframe",
]
code_embeddings = model.encode(code_snippits)

And that's it! We now have a list of embeddings, returned as numpy arrays by default.

for code, embedding in zip(code_snippits, code_embeddings):
    print("Sentence:", code)
    print("Embedding:", embedding)
    print("")
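Because the third input in the example above is a natural-language query, one use for the returned embeddings is to rank code snippets against a query by cosine similarity. The sketch below assumes nothing about codeclarity beyond "one vector per input": it uses small hand-made vectors in place of real model output, and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, not part of the package.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings (an assumption
# for illustration; real embeddings have many more dimensions).
toy_code_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
toy_query_embedding = [1.0, 0.0, 0.1]

ranking = rank_by_similarity(toy_query_embedding, toy_code_embeddings)
print("Best match index:", ranking[0][0])
```

The same helpers accept numpy arrays, since they only iterate over the values. For large collections a vectorized or indexed search would be the more practical choice.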

API Drop in

This project additionally implements a docker container that serves a python REST API with the package running in it. The container serves a model that the user specifies through an environment variable. This allows those lacking a data science background to serve models with code understanding capabilities, and gives an example of how this package may be used.

To build a docker container from source, run the following:

git clone https://github.com/DocumaticAI/code-embeddings-api.git
cd code-embeddings-api/docker_api && bash ./setup.sh

Please ensure that ./setup.sh has suitable permissions, running chmod on the file if needed.

Alternatively, to run the API outside the docker container, clone the repository, navigate to the API folder, export an environment variable containing the base_model, and run the API python file directly. Note that this runs a uvicorn webserver directly, which is only suitable for development use cases and should not be done in a production environment.

git clone https://github.com/DocumaticAI/code-embeddings-api.git
cd code-embeddings-api/docker_api/app
python predictor.py
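Once the API is running, a client could send snippets to it as a JSON POST request. The sketch below only builds such a request with the standard library and does not send it. The URL, port, route, and the `code_snippets` field name are all placeholders invented for illustration; this README does not specify the real ones, so check the API source before using them.

```python
import json
import urllib.request
from typing import Sequence

# Hypothetical address: the real host, port, and route are not documented here.
ASSUMED_API_URL = "http://localhost:8080/embed"


def build_embedding_request(
    snippets: Sequence[str], url: str = ASSUMED_API_URL
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the snippets."""
    # "code_snippets" is an assumed field name, not the API's confirmed schema
    payload = json.dumps({"code_snippets": list(snippets)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_embedding_request(["def add(a, b): return a + b"])
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it, provided the placeholders above are replaced with the values the running server actually uses.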

Pre-Trained Models

We provide implementations of a range of code embedding models that are currently the state of the art in various tasks, including code semantic search, code clustering, code program detection, synthesis and more. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: CodeEmbedder('model_name').

Currently supported models

Internals of docker API

CodeClarity is designed to be a simple, modular, dockerized python application that can be used to obtain dense vector representations of natural language code queries and source code jointly, to power semantic search of codebases.

The docker API consists of a lightweight, async FastAPI application running on a gunicorn webserver. On startup, any of the supported models will be injected into the container, converted to an optimized serving format (coming soon!), and served over a REST API.

CodeClarity automatically handles checking for supported languages for code models, asynchronous dynamic batching of both code and natural language snippets, and prudent model format conversions.
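To illustrate what asynchronous dynamic batching means in general, here is a minimal asyncio sketch. It is not CodeClarity's implementation: `DynamicBatcher`, its parameters, and `fake_encode_batch` are hypothetical names, and the "model" simply returns the length of each snippet. Requests arriving within a short window are grouped and encoded in one call.

```python
import asyncio
from typing import Any, Callable, List


class DynamicBatcher:
    """Group individually submitted snippets into batches for one encode call."""

    def __init__(
        self,
        encode_batch: Callable[[List[str]], List[Any]],
        max_batch_size: int = 8,
        max_wait_s: float = 0.01,
    ) -> None:
        self._encode_batch = encode_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, snippet: str) -> Any:
        """Queue one snippet and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((snippet, future))
        return await future

    async def run(self) -> None:
        """Worker loop: wait for one item, then fill the batch until the deadline."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # A real model call would normally be offloaded to a thread or
            # process so that it does not block the event loop.
            results = self._encode_batch([snippet for snippet, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)


async def demo():
    batch_sizes: List[int] = []

    def fake_encode_batch(snippets: List[str]) -> List[int]:
        # Stand-in for a model: records the batch size, returns snippet lengths
        batch_sizes.append(len(snippets))
        return [len(s) for s in snippets]

    batcher = DynamicBatcher(fake_encode_batch, max_batch_size=4)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(s) for s in ["a", "bb", "ccc"]))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

The two knobs trade latency for throughput: a larger `max_wait_s` gives bigger batches at the cost of slower individual responses.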

Publications

The following papers are implemented or used heavily in this repo, and this project would not be possible without their work:

Contributing

If you have a bug report, question, or feature request, please open an issue. If you'd like to contribute a feature or fix, feel free to open a pull request. Some points on coding style to help your PR get merged:

  • We use black code style
  • Sort imports with isort (black profile)

About Documatic

Documatic is the company that delivers a more efficient codebase in 5 minutes. While you focus on coding, Documatic handles, creates and deploys the documentation so it's always up to date.

Getting help

If you have any questions about, feedback for, or a problem with CodeClarity:

License

This project is licensed under the Apache 2 License. Read the license for complete terms.
