- We have several functions under `logic`, including:
- `create_embedding`: leverages a window function for context and cosine similarity.
- `convert_embedding`: reshapes the embeddings so they can be passed to the clustering function, which uses Agglomerative Clustering.
- `infinite_gpt`: selects prompts depending on the scenario.
- `main.py` hosts the web server.
- In `routes.py` we define the routing of the different addresses.
- There are a few helper modules:
- In `process_file.py`, the `process_file` function reads through a file and returns embeddings of the input text.
- Traversal: We traverse all folders and files, passing a single clustered embedding to the AI, hosted on a local server using FastAPI.
- File filtering: manages which files should be uploaded for processing and which should not.
- Process: We use a [breadth-first search (BFS)](https://www.geeksforgeeks.org/breadth-first-search-or-bfs-for-a-graph/) to systematically explore all directories and files, ensuring all files are processed in a structured manner.
- Techniques:
- BFS Traversal: Efficiently adapted in `traverse.py` to traverse file directories.
- File Filtering: Excludes non-relevant files based on extensions and names, so our model is only given files containing text.
- Use of Queue: We use a queue to make sure all directories are covered.
- Next step: After gathering all the embeddings, we reshape them and pad them with zeros to ensure uniform dimensions so they can be passed to the clustering algorithm, which gives us the indices list. With the indices list we can then send prompts to the AI in chunks decided by that list.
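A minimal sketch of this traversal step, assuming a queue-based BFS with a simple extension whitelist (the whitelist and function name here are illustrative, not the exact contents of `traverse.py`):

```python
from collections import deque
from pathlib import Path

# Illustrative extension whitelist; the real traverse.py may filter differently.
TEXT_EXTENSIONS = {".py", ".js", ".md", ".txt"}

def bfs_collect_files(root: str) -> list[Path]:
    """Breadth-first traversal of a directory tree, keeping only text-like files."""
    queue = deque([Path(root)])   # the queue guarantees every directory is visited
    collected = []
    while queue:
        current = queue.popleft()
        for entry in sorted(current.iterdir()):
            if entry.is_dir():
                queue.append(entry)                # explore sub-directories later (BFS)
            elif entry.suffix in TEXT_EXTENSIONS:  # file filtering by extension
                collected.append(entry)
    return collected
```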
- Meaning: Embeddings are easier for computers to work with, as the words are represented as multi-dimensional arrays or matrices. We generate embeddings for each code file using a pre-trained language model, such as BERT (Bidirectional Encoder Representations from Transformers). This step converts the code into a numerical representation that captures its semantic meaning.
- Process: Generate embeddings for each code file using a pre-trained BERT model.
- Techniques:
- Transformer Models: BERT encapsulates both syntactic and semantic properties of code.
- Windowed Embedding Generation: For large files, the code is split into overlapping windows to maintain contextual integrity, since CodeBERT has a token limit of 512 tokens.
- Process: We handle large code files by dividing them into smaller, manageable chunks. Each chunk is processed independently, and their embeddings are aggregated to represent the entire file.
- Techniques:
- Sliding Window Approach: Ensures complete coverage of the code file. Overlapping windows help preserve the context across boundaries, which is crucial for understanding the code's overall structure.
- Embedding Aggregation: Techniques like averaging or concatenation are used to retain comprehensive context.
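A minimal sketch of windowed embedding generation under these assumptions (Hugging Face `transformers` with `microsoft/codebert-base`, mean-pooling each window and averaging across windows; the project's actual `create_embedding` may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed_file(code: str, window: int = 510, stride: int = 384) -> torch.Tensor:
    """Embed a code file with overlapping token windows (512-token limit)."""
    tokens = tokenizer.encode(code, add_special_tokens=False)
    window_embeddings = []
    for start in range(0, max(len(tokens), 1), stride):
        chunk = tokens[start:start + window]
        # Re-add [CLS]/[SEP] so each window is a valid model input.
        ids = torch.tensor([[tokenizer.cls_token_id, *chunk, tokenizer.sep_token_id]])
        with torch.no_grad():
            hidden = model(ids).last_hidden_state        # (1, seq_len, 768)
        window_embeddings.append(hidden.mean(dim=1))     # mean-pool the window
    # Aggregate overlapping windows by averaging to represent the whole file.
    return torch.cat(window_embeddings).mean(dim=0)
```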
- Process: Similar code embeddings are grouped using agglomerative clustering.
- Techniques:
- Agglomerative Clustering:
- Computing distance between every pair of objects.
- Using linkage to group objects into a hierarchical cluster tree based on the distance. Objects/clusters that are in close proximity are linked together using the linkage function.
- Determining where to cut the hierarchical tree into clusters. This creates a partition of the data.
- Cosine Similarity: We use cosine similarity as the distance metric for clustering, which is effective in high-dimensional spaces and helps in accurately measuring the similarity between code embeddings.
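A sketch of the zero-padding and clustering steps, assuming scikit-learn's `AgglomerativeClustering` with cosine distance and average linkage (the exact threshold and parameters in the project may differ):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def pad_embeddings(embeddings: list[np.ndarray]) -> np.ndarray:
    """Zero-pad 1-D embeddings of different lengths into one uniform matrix."""
    width = max(len(e) for e in embeddings)
    return np.vstack([np.pad(e, (0, width - len(e))) for e in embeddings])

def cluster_files(embeddings: list[np.ndarray], threshold: float = 0.5) -> np.ndarray:
    """Group similar file embeddings; cosine distance copes well with high dimensions."""
    matrix = pad_embeddings(embeddings)
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=threshold,  # where to cut the hierarchical tree
        metric="cosine",               # older scikit-learn versions call this `affinity`
        linkage="average",             # average linkage pairs well with cosine distance
    )
    return clustering.fit_predict(matrix)  # one cluster index per file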
- Process: Once clustered, documentation is generated by sending code snippets to a language model like GPT-3.5.
- Techniques:
- Prompt Engineering: Guides the language model to generate structured documentation.
- Batch Processing: Processes clusters in single prompts to improve efficiency.
- Mock Responses: Tests the system without real API costs.
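A hedged sketch of how one cluster of snippets might be documented in a single prompt, with a mock mode to avoid API costs (assumes the `openai` v1 client; the real `infinite_gpt` prompts and scenarios differ):

```python
from openai import OpenAI

def document_cluster(snippets: list[str], mock: bool = False) -> str:
    """Send one cluster of related code snippets to the model in a single prompt."""
    prompt = (
        "Write structured documentation (purpose, inputs, outputs) "
        "for the following related code snippets:\n\n" + "\n\n---\n\n".join(snippets)
    )
    if mock:
        # Mock response: lets the pipeline be tested without real API costs.
        return f"[mock documentation for {len(snippets)} snippets]"
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```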
In my view, the key techniques used are:
- Windowed Embedding Generation: Preserves context across the text for accurate embeddings and works within the 512-token limit.
- Hierarchical Clustering for Context Maintenance: Maintains hierarchical relationships within the codebase.
- Prompt Engineering for Structured Documentation: Allows customization of documentation needs.
- We take the clustered data with distances, find the total number of nodes, and use that to create the linkage matrix. The code follows the example given in the scikit-learn documentation (see the sketch below).
- The dendrograms are saved in the zip folder along with the documentation when document generation is done.
- Reason for choosing: Since dendrograms are generally associated with hierarchical data, agglomerative clustering is a natural fit.
- Limitations: The dendrograms only show how files are clustered; an improvement would be dendrogram nodes for file content split by functions or code sections.
- Another improvement would be to make dendrogram generation a separate module, so it can run without document generation when that is not required.
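A condensed version of that linkage-matrix construction, broadly following the scikit-learn documentation example (assumes the model was fitted with `distance_threshold` set, or `compute_distances=True`, so `distances_` is available):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    """Build the linkage matrix from a fitted AgglomerativeClustering model and plot it."""
    # Count the samples under each internal node, as in the scikit-learn example.
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1                       # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    # Linkage matrix columns: child_a, child_b, merge distance, node count.
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
    dendrogram(linkage_matrix, **kwargs)
```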
- Code analytics: I have used AST (abstract syntax trees) to derive the detailed code structure (functions, classes, methods); functions and methods are distinguished, and these relations are then placed in the knowledge graph.
- There were a few other ways to do this; a simple prompt would have been easier, but with the help of Python modules we save ourselves some API tokens and can reuse the result for the creation of knowledge graphs.
- Limitations: This does not handle errors yet (a simple try/except around the call would help); if a file contains code that fails to parse, it will break for now (see the sketch below).
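A minimal sketch of this kind of AST analysis, including the try/except guard mentioned above (the function name and output format are illustrative):

```python
import ast

def analyze_source(path: str) -> dict:
    """Extract classes, their methods, and top-level functions from a Python file."""
    structure = {"classes": {}, "functions": []}
    try:
        with open(path, encoding="utf-8") as handle:
            tree = ast.parse(handle.read())
    except (SyntaxError, UnicodeDecodeError):
        return structure  # skip files that fail to parse instead of crashing
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            # Methods are functions defined inside a class body.
            structure["classes"][node.name] = [
                item.name for item in node.body if isinstance(item, ast.FunctionDef)
            ]
        elif isinstance(node, ast.FunctionDef):
            structure["functions"].append(node.name)  # plain (module-level) function
    return structure
```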
| Node Color | Meaning |
|---|---|
| Yellow | Folder |
| Orange | File |
| Red | Class |
| Blue | Method (part of a class) |
| Green | Function |
- For the knowledge graph addition, I used the non-clustered data as it was easier to plot. I used the pyvis module to plot the graph from the analytics obtained in the previous section. It is also saved in the output docs zip file available during document generation. The knowledge graphs are interactive and give a proper insight into the code structure; they are fun to play around with. As mentioned in the assignment, the next thing that could be done is to add code retrieval, which enhances the ability to find code because the structure is well defined.
- Improvements could include using the clustering distance instead of a fixed length, but there are some formatting issues that still need to be worked on, and not many tutorials are available for pyvis networks.
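A minimal sketch of plotting one file's structure with pyvis, using the node colours from the table above (the `analysis` dict shape follows the hypothetical AST sketch earlier; the real plotting code differs):

```python
from pyvis.network import Network

COLORS = {"folder": "yellow", "file": "orange", "class": "red",
          "method": "blue", "function": "green"}

def build_graph(file_path: str, analysis: dict, out_html: str = "graph.html"):
    """Plot one file's structure as an interactive pyvis graph."""
    net = Network(directed=True)
    net.add_node(file_path, label=file_path, color=COLORS["file"])
    for cls, methods in analysis["classes"].items():
        net.add_node(cls, label=cls, color=COLORS["class"])
        net.add_edge(file_path, cls)
        for method in methods:
            node_id = f"{cls}.{method}"
            net.add_node(node_id, label=method, color=COLORS["method"])
            net.add_edge(cls, node_id)
    for func in analysis["functions"]:
        net.add_node(func, label=func, color=COLORS["function"])
        net.add_edge(file_path, func)
    net.write_html(out_html)  # interactive HTML, bundled into the docs zip
```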
Limitations And Potential Solutions
- Dependency on Pre-trained Models:
- The application relies heavily on pre-trained models like BERT for embeddings and GPT-3.5 for documentation generation.
- This dependency may limit the customization and accuracy of embeddings and documentation due to model biases and lack of domain-specific knowledge.
- As the problem statement suggested, exploring a more multi-modal approach can help.
- Static Code Analysis
- Static analysis does not capture dynamic behavior, such as runtime dependencies and interactions between modules during execution.
- Dynamic and Static Analysis Integration: Integrate dynamic analysis techniques, such as profiling, to capture runtime behavior in addition to static analysis. This provides comprehensive insights into the codebase and enhances documentation with runtime information. It also helps improve the performance of the code.
- No logging/tracing
- Tracing involves recording the sequence of executed instructions or function calls in a program. It provides a detailed log of the program’s execution flow, including the order of executed functions, their entry and exit points, and the values of variables at different points in time.
- The `logging` module in Python can help here.
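A minimal sketch of what that could look like, wrapping one processing step with entry/exit and error logging (the logger name and messages are illustrative):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("doc_generator")  # illustrative logger name

def process_file_traced(path: str):
    logger.info("processing %s", path)            # entry of the step
    try:
        ...  # the existing embedding / parsing logic would run here
    except Exception:
        logger.exception("failed while processing %s", path)
        raise
    logger.info("finished %s", path)              # exit of the step
```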
- Limited User customisation
- We can add a checkbox-style UI so the user can select what they want, and generate documentation based on those selections.
- Memory Management
- The current approach of storing and processing embeddings for numerous files consumes significant memory, particularly due to padding embeddings to the same size.
- Efficient Memory Management and Streaming Processing: Use efficient data structures (e.g., sparse matrices) and techniques like dimensionality reduction (e.g., PCA).
- Limited Error Handling
- The current system may not handle edge cases well, such as corrupted files, unsupported languages, or incomplete code snippets, potentially leading to failures or suboptimal documentation.
- Enhanced Error Handling and Robustness: Implement comprehensive validation and error handling mechanisms
- Scalability and Memory Management
- The current approach of storing and processing embeddings for numerous files consumes significant memory
- Use techniques like dimensionality reduction (PCA, kernel PCA) to store embeddings, reducing memory usage.
- Implement streaming processing for embedding generation and processing to minimize memory usage by processing data in a streaming fashion.
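A hedged sketch of what this could look like with scikit-learn: reducing the padded embedding matrix with PCA, or using `IncrementalPCA` when processing in batches (parameters are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

def reduce_embeddings(matrix: np.ndarray, n_components: int = 64) -> np.ndarray:
    """Shrink the padded embedding matrix before clustering to save memory."""
    n_components = min(n_components, *matrix.shape)
    return PCA(n_components=n_components).fit_transform(matrix)

def reduce_streaming(batches: list[np.ndarray], n_components: int = 64) -> np.ndarray:
    """IncrementalPCA fits batch by batch, so the full matrix never sits in memory."""
    ipca = IncrementalPCA(n_components=n_components)
    for batch in batches:              # each batch needs at least n_components rows
        ipca.partial_fit(batch)
    return np.vstack([ipca.transform(batch) for batch in batches])
```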
- Instead of BFS, we can maybe use some parallel processing techniques.
- BERT Variants: Explore models like RoBERTa, ALBERT, and DistilBERT, which are variations of BERT, as mentioned in the paper.
- Codex: Specifically designed for coding tasks, Codex powers GitHub Copilot and can provide context-aware code suggestions and documentation.
- GraphCodeBERT: Extends CodeBERT by incorporating data-flow information, which can improve understanding and generation of code structures.
- Memory Networks (MemNets) are neural networks that have a memory component which they can read from and write to.
- This allows long-term memorization, as discussed in the sequence modelling video; it helps the model know whether to use "is" or "are" after a long clause, e.g. "the dogs ... (long sentence)".