josh_melton, Databricks Employee

Introduction

In this article, we’ll walk through how to develop, configure, and deploy a Databricks App for understanding complex codebases. While there’s a wealth of freely available, world-class open source software to explore, it’s challenging to grasp a codebase if you’re not familiar with the programming language at hand. Many industries face this challenge: a lot of manufacturers, for example, rely on complex webs of C code, and database software tends to be written in languages that are difficult to ramp up on quickly, such as Rust or C. In this article we’ll use a Databricks App to explore the C code of SQLite, a database that’s deployed on over a trillion devices! For the end-to-end example, try the repo here.

Instructions for Setup

Before we walk through the solution, here’s the setup process for this Databricks App:

  1. Create a volume: this will be used to store the SQLite source code files
  2. Create a Serving Endpoint for your LLM. For simplicity, you can use the pay-per-token endpoint for a Databricks Foundation Model API to start.
  3. Clone the repo (if running locally, do step 4 instead): 
    • Get the App code into your Databricks environment by cloning the repo into a git folder in your workspace
  4. Clone the repo to a local directory (if running in workspace, do step 3 instead)
  5. Change the configuration file:
    • Edit the app.yaml as needed to match the path to your Volume
    • If developing locally, use the .env.sample as a guide for creating your own .env file
  6. Deploy the App: 
    • Launch the app on Databricks
    • Copy resources from your git folder to the app’s workspace directory; or, if developing locally, run the commands in your App’s ‘Edit in Your IDE’ section to deploy
    • Add access to the LLM serving endpoint referenced in the app.yaml file’s SERVING_ENDPOINT variable. Be sure you name the resource’s environment variable SERVING_ENDPOINT in all caps (or change the references in the code)
  7. Grant permission for the app to access the volume. You can find the App’s service principal information on the App’s details page and grant it access like any other user. The app will require READ VOLUME and WRITE VOLUME permissions (see the sketch after this list).
  8. Launch the app, and click ‘Download Example Files’ to populate your Volume with the SQLite source files. You can now interact with the App by:
    • Using the search bar to browse variables in the code and view their directed relationship graphs
    • Passing natural language prompts into the application’s chat interface: ask questions or give instructions to obtain AI-enhanced insights about the code in your files.
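For steps 1 and 7, you can create the volume and grant the app’s service principal access through the UI, SQL, or the Python SDK. Below is a minimal sketch using the SDK; the catalog, schema, and volume names are placeholders, and the service principal’s application ID comes from your App’s details page:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog

w = WorkspaceClient()

# Step 1: create a managed volume to hold the SQLite source files
w.volumes.create(
    catalog_name="main",            # placeholder catalog
    schema_name="code_explorer",    # placeholder schema
    name="sqlite_source",           # placeholder volume name
    volume_type=catalog.VolumeType.MANAGED,
)

# Step 7: grant the app's service principal READ VOLUME and WRITE VOLUME
w.grants.update(
    securable_type=catalog.SecurableType.VOLUME,
    full_name="main.code_explorer.sqlite_source",
    changes=[
        catalog.PermissionsChange(
            principal="<app-service-principal-application-id>",  # from the App's details page
            add=[catalog.Privilege.READ_VOLUME, catalog.Privilege.WRITE_VOLUME],
        )
    ],
)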

Solution Overview

Our solution consists of several key components:

  1. Python SDK for the catalog picker
  2. A parser to build a graph of variable dependencies
  3. Graph exploration functionality
  4. Local vector database for retrieval
  5. Llama (Databricks serving endpoint) for LLM serving

Let's dive into each component in detail.

Python SDK for Catalog Picker

We use the Databricks SDK to create a catalog picker, allowing users to navigate through catalogs, schemas, and volumes to select the C files for analysis. The CatalogPicker class in catalog_picker.py handles this functionality. This class creates a user interface for selecting files and manages the interaction with the Databricks workspace. We access the SDK through the WorkspaceClient:

from databricks.sdk import WorkspaceClient
w = WorkspaceClient()

# Build dropdown options from the catalogs visible to the app's identity
catalog_options = [
    {'label': catalog.name, 'value': catalog.name}
    for catalog in w.catalogs.list()
]

This snippet lists the catalogs, and similar snippets list the schemas, volumes, and files within the selected volume (sketched below). Once we’ve selected a file, we can also use the SDK to download its contents to the app for further analysis:

response = w.files.download(file_path=file_path)
file_contents = response.contents.read()
return file_contents.decode('utf-8')
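The schema, volume, and file pickers follow the same label/value pattern as the catalog listing. This is a rough sketch rather than the repo’s exact code; selected_catalog, selected_schema, and selected_volume are placeholders for the values the user picks in the UI:

# Schemas within the selected catalog
schema_options = [
    {'label': schema.name, 'value': schema.name}
    for schema in w.schemas.list(catalog_name=selected_catalog)
]

# Volumes within the selected schema
volume_options = [
    {'label': volume.name, 'value': volume.name}
    for volume in w.volumes.list(catalog_name=selected_catalog, schema_name=selected_schema)
]

# Files inside the selected volume, which lives under /Volumes/<catalog>/<schema>/<volume>
volume_root = f"/Volumes/{selected_catalog}/{selected_schema}/{selected_volume}"
file_options = [
    {'label': entry.name, 'value': entry.path}
    for entry in w.files.list_directory_contents(directory_path=volume_root)
    if not entry.is_directory
]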

Code Parsing

The CodeAnalyzer class in code_analyzer.py is responsible for parsing C files and building a graph of variable dependencies. This class uses regular expressions to identify variable declarations, assignments, and usages, constructing a directed graph to represent the relationships between variables. We could also use a more sophisticated library to obtain the full abstract syntax tree, but a full parser generally needs complete, compilable sources, while regex is less susceptible to challenges such as missing headers or dependencies.

def extract_variables(self, body: str, func: Function):
    """Extract variable declarations from a function body"""
    # Updated pattern to catch more variable declarations
    var_pattern = r'\b(?:static\s+)?(?:const\s+)?([a-zA-Z_][a-zA-Z0-9_]*(?:\s*\*)*)\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*(?:[=;]|[\[{])'
    # Also look for function parameters
    param_pattern = r'\b([a-zA-Z_][a-zA-Z0-9_]*(?:\s*\*)*)\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*(?:,|\))'
    for match in re.finditer(var_pattern, body):
        # process the matches: record the declared type and variable name
        var_type, var_name = match.group(1), match.group(2)
        ...
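To make the pattern concrete, here’s what it extracts from a representative (hypothetical) C declaration:

import re

# Using the var_pattern from extract_variables above on a sample declaration
var_pattern = r'\b(?:static\s+)?(?:const\s+)?([a-zA-Z_][a-zA-Z0-9_]*(?:\s*\*)*)\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*(?:[=;]|[\[{])'
line = "static int rc = SQLITE_OK;"
match = re.search(var_pattern, line)
print(match.group(1), match.group(2))  # prints: int rc  (declared type and variable name)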

We track the dependencies as a directed graph of vertices and edges to be used in the next section.
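Concretely, that graph can be represented with a networkx DiGraph. The following is an illustrative sketch of the data structure rather than the repo’s actual CodeAnalyzer implementation:

import networkx as nx

class DependencyGraphSketch:
    def __init__(self):
        # Directed graph: an edge A -> B means variable B depends on variable A
        self.dependency_graph = nx.DiGraph()

    def add_variable(self, name: str, var_type: str, function: str):
        # Each vertex is a variable, annotated with its type and enclosing function
        self.dependency_graph.add_node(name, type=var_type, function=function)

    def add_dependency(self, source: str, target: str):
        # Record that `target` is assigned from (i.e. depends on) `source`
        self.dependency_graph.add_edge(source, target)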

Graph Exploration

The CodeAnalyzer class also provides methods for exploring the dependency graph. The get_variable_dependencies(), get_variable_info(), and visualize_dependencies() methods allow users to explore the relationships between variables. Since we have a directed graph of the dependencies of all of the code, we can pick out the subgraph that’s related to only the user-selected variable. By default we only include the top 25 most highly related variables, as determined by a centrality-based graph algorithm.

import networkx as nx

# Rank the upstream (predecessor) variables by eigenvector centrality
upstream = list(self.dependency_graph.predecessors(node))
upstream_subgraph = self.dependency_graph.subgraph(upstream)
centrality = nx.eigenvector_centrality(upstream_subgraph)
upstream = sorted(upstream, key=lambda x: centrality.get(x, 0), reverse=True)
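From there, keeping only the most central neighbors and re-extracting the subgraph might look roughly like this (a sketch; the constant and variable names are illustrative):

# Keep the 25 most central upstream variables plus the selected node itself
MAX_RELATED = 25
top_upstream = upstream[:MAX_RELATED]
display_subgraph = self.dependency_graph.subgraph(top_upstream + [node])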

Local Vector Database

We use a local vector database to enable efficient semantic search through the code. We could pass the text of the entire file, but at times that might be too much context. We can improve cost and latency by passing only the important parts of the code retrieved from the Vector DB. We could use Databricks Vector Search for extremely expansive codebases, but odds are your data isn’t large enough to require a massively scalable solution. The CodeVectorStore class in vector_store.py handles the semantic search functionality, and is called from chat_interface.py:

# vector_store.py
for i, chunk in enumerate(chunks):
    chunk_id = self._generate_id(f"{file_path}:{i}:{chunk}")
    ids.append(chunk_id)
    documents.append(chunk)
    metadatas.append({
        "file_path": file_path,
        "chunk_index": i,
        "total_chunks": len(chunks)
    })

# Add to collection
self.collection.add(
    ids=ids,
    documents=documents,
    metadatas=metadatas
)
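The chunks list above comes from splitting the source file before indexing. A simple line-based splitter is sketched here as an illustration; the repo’s actual chunking strategy may differ:

def _chunk_code(self, text: str, lines_per_chunk: int = 40) -> list[str]:
    # Split the file into fixed-size blocks of lines so each chunk stays
    # well within the embedding model's context window
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]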


# chat_interface.py
search_results = self.code_analyzer.search_code(last_message + code_context)


Note that since we’ve parsed the code, we’re retrieving code related to the entire dependency tree of the selected variable in addition to the user’s question. This provides a more in-depth context window to our LLM.
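Under the hood, search_code ultimately runs a similarity query against the local vector store. A sketch of what that query might look like, assuming a local ChromaDB collection consistent with the collection.add() call above (the actual method lives in vector_store.py):

def search_code(self, query: str, n_results: int = 5) -> list[str]:
    # Embed the query and return the most semantically similar code chunks
    results = self.collection.query(
        query_texts=[query],
        n_results=n_results,
    )
    # Chroma returns one list of documents per query text
    return results["documents"][0]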

LLM-Powered Explanation

We integrate a Llama model through a Databricks serving endpoint to provide AI-powered assistance. The ChatInterface class in chat_interface.py manages the interaction with the LLM, passing the user’s question along with the retrieved code context so the model can explain the code. We can easily swap in different serving endpoints using the Databricks App resource, which is referenced in app.py via endpoint_name=os.getenv("SERVING_ENDPOINT"). For simplicity, you can use the pay-per-token endpoint to start. We use the Databricks Python SDK as a simple way to call the serving endpoint:

response = self.w.serving_endpoints.query(
    name=self.endpoint_name,
    messages=api_messages,
)
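For context, api_messages is simply the conversation plus the retrieved code. A minimal sketch of how those messages could be assembled with the SDK’s chat types, and how the reply is read back (the actual prompt construction lives in chat_interface.py):

from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

# Prepend a system prompt and the retrieved code chunks, then the user's question
api_messages = [
    ChatMessage(
        role=ChatMessageRole.SYSTEM,
        content="You are a code assistant. Answer using the provided C snippets.",
    ),
    ChatMessage(role=ChatMessageRole.USER, content="\n\n".join(search_results)),
    ChatMessage(role=ChatMessageRole.USER, content=last_message),
]

response = self.w.serving_endpoints.query(
    name=self.endpoint_name,
    messages=api_messages,
)
answer = response.choices[0].message.content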

Conclusion

By combining these components, we've created a powerful Databricks App that enables developers to explore and understand complex C codebases like SQLite’s. The app provides an intuitive interface for navigating the code, visualizing variable dependencies, and leveraging AI-powered assistance for code comprehension.

This solution demonstrates the power of Databricks Apps in creating tools for specialized, in-depth analysis of unstructured data (in this case, C files). By using the Databricks SDK to interact with Databricks resources alongside some AI-generated Python code for analysis, we've built an App for high-level code exploration that can be extended to other complex codebases or programming languages. For the end-to-end example, try the repo here.