Oil and gas companies are no strangers to data silos. Despite decades of drilling and technological advances, extremely important well information, carefully recorded in the field, often winds up scattered across proprietary tools, legacy databases, or aging file servers. A great example of this is well log data.
Well logging is the process of recording geological and fluid properties from inside a borehole using specialized equipment, like a logging truck and cable, as shown in the image below.
Despite its value, well log data often remains siloed in proprietary systems and outdated formats, restricting collaboration and analytics. Traditional tools for accessing formats like Log ASCII Standard (LAS) are cumbersome and expensive, keeping vital subsurface insights out of reach.
Modern solutions, such as Databricks AI Functions, provide a path to unlock and scale access to well log data, enabling teams to leverage advanced analytics, cloud integrations, and rapid decision-making across their operations.
LAS files serve as the industry standard for storing and transferring well log data within the oil and gas sector. Developed by the Canadian Well Logging Society in the 1990s, LAS files contain crucial geological, geophysical, and petrophysical measurements that provide detailed insights into subsurface formations.
LAS files follow a structured ASCII format consisting of several distinct sections:
- ~VERSION: the LAS version and wrap mode
- ~WELL: well identification and metadata such as start/stop depth, step, null value, company, field, and location
- ~CURVE: the mnemonic, unit, and description for each recorded measurement
- ~PARAMETER (optional): job parameters such as mud properties and logging conditions
- ~OTHER (optional): free-form comments
- ~ASCII: the depth-indexed measurement data itself
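For illustration, an abbreviated and entirely hypothetical LAS 2.0 file might look roughly like this (the well name, depths, and readings are made up):

~VERSION INFORMATION
 VERS.            2.0 : CWLS LOG ASCII STANDARD - VERSION 2.0
 WRAP.            NO  : ONE LINE PER DEPTH STEP
~WELL INFORMATION
 STRT.FT     1670.000 : START DEPTH
 STOP.FT     1671.000 : STOP DEPTH
 STEP.FT        0.500 : STEP
 NULL.        -999.25 : NULL VALUE
 WELL.  EXAMPLE WELL 1 : WELL NAME
~CURVE INFORMATION
 DEPT.FT              : DEPTH
 GR  .GAPI            : GAMMA RAY
 DT  .US/FT           : DELTA-T (SONIC)
 RT  .OHMM            : RESISTIVITY
 SP  .MV              : SPONTANEOUS POTENTIAL
~ASCII
 1670.000   85.20   68.50   12.30  -45.10
 1670.500   87.10   69.00   11.80  -46.00
 1671.000  -999.25  70.20   11.20  -44.70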
Well log data contains a wealth of geological intelligence that can drive critical business decisions:
These curves are usually shown on a log plot, often in isolation, as demonstrated below. However, combining this data with other operational information and applying advanced analytics and modeling can offer significant value to oil and gas operators: rock properties can be interpreted alongside additional operational details such as drilling parameters and production performance.
Databricks AI Functions represent a paradigm shift in how organizations can apply artificial intelligence directly within their data processing workflows. These functions enable users to leverage the power of large language models (LLMs) and other AI capabilities through simple SQL queries, eliminating the need for complex model deployments or specialized AI infrastructure.
The Databricks platform offers several specialized AI functions designed for different use cases: task-specific functions such as ai_extract, ai_classify, ai_summarize, ai_translate, and ai_analyze_sentiment, plus the general-purpose ai_query function for sending custom prompts to any model serving endpoint.
The ai_query function stands out as the most flexible option, allowing users to interact with foundation models using custom prompts. This function accepts several key parameters that make it particularly well-suited for processing semi-structured data like LAS files:
Key parameters:
- endpoint: the model serving endpoint to call, such as a Foundation Model API endpoint
- request: the prompt sent to the model, which can be built dynamically with functions like CONCAT
- returnType: the expected type of the response
- responseFormat: an optional JSON schema that constrains the model output to a structured format
- failOnError: controls whether a malformed response fails the query or is surfaced alongside the result
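Put together, a minimal ai_query call looks something like the sketch below. The endpoint name matches the one used later in this post, while the table and column names (las_raw_text, las_text) are hypothetical placeholders:

SELECT
  ai_query(
    endpoint => 'databricks-gpt-oss-120b',
    request => CONCAT('List the curve mnemonics found in the ~CURVE section of this LAS file: ', las_text),
    returnType => 'STRING'
  ) AS curve_summary
FROM las_raw_text;  -- hypothetical table holding raw LAS file text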
The first step is storing these files within Unity Catalog Volumes, which can be completely governed and secured to ensure only the appropriate personnel can access these files:
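A minimal sketch of that setup is shown below; the catalog, schema, volume, and group names are illustrative:

-- Create a governed location for raw LAS files (names are illustrative)
CREATE VOLUME IF NOT EXISTS main.geoscience.las_files
  COMMENT 'Raw LAS well log files';

-- Restrict read access to the subsurface team (hypothetical group)
GRANT READ VOLUME ON VOLUME main.geoscience.las_files TO `subsurface-team`;

-- Confirm the uploaded files are visible
LIST '/Volumes/main/geoscience/las_files/';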
Once files have been uploaded and governed via Unity Catalog, the underlying text can be accessed through the read_files SQL call, which can directly read LAS files in their original format and maintain the semi-structured schema defined above.
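For example, assuming the files landed in the volume created above, a sketch of that read looks like this:

SELECT
  value AS las_content   -- one row per file, containing the full LAS text
FROM read_files(
  '/Volumes/main/geoscience/las_files/',
  format => 'text',
  wholeText => true      -- read each file as a single string rather than line by line
);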
This approach treats each LAS file as a complete text object, preserving the structured format that AI models need to understand the geological context and data relationships.
The key to a high-performing extraction function is being intentional about the ai_query parameters described above. The power of ai_query lies in combining detailed instructions with the raw LAS file content, giving the foundation model the best chance of extracting the data in the format we need:
This is where the real magic happens! By crafting specific instructions—such as "Identify ALL curve mnemonics from ~CURVE section" or "Sections start with ~ (tilde) character"—users can fine-tune the behavior of the foundation model each time it’s called. For more detailed information on crafting an effective prompt, visit this blog that thoroughly discusses the elements of a good prompt. Whether the goal is to extract high-level summaries or detailed, curve-by-curve analysis, adjusting the prompt text tailors the interpretive lens for each run. This flexibility enables teams to iterate rapidly, test hypotheses, and deploy workflows that adapt to evolving requirements.
In addition to the instructions provided in the request, the responseFormat parameter enables enforcement of structured output. For this example, we aim to return a JSON object that adheres to a specific schema.
This ensures consistent, parseable results that integrate easily into downstream analytics workflows and eliminates the need for complex post-processing of AI responses. Other parameters can be defined here as well, such as failOnError, which controls whether the query fails outright or surfaces the error alongside the result when a response does not meet the desired output format.
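For instance, with failOnError set to false the response comes back as a struct containing result and errorMessage fields, so problem responses can be inspected rather than failing the whole query. A minimal sketch (the prompt is illustrative):

SELECT
  response.result,        -- the model output, when the call succeeded
  response.errorMessage   -- populated instead of failing the query when it did not
FROM (
  SELECT ai_query(
    endpoint => 'databricks-gpt-oss-120b',
    request => 'List the curve mnemonics that appear in the ~CURVE section of this LAS header.',
    failOnError => false
  ) AS response
);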
With the ai_query call completely filled out, we can now package it into a reusable SQL function that can be easily leveraged within Lakeflow Declarative Pipelines, notebooks, or SQL editors. These functions are stored within Unity Catalog, allowing them to be governed and secured, with access restricted to the appropriate personnel. The complete code is presented below. The function accepts a text input, in this case our LAS file contents coming from a read_files() call, and returns a structured object that matches the schema defined in the responseFormat.
CREATE OR REPLACE FUNCTION {catalog}.{schema}.extract_las_info(input STRING)
RETURNS STRUCT<
  well: STRING,
  null_value: DOUBLE,
  api_number: STRING,
  curve_data: ARRAY<STRUCT<
    depth: DOUBLE,
    gamma_ray: DOUBLE,
    delta_t: DOUBLE,
    resistivity: DOUBLE,
    sp: DOUBLE
  >>
>
RETURN from_json(ai_query(
  endpoint => 'databricks-gpt-oss-120b',
  request => CONCAT(
    '''
    You are an expert drilling engineer and well log analyst with 15+ years of experience parsing LAS (Log ASCII Standard) files.
    CRITICAL INSTRUCTIONS:
    1. This is a LAS file containing well log data from oil/gas drilling operations
    2. Parse ALL sections methodically: ~VERSION, ~WELL, ~CURVE, ~PARAMETER (if present), ~ASCII data
    3. PRESERVE original depth values - do NOT interpolate, average, or modify any measurements
    4. Extract curve data for ALL available log types (not just the schema examples)
    5. Handle null values properly (commonly -999.25, -9999, or as specified in NULL field)
    6. Maintain data precision as recorded in the file
    7. If any section is missing or corrupted, note this in the response
    WELL INFORMATION EXTRACTION:
    - Parse ~WELL section for: WELL name, COMP (company), FLD (field), LOC (location),
      CNTY (county), STAT (state), CTRY (country), STRT (start depth), STOP (stop depth),
      STEP (step interval), NULL (null value), UWI/API (well identifier)
    CURVE DATA EXTRACTION:
    - Identify ALL curve mnemonics from ~CURVE section (e.g., DEPT, GR, NPHI, RHOB, RT, SP, etc.)
    - Extract complete depth series with ALL available log curves
    - Common curve types include: Gamma Ray (GR), Neutron Porosity (NPHI), Bulk Density (RHOB),
      Resistivity (various: RT, ILD, MSFL), Spontaneous Potential (SP), Photoelectric Factor (PEF),
      Caliper (CALI), Delta-T/Sonic (DT), and many others
    - DO NOT assume only specific curves exist - extract whatever is available
    DATA QUALITY CHECKS:
    - Verify depth progression is logical (increasing or decreasing consistently)
    - Flag any depth gaps or overlaps
    - Note data density and any sparse sections
    - Identify outlier values that may indicate data quality issues
    PARSING RULES:
    - Sections start with ~ (tilde) character
    - In header sections, data format is: MNEM.UNIT DATA :DESCRIPTION
    - ASCII data section has space-separated columns matching curve order
    - Handle wrapped lines if WRAP.YES is specified
    - Respect the null value specified in the file for missing data
    - Return a complete JSON object with this exact structure, including ALL curves found in the file:
    ''',
    input
  ),
  returnType => 'JSON',
  responseFormat => '{
    "type": "json_schema",
    "json_schema": {
      "name": "las_extraction",
      "schema": {
        "type": "object",
        "properties": {
          "well": {"type": "string"},
          "null_value": {"type": "number"},
          "api_number": {"type": "string"},
          "curve_data": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "depth": {"type": "number"},
                "gamma_ray": {"type": "number"},
                "delta_t": {"type": "number"},
                "resistivity": {"type": "number"},
                "sp": {"type": "number"}
              }
            }
          }
        }
      },
      "strict": true
    }
  }',
  failOnError => false
).result,
'STRUCT<well:STRING,
        null_value:DOUBLE,
        api_number:STRING,
        curve_data:ARRAY<STRUCT<depth:DOUBLE, gamma_ray:DOUBLE, delta_t:DOUBLE, resistivity:DOUBLE, sp:DOUBLE>>>'
);
With the custom AI function now producing structured, actionable output, turning each LAS file into analytics-ready data is simple. By applying the function to files stored in Unity Catalog Volumes and using the explode operation, the curve_data arrays are expanded into tabular rows, ready for storage in your lakehouse and for use in downstream analyses, as shown in the sketch below.
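A sketch of that pattern, reusing the illustrative catalog, schema, and volume names from earlier:

CREATE OR REPLACE TABLE main.geoscience.well_log_curves AS
SELECT
  parsed.well,
  parsed.api_number,
  parsed.null_value,
  curve.depth,
  curve.gamma_ray,
  curve.delta_t,
  curve.resistivity,
  curve.sp
FROM (
  SELECT main.geoscience.extract_las_info(value) AS parsed
  FROM read_files(
    '/Volumes/main/geoscience/las_files/',
    format => 'text',
    wholeText => true
  )
)
LATERAL VIEW explode(parsed.curve_data) curves AS curve;  -- one row per depth sample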
These custom SQL functions aren't just convenience tools; they underpin robust, production-grade analytics. Run them in ad hoc queries, trigger them automatically with Lakeflow pipelines when each new file lands (see the sketch below), or embed them in batch jobs. Centralizing AI-powered geological interpretation in this way gives teams both operational strength and agility, eliminating the need to rewrite business logic for every new workflow.
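As one sketch of the pipeline pattern, a Lakeflow Declarative Pipelines streaming table can apply the function incrementally as new files arrive, assuming the same read_files options are supported in streaming mode (names remain illustrative):

CREATE OR REFRESH STREAMING TABLE extracted_las_logs AS
SELECT
  main.geoscience.extract_las_info(value) AS parsed
FROM STREAM read_files(
  '/Volumes/main/geoscience/las_files/',
  format => 'text',
  wholeText => true
);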
Once LAS files are processed through the AI pipeline, the extracted insights become the foundation for advanced analytics that drive significant business value across exploration, development, and production operations. Some of those analytics projects include the following:
This revolutionary approach transforms how oil and gas companies extract value from their subsurface data assets. By breaking free from proprietary software silos and leveraging the power of Databricks AI Functions, organizations can democratize access to geological insights, accelerate decision-making, and unlock new opportunities for operational excellence.
The future of well log analysis lies not in expensive, specialized software packages, but in open, AI-powered platforms that put the power of advanced analytics directly into the hands of geoscientists, engineers, and data teams. With Databricks ai_query processing LAS files at scale, that future is already here.