Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to run a SQL query/notebook present on S3 on a Databricks cluster?

Kratik
New Contributor III

I have code files (.hql) in S3 which were running on Hive. But now, I need to run them on a Databricks cluster.
I can rename the files to .sql and add the comment "-- Databricks notebook source" at the top of each file so it is treated as a single-cell notebook. But I can't find a way to run such a file directly on Databricks.

Also, I don't see any way to bring multiple files from S3 into the Databricks workspace, since running a notebook is supported only from two sources - Workspace or Git.

10 REPLIES

Tharun-Kumar
Honored Contributor II

@Kratik 
You could use Python to read those files from the directory and get the contents of each file into a variable.

You can then use that variable in a spark.sql statement like the one below:

%python
# Read the SQL file into a variable (the path here is just an example)
with open("/dbfs/path/to/query.sql") as f:
    sql_contents = f.read()

spark.sql(sql_contents)

You can wrap this in a function and call it in a for loop to iterate through all the files in your directory, as sketched below.
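
A minimal sketch of that loop, assuming the .sql files have already been copied to a DBFS directory (the path is only a placeholder):

%python
import os

sql_dir = "/dbfs/tmp/hql_scripts"  # placeholder directory

def run_sql_file(path):
    # Read a single .sql file and execute its contents with spark.sql
    with open(path) as f:
        spark.sql(f.read())

for name in os.listdir(sql_dir):
    if name.endswith(".sql"):
        run_sql_file(os.path.join(sql_dir, name))

Note that spark.sql() runs a single statement, so files containing several statements may need to be split (for example on ';') before being passed in.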

Kratik
New Contributor III

This is one option, but I don't think it's the best approach. If there are junk or special characters in the files, things could break or a file might not be read completely.

 

Also, each of my SQL files has multiple variables that need to be passed as arguments at runtime, so I'd prefer to run them as files.

Debayan
Esteemed Contributor III

Hi, you can connect to S3, get the files, and then run them, following: https://docs.databricks.com/storage/amazon-s3.html
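
A minimal sketch, assuming the cluster already has S3 access configured as described in that doc (the bucket and key are only placeholders):

%python
# Read the whole .sql file from S3 as a single string, then execute it
sql_text = spark.read.text("s3://my-bucket/scripts/query.sql", wholetext=True).first()[0]
spark.sql(sql_text)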
Let us know if this helps or if something else is expected.

 Please tag @Debayan  with your next comment which will notify me. Thanks!

Kratik
New Contributor III

@Debayan 

The approach you suggested is for accessing a file on S3 from within a notebook. What I am trying to achieve is to execute the file on S3 directly as a notebook on the Databricks cluster. To have the SQL files recognized as Databricks notebooks, I will add the comment -- Databricks notebook source at the top.

 

Each of my SQL files has multiple variables that need to be passed at runtime.

Kratik
New Contributor III

Since I didn't find a way to run SQL code directly from S3, I moved ahead by importing the S3 files into the Databricks workspace using the API, following the steps below (a short sketch of the whole flow comes after step 3):

1. Added -- Databricks notebook source at the top of each file so that it is treated as a Databricks notebook.

2. Created one notebook with the logic to import the SQL files from S3 into the Databricks workspace.

The logic includes:

  • Read the S3 file using boto3.
  • Convert the content of the file to base64 and then turn it into a string.
  • Pass the base64-encoded string in the content field:

    import_params = {
        "path": dbx_workspace_path,  # Replace with the target Databricks path
        "format": "SOURCE",
        "language": "SQL",
        "content": base64_string,
        "overwrite": True
    }

    Pass this import_params dictionary as the payload of a POST request to the endpoint api/2.0/workspace/import. The URL will look like
    url = f"https://{dbx_host}/api/2.0/workspace/import"

    Note: The dbx_workspace_path should already exist. If it doesn't, you can create it through the UI or through the API endpoint api/2.0/workspace/mkdirs, which can create the full path at once. For example, folder1/folder2/folder3: if none of the three folders are present and you provide this path in the payload, it will create them all in the given hierarchy.

3. Once the notebook is available in the Databricks workspace, it can be run via Jobs or the Databricks Jobs API.
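
Here is a minimal sketch of that import flow, assuming a personal access token for authentication; the host, token, bucket, key, and workspace path are only placeholders:

import base64
import boto3
import requests

dbx_host = "your-workspace.cloud.databricks.com"   # placeholder host
dbx_token = "<personal-access-token>"              # placeholder token
headers = {"Authorization": f"Bearer {dbx_token}"}

# Read the .sql file from S3 using boto3 (bucket and key are placeholders)
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="scripts/query.sql")
file_bytes = obj["Body"].read()

# Base64-encode the contents and convert to a string
base64_string = base64.b64encode(file_bytes).decode("utf-8")

# Ensure the target folder exists (mkdirs creates the whole hierarchy at once)
dbx_workspace_path = "/Shared/imported_sql/query"  # placeholder target path
requests.post(f"https://{dbx_host}/api/2.0/workspace/mkdirs",
              headers=headers,
              json={"path": "/Shared/imported_sql"})

# Import the file as a SOURCE-format SQL notebook
import_params = {
    "path": dbx_workspace_path,
    "format": "SOURCE",
    "language": "SQL",
    "content": base64_string,
    "overwrite": True
}
resp = requests.post(f"https://{dbx_host}/api/2.0/workspace/import",
                     headers=headers,
                     json=import_params)
resp.raise_for_status()

Once the import succeeds, the notebook at dbx_workspace_path can be referenced from a job definition.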
     

 

Kratik
New Contributor III

@Debayan, you can see this response.

Debayan
Esteemed Contributor III

Hi, sorry, I missed seeing the post. Did the issue get resolved?

Kratik
New Contributor III

Hi @Debayan , 

I have posted the approach I followed in my previous comment. It seems running a file directly from S3 is not possible.

Debayan
Esteemed Contributor III

Hi Kratik, the recommended practice would be to import the file and run it that way, as otherwise the execution can be slow depending on the network configuration or network speed.

Kratik
New Contributor III

@Debayan That's what I described in detail in my previous response: import the file into the workspace and execute it.
