Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to run a SQL query/notebook present on S3 on a Databricks cluster?

Kratik
New Contributor III

I have code files (.hql) in S3 which were running on Hive. But now, I need to run them on a Databricks cluster.
I can rename the files to .sql and add the comment "-- Databricks notebook source" at the top of each file so it is treated as a single-cell notebook. But I can't find a way to run such a file directly on Databricks.

Also, I don't see any way to bring multiple files from S3 into the Databricks workspace, since running a notebook is supported only from two sources - Workspace or Git.

10 REPLIES

Tharun-Kumar
Honored Contributor II

@Kratik 
You could use Python to read those files from the directory and get the contents of each file into a variable.

You can then use that variable in a spark.sql statement like the one below:

%python
# Read the SQL file into a variable (the path here is just an example)
with open("/dbfs/path/to/query.sql") as f:
    sql_contents = f.read()

spark.sql(sql_contents)

You can wrap this in a function and call it in a for loop to iterate through all the files in your directory, as sketched below.
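
A minimal sketch of that loop, assuming the .sql files have already been copied to a DBFS directory (the path is only a placeholder):

%python
import os

sql_dir = "/dbfs/tmp/hql_scripts"  # placeholder directory

def run_sql_file(path):
    # Read a single .sql file and execute its contents with spark.sql
    with open(path) as f:
        spark.sql(f.read())

for name in os.listdir(sql_dir):
    if name.endswith(".sql"):
        run_sql_file(os.path.join(sql_dir, name))

Note that spark.sql() runs a single statement, so files containing several statements may need to be split (for example on ';') before being passed in.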

Kratik
New Contributor III

This is one option, but I don't think it's the best approach. If there are junk or special characters in the files, things could break or a file might not be read completely.

 

Also, each of my SQL files has multiple variables that need to be passed as arguments at runtime, so I'd prefer to run them as files.

Debayan
Esteemed Contributor III

Hi, you can connect to S3, get the files, and then run them, following: https://docs.databricks.com/storage/amazon-s3.html
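
A minimal sketch, assuming the cluster already has S3 access configured as described in that doc (the bucket and key are only placeholders):

%python
# Read the whole .sql file from S3 as a single string, then execute it
sql_text = spark.read.text("s3://my-bucket/scripts/query.sql", wholetext=True).first()[0]
spark.sql(sql_text)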
Let us know if this helps or if something else is expected.

 Please tag @Debayan  with your next comment which will notify me. Thanks!

Kratik
New Contributor III

@Debayan 

The approach you suggested is for accessing a file on S3 from within a notebook. What I am trying to achieve is to execute the file on S3 directly as a notebook on the Databricks cluster. To have the SQL files recognized as Databricks notebooks, I will add the comment -- Databricks notebook source at the top.

 

Each of my SQL files has multiple variables that need to be passed at runtime.

Kratik
New Contributor III

Since I didn't find a way to run SQL code directly from S3, I moved ahead by importing the S3 files into the Databricks workspace using the API, following the steps below (a short sketch of the whole flow comes after step 3):

1. Added -- Databricks notebook source at the top of each file so that it is treated as a Databricks notebook.

2. Created one notebook with the logic to import the SQL files from S3 into the Databricks workspace.

The logic includes:

  • Read the S3 file using boto3.
  • Convert the content of the file to base64 and then turn it into a string.
  • Pass the base64-encoded string in the content field:

    import_params = {
        "path": dbx_workspace_path,  # Replace with the target Databricks path
        "format": "SOURCE",
        "language": "SQL",
        "content": base64_string,
        "overwrite": True
    }

    Pass this import_params dictionary as the payload of a POST request to the endpoint api/2.0/workspace/import. The URL will look like
    url = f"https://{dbx_host}/api/2.0/workspace/import"

    Note: The dbx_workspace_path should already exist. If it doesn't, you can create it through the UI or through the API endpoint api/2.0/workspace/mkdirs, which can create the full path at once. For example, folder1/folder2/folder3: if none of the three folders are present and you provide this path in the payload, it will create them all in the given hierarchy.

3. Once the notebook is available in the Databricks workspace, it can be run via Jobs or the Databricks Jobs API.
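
Here is a minimal sketch of that import flow, assuming a personal access token for authentication; the host, token, bucket, key, and workspace path are only placeholders:

import base64
import boto3
import requests

dbx_host = "your-workspace.cloud.databricks.com"   # placeholder host
dbx_token = "<personal-access-token>"              # placeholder token
headers = {"Authorization": f"Bearer {dbx_token}"}

# Read the .sql file from S3 using boto3 (bucket and key are placeholders)
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="scripts/query.sql")
file_bytes = obj["Body"].read()

# Base64-encode the contents and convert to a string
base64_string = base64.b64encode(file_bytes).decode("utf-8")

# Ensure the target folder exists (mkdirs creates the whole hierarchy at once)
dbx_workspace_path = "/Shared/imported_sql/query"  # placeholder target path
requests.post(f"https://{dbx_host}/api/2.0/workspace/mkdirs",
              headers=headers,
              json={"path": "/Shared/imported_sql"})

# Import the file as a SOURCE-format SQL notebook
import_params = {
    "path": dbx_workspace_path,
    "format": "SOURCE",
    "language": "SQL",
    "content": base64_string,
    "overwrite": True
}
resp = requests.post(f"https://{dbx_host}/api/2.0/workspace/import",
                     headers=headers,
                     json=import_params)
resp.raise_for_status()

Once the import succeeds, the notebook at dbx_workspace_path can be referenced from a job definition.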
     

 

Kratik
New Contributor III

@Debayan, you can see this response.

Debayan
Esteemed Contributor III

Hi, sorry, I missed seeing the post. Did the issue get resolved?

Kratik
New Contributor III

Hi @Debayan , 

I have posted the approach I followed in my previous comment. It seems running a file directly from S3 is not possible.

Debayan
Esteemed Contributor III

Hi Kratik, the recommended practice would be to import the file and run it that way, as otherwise the execution can be slow depending on the network configuration or network speed.

Kratik
New Contributor III

@Debayan That's what I described in detail in my previous response: import the file into the workspace and execute it.
