<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Is there a Databricks spark connector for java? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/is-there-a-databricks-spark-connector-for-java/m-p/119296#M46398</link>
    <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/154481"&gt;@Shua42&lt;/a&gt;.&lt;BR /&gt;&lt;BR /&gt;I am using the&amp;nbsp;&lt;SPAN&gt;JDBC URL of a running SQL warehouse to write data directly to a Databricks table from my local machine, but the write performance is poor. I tried adding `batchSize` and `numPartitions`, but the performance did not improve at all. Below is the snippet I am using.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="java"&gt;SparkSession spark = SparkSession.builder()
    .appName("JsonToDatabricksLocalJDBC")
    .master("local[*]")
    .config("spark.driver.memory", driverMemory)
    .config("spark.sql.warehouse.dir", "spark-warehouse-" + System.currentTimeMillis())
    .getOrCreate();

Dataset&amp;lt;Row&amp;gt; rawDf = spark.read().schema(expectedSchema).json(inputJsonPath);

Dataset&amp;lt;Row&amp;gt; transformedDf = rawDf.select(
        coalesce(col("app_name"), lit("UnknownApp")).alias("APP_NAME"),
        coalesce(col("event.event_name"), lit("UnknownEvent")).alias("EVENT_NAME"),
        coalesce(col("event.event_code"), lit("")).alias("EVENT_CODE"),
        to_json(col("event.event_attributes")).alias("EVENT_ATTRIBUTES"),
        to_json(col("event.user_attributes")).alias("USER_ATTRIBUTES"),
        to_json(col("event.device_attributes")).alias("DEVICE_ATTRIBUTES")
    );

Properties dfWriteJdbcProperties = new Properties();
dfWriteJdbcProperties.put("user", "token");
dfWriteJdbcProperties.put("password", dbToken);

transformedDf.write()
    .mode(SaveMode.Append)
    .option("batchsize", String.valueOf(jdbcBatchSize))
    .option("numPartitions", String.valueOf(jdbcNumPartitionsForWrite))
    .jdbc(jdbcUrl, fullTableNameInDb, dfWriteJdbcProperties);&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;Please suggest how I can improve the write performance, since I have to insert a large volume of data into the Databricks table. I have also attached a screenshot of the SQL warehouse configuration.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 15 May 2025 10:03:15 GMT</pubDate>
    <dc:creator>I-am-Biplab</dc:creator>
    <dc:date>2025-05-15T10:03:15Z</dc:date>
    <item>
      <title>Is there a Databricks spark connector for java?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-there-a-databricks-spark-connector-for-java/m-p/119121#M46396</link>
      <description>&lt;P&gt;Is there a Databricks Spark connector for Java, just like the one we have for Snowflake (&lt;A href="https://docs.snowflake.com/en/user-guide/spark-connector-use" target="_blank"&gt;https://docs.snowflake.com/en/user-guide/spark-connector-use&lt;/A&gt;)?&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Essentially, the use case is to transfer data from S3 to a Databricks table. In the current implementation, I am using Spark to read data from S3 and JDBC to write data to Databricks, but I want to use Spark to write to Databricks as well.&lt;/P&gt;</description>
      <pubDate>Wed, 14 May 2025 06:25:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-there-a-databricks-spark-connector-for-java/m-p/119121#M46396</guid>
      <dc:creator>I-am-Biplab</dc:creator>
      <dc:date>2025-05-14T06:25:03Z</dc:date>
    </item>
    <item>
      <title>Re: Is there a Databricks spark connector for java?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-there-a-databricks-spark-connector-for-java/m-p/119236#M46397</link>
      <description>&lt;P&gt;Hey there&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/164381"&gt;@I-am-Biplab&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;I'm a bit confused by the ask here. I'm assuming your code isn't running on a Databricks cluster, in which case you can use the JDBC URL of a running SQL warehouse to write data directly to a Databricks table. See the example code below:&lt;/P&gt;
&lt;LI-CODE lang="java"&gt;        Properties connectionProperties = new Properties();
        connectionProperties.setProperty("user", "token");
        connectionProperties.setProperty("password", "&amp;lt;DATABRICKS_PERSONAL_ACCESS_TOKEN&amp;gt;");

        String jdbcUrl = "jdbc:databricks://&amp;lt;workspace-hostname&amp;gt;:443/default;transportMode=http;ssl=1;httpPath=&amp;lt;sql-warehouse-http-path&amp;gt;";

        // Write to Databricks table
        df.write()
                .mode("append") // or "overwrite"
                .jdbc(jdbcUrl, "your_table_name", connectionProperties);&lt;/LI-CODE&gt;
&lt;P&gt;If you are running on a Databricks cluster, you should be able to write directly to a table with:&lt;/P&gt;
&lt;LI-CODE lang="java"&gt;Dataset&amp;lt;Row&amp;gt; df = spark.read()
    .format("parquet") // or "csv", "json", etc., depending on your data format
    .load("s3a://your-bucket/path/to/data");

df.write()
    .format("delta")
    .mode("append") // or "overwrite" as per your requirement
    .saveAsTable("your_catalog.your_schema.your_table");&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;Let me know if I'm understanding your ask correctly.&lt;/P&gt;</description>
      <pubDate>Wed, 14 May 2025 16:14:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-there-a-databricks-spark-connector-for-java/m-p/119236#M46397</guid>
      <dc:creator>Shua42</dc:creator>
      <dc:date>2025-05-14T16:14:50Z</dc:date>
    </item>
    <item>
      <title>Re: Is there a Databricks spark connector for java?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-there-a-databricks-spark-connector-for-java/m-p/119296#M46398</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/154481"&gt;@Shua42&lt;/a&gt;.&lt;BR /&gt;&lt;BR /&gt;I am using the&amp;nbsp;&lt;SPAN&gt;JDBC URL of a running SQL warehouse to write data directly to a Databricks table from my local machine, but the write performance is poor. I tried adding `batchSize` and `numPartitions`, but the performance did not improve at all. Below is the snippet I am using.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="java"&gt;SparkSession spark = SparkSession.builder()
    .appName("JsonToDatabricksLocalJDBC")
    .master("local[*]")
    .config("spark.driver.memory", driverMemory)
    .config("spark.sql.warehouse.dir", "spark-warehouse-" + System.currentTimeMillis())
    .getOrCreate();

Dataset&amp;lt;Row&amp;gt; rawDf = spark.read().schema(expectedSchema).json(inputJsonPath);

Dataset&amp;lt;Row&amp;gt; transformedDf = rawDf.select(
        coalesce(col("app_name"), lit("UnknownApp")).alias("APP_NAME"),
        coalesce(col("event.event_name"), lit("UnknownEvent")).alias("EVENT_NAME"),
        coalesce(col("event.event_code"), lit("")).alias("EVENT_CODE"),
        to_json(col("event.event_attributes")).alias("EVENT_ATTRIBUTES"),
        to_json(col("event.user_attributes")).alias("USER_ATTRIBUTES"),
        to_json(col("event.device_attributes")).alias("DEVICE_ATTRIBUTES")
    );

Properties dfWriteJdbcProperties = new Properties();
dfWriteJdbcProperties.put("user", "token");
dfWriteJdbcProperties.put("password", dbToken);

transformedDf.write()
    .mode(SaveMode.Append)
    .option("batchsize", String.valueOf(jdbcBatchSize))
    .option("numPartitions", String.valueOf(jdbcNumPartitionsForWrite))
    .jdbc(jdbcUrl, fullTableNameInDb, dfWriteJdbcProperties);&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;Please suggest how I can improve the write performance, since I have to insert a large volume of data into the Databricks table. I have also attached a screenshot of the SQL warehouse configuration.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 15 May 2025 10:03:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-there-a-databricks-spark-connector-for-java/m-p/119296#M46398</guid>
      <dc:creator>I-am-Biplab</dc:creator>
      <dc:date>2025-05-15T10:03:15Z</dc:date>
    </item>
    <item>
      <title>Re: Is there a Databricks spark connector for java?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-there-a-databricks-spark-connector-for-java/m-p/119369#M46399</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/164381"&gt;@I-am-Biplab&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;If running locally, it is going to be difficult to tune the performance up that much, but there are a few things you can try:&lt;/P&gt;
&lt;P&gt;1. Up the partitions and batch size as much as your machine will allow. Also, calling repartition() before the write helps ensure the data is actually split across partitions.&lt;/P&gt;
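&lt;P&gt;For point 1, a minimal sketch reusing the variable names from your snippet (the partition count and batch size are illustrative; tune them to your machine and warehouse):&lt;/P&gt;
&lt;LI-CODE lang="java"&gt;transformedDf
    .repartition(8) // force an actual split so the JDBC write runs in parallel
    .write()
    .mode(SaveMode.Append)
    .option("batchsize", "10000")  // rows per JDBC batch insert
    .option("numPartitions", "8")  // cap on parallel JDBC connections
    .jdbc(jdbcUrl, fullTableNameInDb, dfWriteJdbcProperties);&lt;/LI-CODE&gt;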
&lt;P&gt;2. You can increase the SQL Warehouse size in case its throughput is the bottleneck: scale it up, write the data, and then scale it back down.&lt;/P&gt;
&lt;P&gt;3. If you're writing from files, you could copy the files from your local machine to a Volume using the CLI, and then process the data from there, which will allow more parallelized writes with Spark.&lt;/P&gt;
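&lt;P&gt;A rough sketch of option 3 (the catalog, schema, volume, and path names below are placeholders):&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;# Copy the local JSON files into a Unity Catalog Volume with the Databricks CLI
databricks fs cp ./events/ dbfs:/Volumes/main/default/staging/events/ --recursive&lt;/LI-CODE&gt;
&lt;P&gt;A notebook or job running in the workspace can then read from /Volumes/main/default/staging/events/ and write to the target table with full cluster parallelism, instead of pushing every row over JDBC from your laptop.&lt;/P&gt;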
&lt;P&gt;The bottleneck could also be your machine's configuration, so I would evaluate your architecture and see if there are any opportunities to stage the data in its original format in Databricks or cloud storage first, since JDBC isn't well suited for large-scale writes.&lt;/P&gt;</description>
      <pubDate>Thu, 15 May 2025 16:40:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-there-a-databricks-spark-connector-for-java/m-p/119369#M46399</guid>
      <dc:creator>Shua42</dc:creator>
      <dc:date>2025-05-15T16:40:18Z</dc:date>
    </item>
  </channel>
</rss>

