Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Is there a Databricks spark connector for java?

I-am-Biplab
New Contributor II

Is there a Databricks Spark connector for Java, like the one available for Snowflake? (Reference: Snowflake Spark connector - https://docs.snowflake.com/en/user-guide/spark-connector-use)

Essentially, the use case is to transfer data from S3 to a Databricks table. In the current implementation, I am using Spark to read data from S3 and JDBC to write data to Databricks. However, I would like to use Spark to write the data to Databricks as well.

3 REPLIES

Shua42
Databricks Employee

Hey there @I-am-Biplab ,

I'm a bit confused about the ask here. I'm assuming your code isn't running on a Databricks cluster, in which case you can use the JDBC URL of a running SQL warehouse to write data directly to a Databricks table. See the example code below:

// df is assumed to be the Dataset<Row> you already read from S3
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "token");
connectionProperties.setProperty("password", "<DATABRICKS_PERSONAL_ACCESS_TOKEN>");

String jdbcUrl = "jdbc:databricks://<workspace-hostname>:443/default;transportMode=http;ssl=1;httpPath=<sql-warehouse-http-path>";

// Write to Databricks table
df.write()
    .mode("append") // or "overwrite"
    .jdbc(jdbcUrl, "your_table_name", connectionProperties);

If you are running on a Databricks cluster, you should be able to write directly to a table with:

Dataset<Row> df = spark.read()
    .format("parquet") // or "csv", "json", etc., depending on your data format
    .load("s3a://your-bucket/path/to/data");

df.write()
    .format("delta")
    .mode("append") // or "overwrite" as per your requirement
    .saveAsTable("your_catalog.your_schema.your_table");

 Let me know if I'm understanding your ask correctly.

I-am-Biplab
New Contributor II

Thanks @Shua42.

I am using the JDBC URL of a running SQL warehouse to write data directly to a Databricks table from my local machine, but the write performance is poor. I tried adding `batchsize` and `numPartitions`, but performance did not improve at all. Below is the snippet I am using.

spark = SparkSession.builder()
    .appName("JsonToDatabricksLocalJDBC")
    .master("local[*]")
    .config("spark.driver.memory", driverMemory)
    .config("spark.sql.warehouse.dir", "spark-warehouse-" + System.currentTimeMillis())
    .getOrCreate();

Dataset<Row> rawDf = spark.read().schema(expectedSchema).json(inputJsonPath);

Dataset<Row> transformedDf = rawDf.select(
        coalesce(col("app_name"), lit("UnknownApp")).alias("APP_NAME"),
        coalesce(col("event.event_name"), lit("UnknownEvent")).alias("EVENT_NAME"),
        coalesce(col("event.event_code"), lit("")).alias("EVENT_CODE"),
        to_json(col("event.event_attributes")).alias("EVENT_ATTRIBUTES"),
        to_json(col("event.user_attributes")).alias("USER_ATTRIBUTES"),
        to_json(col("event.device_attributes")).alias("DEVICE_ATTRIBUTES")
    );

Properties dfWriteJdbcProperties = new Properties();
dfWriteJdbcProperties.put("user", "token");
dfWriteJdbcProperties.put("password", dbToken);

transformedDf.write()
    .mode(SaveMode.Append)
    .option("batchsize", String.valueOf(jdbcBatchSize))
    .option("numPartitions", String.valueOf(jdbcNumPartitionsForWrite))
    .jdbc(jdbcUrl, fullTableNameInDb, dfWriteJdbcProperties);

Please suggest how I can improve the performance, since I have to insert a large volume of data into the Databricks table. I have also attached a screenshot of the SQL warehouse.

Shua42
Databricks Employee

Hey @I-am-Biplab ,

If you're running locally, it is going to be difficult to improve the performance that much, but there are a few things you can try:

1. Increase the partitions and batch size as much as your machine will allow. Also, calling repartition() can help ensure the data is actually split across partitions before the write (see the sketch after this list).

2. Increase the SQL warehouse size in case its throughput is the bottleneck. You could scale it up, write the data, and then scale it back down.

3. If you're writing from files, you could copy the files from your local machine to a Volume using the Databricks CLI, and then process the data from there, which allows more parallelized writes with Spark.
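For point 1, here's a minimal sketch of what the tuned write could look like. It reuses the `transformedDf`, `jdbcUrl`, `fullTableNameInDb`, and `dfWriteJdbcProperties` from your snippet, and the partition count and batch size are placeholders you would size to your machine and warehouse:

// Sketch only: reuses the variables from your snippet above.
// writePartitions and the batch size are placeholders to tune.
int writePartitions = 8; // roughly the parallelism your machine/network can sustain

transformedDf
    .repartition(writePartitions) // ensure the data is actually split before the write
    .write()
    .mode(SaveMode.Append)
    .option("batchsize", "10000") // larger batches mean fewer round trips per partition
    .option("numPartitions", String.valueOf(writePartitions))
    .jdbc(jdbcUrl, fullTableNameInDb, dfWriteJdbcProperties);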

The bottleneck could be your machine configs, so I would also evaluate your architecture and see if there are any opportunities to stage the data in its original format in Databricks or cloud storage first, since JDBC isn't best suited for large-scale writes.
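If you go the Volume route from point 3, a rough sketch of the cluster-side job could look like the below. The catalog, schema, volume, and table names are placeholders, and it assumes the same expectedSchema you use locally:

// Rough sketch, running as a Spark job on the Databricks cluster after the
// files have been copied to a Unity Catalog Volume (names are placeholders):
Dataset<Row> stagedDf = spark.read()
    .schema(expectedSchema)
    .json("/Volumes/your_catalog/your_schema/your_volume/raw_json/");

stagedDf.write()
    .format("delta")
    .mode("append")
    .saveAsTable("your_catalog.your_schema.your_table");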
