Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Is there a Databricks spark connector for java?

I-am-Biplab
New Contributor II

Is there a Databricks Spark connector for Java, like the one available for Snowflake? (Reference: Snowflake Spark connector - https://docs.snowflake.com/en/user-guide/spark-connector-use)

Essentially, the use case is to transfer data from S3 to a Databricks table. In the current implementation, I am using Spark to read data from S3 and JDBC to write data to Databricks. However, I would like to use Spark to write the data to Databricks as well.

3 REPLIES

Shua42
Databricks Employee

Hey there @I-am-Biplab ,

I'm a bit confused about the ask here. I'm assuming your code isn't running on a Databricks cluster, in which case you can use the JDBC URL of a running SQL warehouse to write data directly to a Databricks table. See the example code below:

// df is assumed to be the Dataset<Row> you already read from S3
Properties connectionProperties = new Properties();
connectionProperties.setProperty("user", "token");
connectionProperties.setProperty("password", "<DATABRICKS_PERSONAL_ACCESS_TOKEN>");

String jdbcUrl = "jdbc:databricks://<workspace-hostname>:443/default;transportMode=http;ssl=1;httpPath=<sql-warehouse-http-path>";

// Write to Databricks table
df.write()
    .mode("append") // or "overwrite"
    .jdbc(jdbcUrl, "your_table_name", connectionProperties);

If you are running on a Databricks cluster, you should be able to write directly to a table with:

Dataset<Row> df = spark.read()
    .format("parquet") // or "csv", "json", etc., depending on your data format
    .load("s3a://your-bucket/path/to/data");

df.write()
    .format("delta")
    .mode("append") // or "overwrite" as per your requirement
    .saveAsTable("your_catalog.your_schema.your_table");

 Let me know if I'm understanding your ask correctly.

I-am-Biplab
New Contributor II

Thanks @Shua42.

I am using the JDBC URL of a running SQL warehouse to write data directly to a Databricks table from my local machine, but the write performance is poor. I tried adding `batchsize` and `numPartitions`, but performance did not improve at all. Below is the snippet I am using.

spark = SparkSession.builder()
    .appName("JsonToDatabricksLocalJDBC")
    .master("local[*]")
    .config("spark.driver.memory", driverMemory)
    .config("spark.sql.warehouse.dir", "spark-warehouse-" + System.currentTimeMillis())
    .getOrCreate();

Dataset<Row> rawDf = spark.read().schema(expectedSchema).json(inputJsonPath);

Dataset<Row> transformedDf = rawDf.select(
        coalesce(col("app_name"), lit("UnknownApp")).alias("APP_NAME"),
        coalesce(col("event.event_name"), lit("UnknownEvent")).alias("EVENT_NAME"),
        coalesce(col("event.event_code"), lit("")).alias("EVENT_CODE"),
        to_json(col("event.event_attributes")).alias("EVENT_ATTRIBUTES"),
        to_json(col("event.user_attributes")).alias("USER_ATTRIBUTES"),
        to_json(col("event.device_attributes")).alias("DEVICE_ATTRIBUTES")
    );

Properties dfWriteJdbcProperties = new Properties();
dfWriteJdbcProperties.put("user", "token");
dfWriteJdbcProperties.put("password", dbToken);

transformedDf.write()
    .mode(SaveMode.Append)
    .option("batchsize", String.valueOf(jdbcBatchSize))
    .option("numPartitions", String.valueOf(jdbcNumPartitionsForWrite))
    .jdbc(jdbcUrl, fullTableNameInDb, dfWriteJdbcProperties);

Please suggest how I can improve the performance, since I have to insert a large volume of data into the Databricks table. I have also attached a screenshot of the SQL warehouse.

Shua42
Databricks Employee

Hey @I-am-Biplab ,

If you're running locally, it is going to be difficult to improve the performance that much, but there are a few things you can try:

1. Increase the partitions and batch size as much as your machine will allow. Also, calling repartition() can help ensure the data is actually split across partitions before the write (see the sketch after this list).

2. Increase the SQL warehouse size in case its throughput is the bottleneck. You could scale it up, write the data, and then scale it back down.

3. If you're writing from files, you could copy the files from your local machine to a Volume using the Databricks CLI, and then process the data from there, which allows more parallelized writes with Spark.
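For point 1, here's a minimal sketch of what the tuned write could look like. It reuses the `transformedDf`, `jdbcUrl`, `fullTableNameInDb`, and `dfWriteJdbcProperties` from your snippet, and the partition count and batch size are placeholders you would size to your machine and warehouse:

// Sketch only: reuses the variables from your snippet above.
// writePartitions and the batch size are placeholders to tune.
int writePartitions = 8; // roughly the parallelism your machine/network can sustain

transformedDf
    .repartition(writePartitions) // ensure the data is actually split before the write
    .write()
    .mode(SaveMode.Append)
    .option("batchsize", "10000") // larger batches mean fewer round trips per partition
    .option("numPartitions", String.valueOf(writePartitions))
    .jdbc(jdbcUrl, fullTableNameInDb, dfWriteJdbcProperties);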

The bottleneck could be your machine configs, so I would also evaluate your architecture and see if there are any opportunities to stage the data in its original format in Databricks or cloud storage first, since JDBC isn't best suited for large-scale writes.
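If you go the Volume route from point 3, a rough sketch of the cluster-side job could look like the below. The catalog, schema, volume, and table names are placeholders, and it assumes the same expectedSchema you use locally:

// Rough sketch, running as a Spark job on the Databricks cluster after the
// files have been copied to a Unity Catalog Volume (names are placeholders):
Dataset<Row> stagedDf = spark.read()
    .schema(expectedSchema)
    .json("/Volumes/your_catalog/your_schema/your_volume/raw_json/");

stagedDf.write()
    .format("delta")
    .mode("append")
    .saveAsTable("your_catalog.your_schema.your_table");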
