3 weeks ago
Hello,
I was wondering if there is sample code showing how a Java program might leverage Databricks Connect to query a table in the Free Edition of Databricks?
I would like to use Connect because I am trying to avoid JDBC and its overhead; I thought I might do better by creating DataFrames and then leveraging Connect to write them to Databricks as Parquet-based files. I note that Databricks Connect claims to support Java in some places, but the documentation focuses on... Python, R and Scala.
https://docs.databricks.com/aws/en/dev-tools/databricks-connect/
I saw there used to be a standalone... which is what I believe I wanted, but it looks like it has been deprecated.
I am new to Databricks and its concepts but am familiar with Iceberg (where I'd simply use the Iceberg Java APIs, leveraging a file appender and then the Catalog API to register my Parquet files with the manifest). What is the equivalent here for writing out Parquet directly in parallel and then registering it? (Presumably leveraging their Spark compute to do it.)
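For concreteness, this is roughly the shape of what I do with Iceberg today (a simplified sketch, not my exact code): the appender reports the file size and record count when the file closes, so registering the file never reopens it.
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// catalog is whatever Catalog implementation I'm pointed at (Hadoop, Hive, REST, ...)
Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

// sizeInBytes / recordCount come back from the file appender at write time,
// so registration is metadata-only
DataFile file = DataFiles.builder(table.spec())
    .withPath("s3://bucket/path/data-00001.parquet")
    .withFormat(FileFormat.PARQUET)
    .withFileSizeInBytes(sizeInBytes)
    .withRecordCount(recordCount)
    .build();

table.newAppend().appendFile(file).commit();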
3 weeks ago
Hi — welcome to Databricks! Unfortunately, Databricks Connect v2 (DBR 13.3+) does not support Java — it only supports Python, Scala, and R. The legacy v1 did support Java, but it has been deprecated and is past end of support.
That said, here are your options as a Java developer:
Option 1: Call the Databricks Connect Scala API from Java. Since Scala runs on the JVM, you can call the Databricks Connect Scala APIs from Java, which gives you full DataFrame read/write support:
// Scala — callable from Java via JVM interop
import com.databricks.connect.DatabricksSession
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val spark = DatabricksSession.builder().getOrCreate()

// Read a sample table and show a few rows
val trips = spark.read.table("samples.nyctaxi.trips")
trips.limit(5).show()

// Create and write your own DataFrame
// (Spark Connect has no SparkContext/RDDs, so build it from a local java.util.List)
val schema = StructType(Seq(
  StructField("id", IntegerType, false),
  StructField("name", StringType, false)
))
val rows = java.util.Arrays.asList(Row(1, "Alice"), Row(2, "Bob"))
val df = spark.createDataFrame(rows, schema)
df.write.saveAsTable("my_catalog.my_schema.my_table")
Add the Maven dependency:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>databricks-connect</artifactId>
<version>15.4.0</version> <!-- match your DBR version -->
</dependency>
See: Databricks Connect Scala Examples
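For reference, here is a minimal sketch of what that interop can look like from plain Java, using the same databricks-connect dependency. Caveats: depending on how the Scala companion object compiles, you may need to reach it via DatabricksSession$.MODULE$ rather than a static builder() call, and the catalog/schema/table names below are placeholders.
import com.databricks.connect.DatabricksSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConnectFromJava {
    public static void main(String[] args) {
        // Picks up workspace/compute config from ~/.databrickscfg or DATABRICKS_* env vars
        SparkSession spark = DatabricksSession.builder().getOrCreate();

        // The session exposes the usual Java-friendly Dataset<Row> API
        Dataset<Row> trips = spark.read().table("samples.nyctaxi.trips");
        trips.limit(5).show();

        // Writes go through the same DataFrameWriter API
        trips.limit(100).write().mode("overwrite")
             .saveAsTable("my_catalog.my_schema.trips_sample");
    }
}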
Option 2: Databricks SDK for Java. If you want to stay in pure Java, the Databricks SDK for Java lets you upload files to Unity Catalog Volumes and execute SQL statements (among other workspace operations). This is closer to the Iceberg pattern you described (write files, then register them):
import com.databricks.sdk.WorkspaceClient;
import java.io.FileInputStream;
import java.io.InputStream;

WorkspaceClient w = new WorkspaceClient();  // auth from env vars / .databrickscfg
// Upload a Parquet file to a Unity Catalog Volume
try (InputStream in = new FileInputStream("data.parquet")) {
    w.files().upload("/Volumes/my_catalog/my_schema/my_volume/data.parquet", in);
}
// Then run SQL to register the uploaded file as/into a table
// (via the Statement Execution API or JDBC — see the sketch after the Maven dependency below)
Maven dependency:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>databricks-sdk-java</artifactId>
<version>0.2.0</version> <!-- use latest from Maven Central -->
</dependency>
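To finish the SDK route, here is a rough sketch of the "register it with SQL" step via the Statement Execution API. The warehouse ID and catalog/schema/table names are placeholders, and the request class/setter names follow the SDK's generated API, so double-check them against the SDK version you pull in:
import com.databricks.sdk.WorkspaceClient;
import com.databricks.sdk.service.sql.ExecuteStatementRequest;

WorkspaceClient w = new WorkspaceClient();

// COPY INTO reads the uploaded Parquet file(s) from the Volume and appends them
// to the target table; Databricks maintains the table metadata for you.
String sql =
    "COPY INTO my_catalog.my_schema.my_table " +
    "FROM '/Volumes/my_catalog/my_schema/my_volume/' " +
    "FILEFORMAT = PARQUET";

// Runs on a SQL warehouse; poll the returned statement status if you need to
// wait for completion beyond the synchronous timeout.
w.statementExecution().executeStatement(
    new ExecuteStatementRequest()
        .setWarehouseId("<your-sql-warehouse-id>")   // placeholder
        .setStatement(sql)
        .setWaitTimeout("30s"));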
Option 3: JDBC. I know you want to avoid JDBC, but it's worth noting that Databricks JDBC supports Arrow-based bulk ingestion, which significantly reduces the overhead compared to traditional row-by-row JDBC inserts. It may be faster than you expect.
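If you do end up benchmarking the JDBC path, batched PreparedStatement inserts are the baseline to compare against. A minimal sketch — batching is standard JDBC rather than anything Databricks-specific, and the URL, httpPath, token, and table names below are placeholders for your workspace:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JdbcBatchInsert {
    public static void main(String[] args) throws Exception {
        // Placeholder URL — fill in your workspace host and SQL warehouse http path
        String url = "jdbc:databricks://<workspace-host>:443;httpPath=<sql-warehouse-http-path>";

        try (Connection conn = DriverManager.getConnection(url, "token", "<personal-access-token>");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO my_catalog.my_schema.my_table (id, name) VALUES (?, ?)")) {

            // Batch rows instead of paying one network round trip per insert
            for (int i = 0; i < 1000; i++) {
                ps.setInt(1, i);
                ps.setString(2, "name-" + i);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}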
Databricks Connect requires a cluster or serverless compute with Spark Connect enabled. The Free Edition (Community Edition) has limited compute options, so Databricks Connect may not work there. The SDK + SQL approach (Option 2) or JDBC (Option 3) are more likely to work on the free tier.
Docs: https://docs.databricks.com/aws/en/dev-tools/databricks-connect/
Hope that helps point you in the right direction!
3 weeks ago
Thanks for the great answer!
I am looking for the most performant solution. If I choose the SDK method of writing the data, will that then have to read information about the files when I register? Or does the SDK that writes keep track of the metadata it needs when the files are committed? I.e., I don't want to have to treat the files as if they are generic files on S3 and then register them. With Iceberg, the file appender I use to write keeps track of the metadata I need to register with, so I don't need to open the files to register them.
If I use Connect, I assume that's handled. Which is the faster option? I want to be able to write out several files for a given table at the same time (if I have a lot of insert data) and then register those files together. With Iceberg I find I can almost triple performance when I have the right kind of target file system and do this (basically intra-table parallelism on write).