Hi
You’ve already optimised the embedding side really nicely; batching in mapPartitions and creating one Azure client per partition is exactly what we recommend.
For 35k rows, if embedding is fast but the Delta write/commit is slow, it’s almost always due to:
too many small output files, and/or
extra passes over the DataFrame, and/or
a cluster that’s over-parallelised for the amount of data.
I would suggest looking into the following:
1. Control the number of output files
Spark’s default of spark.sql.shuffle.partitions = 200 means your createDataFrame(...).write can easily produce ~200 tiny files for just 35k rows. The overhead of creating those files and committing their metadata often dominates the runtime.
For a dataset of this size, you typically want a small number of files (1–4, maybe 8 max).
Key points:
Use coalesce(), not repartition(), right before the write.
coalesce(n) avoids a shuffle and just reduces the number of output partitions.
For 35k vectors, 1–4 files is absolutely fine and usually much faster to commit; see the sketch below.
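A minimal sketch of what that looks like; `embeddings_df` and the table name are placeholders for your own:

```python
# Assumes `embeddings_df` is the DataFrame coming out of your mapPartitions
# embedding step; the target table name is a placeholder.
(
    embeddings_df
    .coalesce(4)                      # narrow operation: no shuffle, just fewer output partitions/files
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("main.default.product_embeddings")
)
```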
2. Avoid double computation / extra actions
You mentioned using persist() to avoid recomputing when you call count(). That’s good. Two extra tips:
Persist after your embedding transform, not on the original DF.
Only trigger one action before the final write (e.g. count(), or display() for debugging). Don’t call count(), show(), and then write without persisting, or Spark will recompute the whole pipeline multiple times. The ordering looks roughly like the sketch below.
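Here `embed_texts` stands in for your existing mapPartitions-based transform and the names are placeholders:

```python
from pyspark import StorageLevel

embedded_df = embed_texts(source_df)               # your existing mapPartitions embedding transform
embedded_df.persist(StorageLevel.MEMORY_AND_DISK)  # cache the *embedded* result, not the raw input

n = embedded_df.count()                            # the single action before the write
print(f"embedded {n} rows")

(
    embedded_df
    .coalesce(4)
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("main.default.product_embeddings")
)

embedded_df.unpersist()                            # free the cache once the table is written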
3. Tuning cluster size & IO for this workload
For 35k rows of 3072-dim embeddings:
You don’t need a huge cluster.
Too many workers mean too many tiny output tasks and more small files.
Often a small cluster (e.g. 1–2 workers with decent memory) is faster end-to-end than a large autoscaling cluster for this kind of “wide but not huge” dataset.
Make sure you’re writing to a performant storage account (Premium / general purpose v2). In most managed Databricks setups, DBFS is already backed by appropriate storage, so usually this is fine.
If you see a lot of tiny tasks in the Spark UI, that’s a sign to:
lower spark.sql.shuffle.partitions for the job (e.g. 32 or even 8 at this size; see the snippet after this list), and/or
coalesce before writing as shown above.
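Scoping the setting to the notebook/job session is usually enough; the exact value is a guess you’d confirm in the Spark UI:

```python
# Lower the shuffle parallelism for this session only.
spark.conf.set("spark.sql.shuffle.partitions", "8")

# ... run the embedding + write here ...

# Optionally restore the default afterwards if the same session does other, larger work.
spark.conf.set("spark.sql.shuffle.partitions", "200")
```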
4. Data layout & Delta options (Stitch / OPTIMIZE)
For a one-off creation of 35k embeddings, the Delta housekeeping features (Stitch / OPTIMIZE) usually aren’t needed for performance of the write itself, but they matter if:
you will repeatedly append to this table,
you will query it a lot (e.g. for vector search candidates), or
you accidentally created many small files in early runs.
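If any of those apply, a one-off compaction afterwards is cheap; the table name (and the optional ZORDER column) below are placeholders:

```python
# Compact small files after repeated appends or messy early runs.
spark.sql("OPTIMIZE main.default.product_embeddings")

# If you later filter on a column (e.g. a document id), ZORDER can help reads:
# spark.sql("OPTIMIZE main.default.product_embeddings ZORDER BY (doc_id)")
```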
5. Data type choice for embeddings
You’re using ArrayType(FloatType()), which is fine. A few extra notes:
If you’re on a Databricks runtime that supports the VECTOR type (for native vector search), consider storing as VECTOR(3072) – it doesn’t massively change write speed, but it’s the recommended long-term format for similarity search.
If you stick to arrays, make sure the schema is stable between runs (same type, same dimension). Schema evolution (new columns or type changes) can add extra overhead due to Delta metadata handling.
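One way to keep the schema stable is to pin it explicitly when building the DataFrame, rather than relying on inference; the column names here are illustrative:

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, ArrayType, FloatType
)

# Explicit schema so every run writes identical types and dimensions.
embedding_schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("text", StringType(), nullable=True),
    StructField("embedding", ArrayType(FloatType()), nullable=False),  # 3072-dim vectors
])

# `embedded_rows` is whatever collection/RDD of rows your embedding step produces (placeholder).
embedded_df = spark.createDataFrame(embedded_rows, schema=embedding_schema)
```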