Hi
You've optimised the embedding side really nicely already; batching in mapPartitions and creating one Azure client per partition is exactly what we recommend.
For 35k rows, if embedding is fast but the Delta write/commit is slow, it's almost always due to:
- too many small output files, and/or
- extra passes over the DataFrame, and/or
- a cluster that's over-parallelised for the amount of data.
I would suggest looking into the following:
1. Control the number of output files
By default Spark uses spark.sql.shuffle.partitions = 200, which means your createDataFrame(...).write can easily produce ~200 tiny files for just 35k rows. The overhead of creating those files and committing their metadata often dominates the runtime.
For a dataset of this size, you typically want a small number of files (1–4, maybe 8 max).
Key points (see the sketch after this list):
- Use coalesce(), not repartition(), right before the write.
- coalesce(n) avoids a shuffle and just reduces the number of output partitions.
- For 35k vectors, 1–4 files is absolutely fine and usually much faster to commit.
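For example, a minimal sketch of the write step, assuming embeddings_df is the DataFrame produced by your embedding step and the table name is a placeholder:

```python
# Sketch only: embeddings_df and the table name are placeholders for your pipeline.
(
    embeddings_df
    .coalesce(4)                      # no shuffle, just fewer output partitions/files
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("main.default.doc_embeddings")  # or .save("/some/delta/path")
)
```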
2. Avoid double computation / extra actions
You mentioned using persist() to avoid recomputing when you call count(). That's good. Two extra tips:
- Persist after your embedding transform, not on the original DF.
- Only trigger one action before the final write (e.g. count() or maybe display() for debugging). Don't call count(), show(), and then write without persistence, or Spark will recompute the whole pipeline multiple times. A minimal pattern is sketched below.
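Here is a sketch of that pattern, assuming embedded_rdd is the output of your mapPartitions step and embedding_schema is your ArrayType(FloatType()) schema (all names are placeholders):

```python
# Sketch only: embedded_rdd, embedding_schema and the table name are placeholders.
embeddings_df = spark.createDataFrame(embedded_rdd, schema=embedding_schema)

# Cache the *transformed* DataFrame so the Azure embedding calls run only once.
embeddings_df.persist()

# Exactly one action before the write; this also materialises the cache.
print(f"rows to write: {embeddings_df.count()}")

(
    embeddings_df
    .coalesce(4)
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("main.default.doc_embeddings")
)

# Release the cache once the table is written.
embeddings_df.unpersist()
```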
3. Tuning cluster size & IO for this workload
For 35k rows of 3072-dim embeddings:
- You don't need a huge cluster.
- Too many workers means too many tiny output tasks and more small files.
- A small cluster (e.g. 1–2 workers with decent memory) is often faster end-to-end than a large autoscaling cluster for this kind of "wide but not huge" dataset.
- Make sure you're writing to a performant storage account (Premium or general purpose v2). In most managed Databricks setups, DBFS is already backed by appropriate storage, so this is usually fine.
If you see a lot of tiny tasks in the Spark UI, that's a sign to:
- lower spark.sql.shuffle.partitions for the job (e.g. 32 or even 8 for this size, as in the sketch below), and/or
- coalesce before writing as shown above.
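For example (the exact number is a judgement call for your data size):

```python
# Sketch: lower the shuffle partition count for this job before building the DataFrame.
# The default of 200 mostly produces tiny tasks and tiny files at this scale.
spark.conf.set("spark.sql.shuffle.partitions", "16")
```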
4. Data layout & Delta options (Stitch / OPTIMIZE)
For a one-off creation of 35k embeddings, the Delta housekeeping features (Stitch / OPTIMIZE) usually aren't needed for the performance of the write itself, but they matter if:
- you will repeatedly append to this table,
- you will query it a lot (e.g. for vector search candidates), or
- you accidentally created many small files in early runs (see the compaction sketch below).
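If earlier runs did leave lots of small files behind, a one-off compaction looks like this sketch (table name is a placeholder; add a ZORDER clause only if you regularly filter on a column):

```python
# Sketch: compact small files left behind by earlier runs (table name is a placeholder).
spark.sql("OPTIMIZE main.default.doc_embeddings")

# Optionally remove old, unreferenced files afterwards (default retention period applies).
spark.sql("VACUUM main.default.doc_embeddings")
```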
5. Data type choice for embeddings
You're using ArrayType(FloatType()), which is fine. A few extra notes:
- If you're on a Databricks runtime that supports the VECTOR type (for native vector search), consider storing as VECTOR(3072); it doesn't massively change write speed, but it's the recommended long-term format for similarity search.
- If you stick to arrays, make sure the schema is stable between runs (same type, same dimension). Schema evolution (new columns or type changes) can add extra overhead due to Delta metadata handling; a fixed-schema sketch follows below.
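A minimal sketch of a fixed schema you can reuse on every run (column names are placeholders):

```python
from pyspark.sql.types import ArrayType, FloatType, StringType, StructField, StructType

# Sketch: declare the schema once and reuse it for every run, so the column types
# (and, by convention, the 3072-dim array) never drift and Delta sees no schema change.
embedding_schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("content", StringType(), nullable=True),
    StructField("embedding", ArrayType(FloatType()), nullable=False),
])
```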