a month ago
In ingestion pipeline, for query based connector there is option of selecting batch snapshot instead of column name under dropdown - Cursor column. If I choose batch snapshot, will the databricks engine run select * from my source table, say Sql server as per job schedule. Or does it do something intelligent use some where clause and avoid selecting the whole table
a month ago
Hi @manish_de !
For query based lakeflow connect, DBKS is not doing CDC magic unless you provide a valid cursor column.
If you select batch snapshot instead of a cursor column, you should treat it as a full snapshot or full load each scheduled run.
In practice, DBKS needs to read the source result set again from SQL Server then apply or merge the result into the destination. It may not literally issue SELECT * in every case because it can project configured columns but conceptually it is scanning the source query without an incremental WHERE cursor > last_watermark filter.
When a cursor column is configured on each run DBKS retrieves rows where the cursor value is greater than the previously stored high water mark and if you do not select a monotonically increasing cursor column the connector performs a full load on each run.
Think of it like batch snapshot = full source snapshot per run.
a month ago
@amirabedhiafi I just wanted to double check, there will be no where clause in the sql sent everytime to my database for batch snapshot?
a month ago
Yes ! because with batch snapshot DBKS should be treated as doing a full snapshot read each run and won't automatically add WHERE to avoid scanning the whole source table.
a month ago
From what Iโve seen, choosing batch snapshot usually means the connector treats each run like a full snapshot load, so yes, it may execute something very close to a `SELECT *` on the source table unless the platform adds its own partitioning or optimization behind the scenes. If no cursor column or incremental filter is defined, there generally isnโt a smart `WHERE` clause to reduce reads, so checking the generated rbtv query logs in Databricks or SQL Server profiler would give the clearest answer.
3 weeks ago
@rbtv It may execute something very similar to a `SELECT *` on the source table unless the platform adds its own partitioning or optimisation behind the scenes. From what I've observed, selecting batch snapshot often means the connector handles each run like a full snapshot load. The most straightforward solution would be to examine the generated rbtv query logs in Databricks or SQL Server Profiler. If no cursor column or incremental filter is specified, there typically isn't a smart `WHERE` clause to reduce reads.