Thanks to everyone who joined the Hassle-Free Data Ingestion webinar. You can access the on-demand recording here.
We're sharing a subset of the phenomenal questions asked and answered throughout the session. You'll find Ingestion Q&A listed first, followed by some Delta Q&A. Please feel free to ask follow-up questions or add comments as threads.
TOPIC: Ingestion, including Auto Loader and COPY INTO
Q: Are there any out-of-the-box tools with plug-and-play transformations that are available from Databricks to build data ingestion pipelines?
That is what Auto Loader and COPY INTO provide; with a few lines of script, you can build an advanced ingestion pipeline!
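For illustration, here is a minimal Auto Loader sketch in Python for a Databricks notebook (where `spark` is predefined); the paths and the table name are placeholders, not a prescribed layout:

```python
# A minimal Auto Loader sketch; paths and table name are placeholders.
(spark.readStream
    .format("cloudFiles")                                 # Auto Loader source
    .option("cloudFiles.format", "json")                  # raw file format
    .option("cloudFiles.schemaLocation", "/tmp/schemas")  # where the inferred schema is tracked
    .load("/path/to/landing/zone")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints")
    .trigger(availableNow=True)                           # process new files, then stop
    .toTable("bronze_events"))
```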
Q: What is COPY INTO and Auto Loader? Why have both?
COPY INTO is SQL-only and batch-only; Auto Loader is Python/Scala and supports both streaming and batch, and it is also available in SQL within Delta Live Tables. COPY INTO is the simpler API. Both write to a Delta table, but for complex ingestion workloads we recommend Auto Loader. Read more about both in the blog, Getting Started With Ingestion into Delta Lake. Since Auto Loader runs in a Databricks notebook (or a Delta Live Tables pipeline), you'll write your script in Python, Scala, or SQL.
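For comparison, a hedged COPY INTO sketch, run from Python via spark.sql(); the table and path are placeholders, and the target Delta table is assumed to already exist. COPY INTO skips files it has already loaded, so re-running it is safe:

```python
# COPY INTO is idempotent: previously loaded files are skipped on re-run.
spark.sql("""
    COPY INTO bronze_events
    FROM '/path/to/landing/zone'
    FILEFORMAT = JSON
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```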
Q: Once a file is ingested, is the source file no longer needed for any rollback to an earlier point in time?
That is correct; rollback uses the Delta table's own transaction history, not the source file. Still, source files are good to keep around in case you ever need to reprocess them.
Q: Are there plans to support schema inference for XML files?
Yes. In the meantime, you can read XML as a string and parse it with any XML library, as in the sketch below.
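A hedged sketch of that approach, using Python's standard library; the path and the `<id>` element are hypothetical placeholders:

```python
import xml.etree.ElementTree as ET

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Read each XML file as one string row (the column is named "value").
raw = spark.read.text("/path/to/xml/files", wholetext=True)

@udf(returnType=StringType())
def extract_id(xml_doc):
    # Pull a single hypothetical <id> element; extend for real schemas.
    return ET.fromstring(xml_doc).findtext("id")

parsed = raw.withColumn("id", extract_id("value"))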
Q: Auto Loader's default inferred data type is always string. Can we give schema hints?
Yes! Learn more about Auto Loader Schema Inference and Evolution capabilities.
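A minimal schema-hints sketch; column names and paths are placeholders. Hinted columns get the pinned type, while everything else is still inferred:

```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas")
    .option("cloudFiles.schemaHints", "event_time TIMESTAMP, amount DOUBLE")
    .load("/path/to/landing/zone"))
```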
Q: Can change data be ingested directly from databases?
You would need to use a CDC tool such as AWS DMS; this blog has more details.
Q: Is there any concise list of data source connectors?
https://docs.databricks.com/data/data-sources/index.html
Q: Is there an interface for NiFi?
This talk might be interesting for you: Story Deduplication and Mutation.
Q: How is Azure Event Hubs supported for ingestion?
It's supported as a streaming source; see this doc for more info.
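As one hedged illustration, Event Hubs also exposes a Kafka-compatible endpoint, so Spark's built-in Kafka source can read from it; the namespace, event hub name, and connection string below are placeholders, and the shaded JAAS class name is the one used on Databricks runtimes:

```python
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "<event-hub-name>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
            f'required username="$ConnectionString" password="{connection_string}";')
    .load())
```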
Q: Are there any cookie-cutter templates available from Databricks for some common use cases?
We have solution accelerators you can follow!
Q: What support does Databricks provide for Hive until we migrate to Delta Lake?
Databricks supports external Hive metastores; details are in the docs. Please reach out to your account team for help with migration.
TOPIC: Delta
Q: Where does Delta get involved during ingestion?
Ingested data arrives in a raw format such as JSON or CSV and is written into a Delta table.
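A minimal batch sketch of that flow; the path and table name are placeholders:

```python
# Land raw JSON files in a bronze Delta table.
(spark.read.json("/path/to/landing/zone")
    .write.format("delta")
    .mode("append")
    .saveAsTable("bronze_raw"))
```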
Q: Is it ever easier to just delete and recreate your Delta table with every update, for example when the table is built from a pandas DataFrame?
Rather than recreating the table, you can easily convert your pandas DataFrame to a Spark DataFrame, save it as Delta, and benefit from Delta's ACID transactions.
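A hedged sketch of that pattern; the data and table name are placeholders. Overwriting is an atomic, transactional replace, and earlier versions of the table stay reachable via time travel:

```python
import pandas as pd

pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

(spark.createDataFrame(pdf)       # pandas -> Spark DataFrame
    .write.format("delta")
    .mode("overwrite")            # atomic replace, not drop-and-recreate
    .saveAsTable("my_table"))
```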
Q: Can I roll back or roll forward a Delta table using a Databricks notebook? Would that change be persistent for other Databricks users?
You can use RESTORE to roll back, and other users will see the change. Read more in the docs.
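For example, a hedged sketch run from Python; the table name and version number are placeholders:

```python
# RESTORE is committed to the transaction log, so all users see it.
spark.sql("DESCRIBE HISTORY my_table").show()           # find the version to return to
spark.sql("RESTORE TABLE my_table TO VERSION AS OF 5")  # or: TO TIMESTAMP AS OF '2021-01-01'
```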
Q: Can we delete partition-wise from Delta tables?
Yes, and you can also delete on a row-by-row basis in Delta.
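A hedged sketch of both; the table, partition column, and predicates are placeholders:

```python
# Both deletes are transactional.
spark.sql("DELETE FROM events WHERE event_date = '2021-01-01'")  # whole partition, if partitioned by event_date
spark.sql("DELETE FROM events WHERE user_id = 42")               # individual rows
```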
Q: Is it possible to separate compute for Delta and compute for Spark?
There is a standalone open-source reader/writer for Delta, Delta Standalone, that would allow you to separate the two.
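As another hedged illustration outside the JVM, the open-source delta-rs project ships Python bindings (the `deltalake` package, an assumption on our part, not something covered in the session) that can read a Delta table with no Spark cluster at all; the path is a placeholder:

```python
from deltalake import DeltaTable

dt = DeltaTable("/path/to/delta/table")
print(dt.version())   # current table version
pdf = dt.to_pandas()  # read the table with no Spark cluster involved
```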
Add your follow-up questions to threads!