Hi there,
my company is reasonably new to using Databricks, and we're running our first PoCs. Some of the data we have is structured, or reasonably structured, so it drops into a bucket, we point a notebook at it, and all is well and Delta.
The problem is arising with some of the more complex data sources that have been developed over the years, often designed to work with specialist engineering software. Typically these arrive as a zip file containing a bunch of files in all kinds of formats and shapes. It's quite common to have "csv" files that actually hold many tables (basically a giant print output), or genuine CSV files that have been heavily pivoted so that almost every column name, and even the number of columns, varies between files - all depending on the inputs, which are also provided, but again in a complex file format.
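To make that concrete, here's a rough sketch of the kind of thing we end up writing to carve one of those "print output" files into separate tables. The *TABLE marker and the encoding are made up for illustration; the real markers vary from tool to tool:

```python
import io
import pandas as pd

def split_report(path):
    """Split one multi-table 'print output' file into separate DataFrames.

    Assumes each table starts with a line like '*TABLE <name>' and runs
    until the next marker or end of file - purely illustrative, the real
    markers and encodings differ per tool.
    """
    tables, current, name = {}, [], None
    with open(path, encoding="latin-1") as fh:
        for line in fh:
            if line.startswith("*TABLE"):
                if name and current:
                    tables[name] = pd.read_csv(io.StringIO("".join(current)))
                name, current = line.split()[1], []
            elif name:
                current.append(line)
    if name and current:
        tables[name] = pd.read_csv(io.StringIO("".join(current)))
    return tables
```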
So far, so normal - all of this can be parsed out with Python and patience, or occasionally an exe is provided that converts some of the raw files into JSON etc. The question is one of hosting, and there's a debate internally which I'd like to open up here:
1. Treat this as complex, non-standard processing: wrap it up in a container or other process and run it in advance of Databricks, extracting the required data and placing it in a DataFrame-friendly format in cloud storage, ready to be read into a Live Table etc. (see the first sketch after this list). This has the advantage of scaling out with the number of files that arrive and of coping with weird dependencies, but it requires extra custom infrastructure.
2. Databricks has a Python runtime on the cluster, so it can run any scripts we give it (see the second sketch after this list). This has the advantage of not needing extra infrastructure such as containers to be deployed, which matters because there may be a growing number of these scenarios and we don't want to manage that if we don't have to. However, since this is plain Python rather than PySpark, no RDDs are created, so the process won't scale out across autoscaling executors and is limited to whatever parallelisation we can squeeze out of the driver node. And that's before considering more esoteric dependencies such as custom EXEs for parsing.
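For option 1, the Databricks side would then be fairly boring - roughly something like this, with Auto Loader picking up whatever the container lands (the bucket path and table name are placeholders, and I'm assuming the container writes Parquet):

```python
import dlt
from pyspark.sql import functions as F

# Placeholder: wherever the external container drops its cleaned-up output
LANDING_PATH = "s3://our-bucket/engineering/cleaned/"

@dlt.table(comment="Pre-parsed engineering output landed by the external container")
def engineering_results_raw():
    return (
        spark.readStream.format("cloudFiles")        # Auto Loader picks up new files incrementally
        .option("cloudFiles.format", "parquet")      # assumes the container writes Parquet
        .load(LANDING_PATH)
        .withColumn("_ingested_at", F.current_timestamp())
    )
```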
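For option 2, I'm picturing something along these lines - all of the real work happens in a plain Python loop on the driver, and Spark only gets involved right at the end to persist the result as Delta (the volume path, parser module and table name are placeholders for our own code):

```python
import glob
import zipfile
import pandas as pd

from my_parsers import parse_member   # placeholder: our custom parsing code, installed on the cluster as a wheel

RAW_DIR = "/Volumes/poc/engineering/raw_zips"   # placeholder: a Unity Catalog volume where the zips land

frames = []
for zip_path in glob.glob(f"{RAW_DIR}/*.zip"):   # plain Python loop - all of this runs on the driver only
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            with zf.open(member) as fh:
                frames.append(parse_member(member, fh))   # placeholder: returns one pandas DataFrame per member file

# Spark is only used here, to write the combined result out as a Delta table
combined = pd.concat(frames, ignore_index=True)
spark.createDataFrame(combined).write.mode("append").saveAsTable("poc.engineering.results")
```

That works for a handful of zips, but it's exactly the driver-only bottleneck described in point 2.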
Has anyone had similar problems that they've solved in a way that scales without bloating the infrastructure? I have experience of Spark, but I'm still relatively new to Databricks, so there may be suitable tools available that I'm not aware of.
thanks
Toby