Data Engineering

Ingesting complex/unstructured data

tobyevans
New Contributor II

Hi there,

My company is reasonably new to Databricks, and we're running our first PoCs. Some of our data is structured, or reasonably structured, so it drops into a bucket, we point a notebook at it, and all is well and Delta.
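
To make that happy path concrete, here is a minimal sketch of what it looks like for us, assuming Auto Loader over CSV landing files; the bucket paths and table name are placeholders rather than our real ones:

```python
# Stream newly arrived CSV files from the landing bucket into a Delta table.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/orders")  # placeholder
    .load("s3://example-bucket/landing/orders/")  # placeholder
)

(
    df.writeStream
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/orders")  # placeholder
    .trigger(availableNow=True)
    .toTable("bronze.orders")  # placeholder table
)
```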

The problem arises with some of the more complex data sources, which have been developed over the years and are often designed to work with specialist engineering software. Typically, the data arrives as a zip file containing a bunch of files in all kinds of formats and shapes. It's quite common to get "csv" files that actually hold many tables (basically a giant print output), or genuine CSV files that have been heavily pivoted so that almost every column name, and the number of columns, varies between files - all depending on the inputs, which are provided but again come in a complex file format.
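
To give a flavour of the parsing involved, here is a rough Python sketch of splitting one of those multi-table "csv" files. It assumes, purely hypothetically, that tables are separated by blank lines and that each block starts with its own header row; the real layouts vary far more than this:

```python
import csv
import io

def split_report(text: str):
    """Split a 'giant print output' file into separate tables.
    Hypothetical assumption: tables are separated by blank lines and each
    block starts with its own header row; real layouts will differ."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    tables = []
    for block in blocks:
        rows = list(csv.reader(io.StringIO("\n".join(block))))
        tables.append((rows[0], rows[1:]))  # (header row, data rows)
    return tables
```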

So far, so normal: all of this can be parsed out with Python and patience, or occasionally an exe is provided that converts some of the raw files into JSON etc. The question is one of hosting, and there's an internal debate I'd like to extend to here:

1. Treat this as complex, non-standard processing: wrap it up in a container or other process and run it ahead of Databricks, extracting the required data and landing it in a DataFrame-friendly format in cloud storage, ready to be read into a Live Table etc. This has the advantage of scaling out with the number of files that arrive, and it can handle weird dependencies, but it requires extra custom infrastructure.

2. Databricks has a Python runtime on the cluster, so it can run any scripts given to it. This has the advantage of not requiring extra infrastructure such as containers to be deployed, especially since the number of these scenarios may grow and we don't want to manage that if we don't have to. However, since this is plain Python rather than PySpark, no RDDs are created, so the process won't scale out across autoscaling executors and is limited to whatever parallelism we can squeeze out of the driver node (one possible way round that is sketched after this list). It also doesn't account for more esoteric dependencies such as custom EXEs used for parsing.
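
To illustrate the kind of middle ground I'm wondering about for option 2 (very much a hedged sketch, not something we've settled on): the existing plain-Python parser could be fanned out across the executors with mapInPandas. Here parse_weird_zip, the paths, the output schema and the table name are all hypothetical placeholders:

```python
import pandas as pd

def parse_weird_zip(path: str) -> list[dict]:
    # Hypothetical stand-in for the existing pure-Python parser; in reality it
    # might unzip the file, walk its contents, or shell out to a vendor exe.
    return [{"source": path, "metric": "example", "value": 1.0}]

# One row per raw file: list paths only, so the heavy parsing happens per file.
files = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.zip")
    .load("s3://example-bucket/landing/engineering/")  # placeholder path
    .select("path")
)

def parse_partition(batches):
    # Runs on the executors; each pandas batch holds a slice of the file paths.
    for pdf in batches:
        rows = []
        for path in pdf["path"]:
            rows.extend(parse_weird_zip(path))
        yield pd.DataFrame(rows)

parsed = files.mapInPandas(
    parse_partition,
    schema="source string, metric string, value double",  # placeholder schema
)
parsed.write.format("delta").mode("append").saveAsTable("bronze.engineering_raw")  # placeholder
```

Whether something like that holds up once custom EXE dependencies come into play is exactly the sort of thing I'm hoping someone here has experience with.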


Has anyone had similar problems that they've solved in a way that scales without bloating? I have experience with Spark, but I'm still relatively new to Databricks, so there may be suitable tools available that I'm not aware of.

thanks

Toby

