Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Ingesting complex/unstructured data

tobyevans
New Contributor

Hi there,

My company is reasonably new to Databricks, and we're running our first PoCs. Some of the data we have is structured or reasonably structured, so it drops into a bucket, we point a notebook at it, and all is well and Delta.

The problem arises with some of the more complex data sources that have been developed over the years, often designed to work with specialist engineering software. Typically, these arrive as a zip file containing a bunch of files in all kinds of formats and shapes. It's quite common to have "csv" files that include many tables (basically a giant print output), or actual csv files that have been so heavily pivoted that almost every column name, and the number of columns, varies between files. It all depends on the inputs, which are provided but again have a complex file format.
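To make the two file shapes concrete, here is a minimal sketch in plain Python. It assumes that tables in a print-style "csv" are separated by blank lines and that each table starts with a header row; both are assumptions about the format, and `split_tables`, `unpivot` and the sample data are hypothetical names for illustration only.

```python
import csv

def split_tables(text):
    """Split a print-style "csv" into tables.

    Assumes blank lines separate the tables (an assumption about the format).
    """
    tables, block = [], []
    for line in text.splitlines():
        if line.strip():
            block.append(line)
        elif block:
            tables.append(list(csv.reader(block)))
            block = []
    if block:
        tables.append(list(csv.reader(block)))
    return tables

def unpivot(rows, id_columns=1):
    """Turn a wide table into long records with "variable" and "value" keys.

    Files with different column names then share one schema.
    """
    header, *data = rows
    records = []
    for row in data:
        ids = dict(zip(header[:id_columns], row[:id_columns]))
        for name, value in zip(header[id_columns:], row[id_columns:]):
            records.append({**ids, "variable": name, "value": value})
    return records

# Hypothetical sample: two tables in one file, the first one pivoted.
sample = "node,load_a,load_b\n1,0.5,0.7\n2,0.1,0.2\n\nmember,stress\nm1,12.5\n"
tables = split_tables(sample)
long_rows = [unpivot(table) for table in tables]
```

The long format keeps the column set fixed even when the pivoted column names change from file to file, which is what makes the result easier to load into a single table.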

So far, so normal: all this can be parsed out either with Python and patience, or occasionally with an exe file that is provided to convert some of the raw files into JSON etc. The question is one of hosting, and there is a debate which I would like to extend to here.

1. This is complex and non-standard processing, so wrap it up in a container or other process and run it in advance of Databricks, extracting the required data and placing it in a DataFrame-friendly format in cloud storage, ready to be read into a Live Table etc. This has the advantage of being able to scale out with the number of files that arrive, and can handle weird dependencies, but requires extra custom infrastructure.

2. Databricks has a Python runtime in the cluster, so it can run any scripts given to it. This has the advantage of not requiring extra infrastructure such as containers, especially since there may be a growing number of such scenarios and we don't want to manage that if we don't have to. However, since this is just Python, not PySpark, no RDDs will be created, so this won't be a scalable process using autoscaling executors, and will be limited to the amount of parallelisation we can squeeze out of a driver node. That's before considering any more esoteric dependencies, such as custom EXEs to do custom parsing.
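One way to keep both options open is to write the parsing as a pure function from archive bytes to records, so the same code can run in a container, on a driver, or elsewhere. This is only a sketch, assuming the archives contain plain CSV members with a header row; `parse_archive` and the sample file names are hypothetical.

```python
import csv
import io
import zipfile

def parse_archive(content):
    """Parse the bytes of one zip archive into a list of flat records.

    Pure function: it touches no global state and no file system,
    so it does not care where it is hosted.
    """
    records = []
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # other formats would need their own parsers
            text = archive.read(name).decode("utf-8")
            for row in csv.DictReader(io.StringIO(text)):
                records.append({"member_file": name, **row})
    return records

# Build a small archive in memory to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("loads.csv", "node,load\n1,0.5\n2,0.1\n")
    archive.writestr("notes.txt", "ignored")
records = parse_archive(buffer.getvalue())
```

On a cluster, one possibility (not tested here) is to load the archives with Spark's `binaryFile` data source, which exposes each file's bytes in a `content` column, and apply a function like this per file so that the work runs on executors rather than only on the driver. Custom EXE dependencies would still need to be installed on every node.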


Has anyone had similar problems that they've solved in a way that scales without bloating? I've experience of Spark, but I'm still relatively new to Databricks, so there may be suitable tools available that I'm not aware of.

thanks

Toby

3 REPLIES

Kaniz_Fatma
Community Manager

Kaniz_Fatma
Community Manager

Hi @tobyevans_arup, your question is quite complex and touches on several aspects of data processing with Databricks.

 

Here are some insights based on the information I found:

 

Handling Complex Data Sources: Databricks provides several features for handling complex data formats. It supports structured, semi-structured, and unstructured data [1]. For instance, you can use Spark SQL’s built-in functions to work with various data formats and trans.... Databricks also simplifies semi-structured data ingestion into Delta Lake [2]. You can use the from_json function to parse JSON strings into nested types such as structs, maps, and arrays.

Scaling Data Processing: Databricks has an optimized autoscaling feature that leverages Spark shuffl... [4]. This feature allows clusters to scale up and down more aggressively in response to load [4]. Databricks adds clusters based on the time it would take to process all currently running queries, a... [5].

Given these capabilities, you might want to consider a hybrid approach:

  • For complex data sources that require significant preprocessing, you could use a separate process (like a container) to transform the data into a DataFrame-friendly format. This process could be scaled out as needed.
  • For data sources that can be handled directly by Databricks, you could leverage the Python runtime in the Databricks cluster and the autoscaling feature to process the data efficiently.
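As a rough illustration of the first bullet, a preprocessing step could write its output as JSON Lines, one file per source archive. This is a sketch under assumptions: `write_jsonl`, the record layout, and the `source_file` field are hypothetical, and a temporary directory stands in for the cloud storage location.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(records, out_dir, source_name):
    """Write one JSON object per line, one output file per source file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{Path(source_name).stem}.jsonl"
    with out_path.open("w", encoding="utf-8") as handle:
        for record in records:
            # Tag each record with its origin so it can be traced later.
            tagged = {**record, "source_file": source_name}
            handle.write(json.dumps(tagged) + "\n")
    return out_path

# Hypothetical records, as a parser might produce them.
records = [{"node": "1", "variable": "load_a", "value": "0.5"}]
staging = tempfile.mkdtemp()  # stand-in for a cloud storage location
path = write_jsonl(records, staging, "run_001.zip")
```

Spark's JSON reader expects line-delimited JSON by default, so files written this way should be readable without extra options.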

I hope this helps! Let me know if you have any other questions. 😊

Kaniz_Fatma
Community Manager

Hey there! Thanks a bunch for being part of our awesome community! 🎉 

We love having you around and appreciate all your questions. Take a moment to check out the responses – you'll find some great info. Your input is valuable, so pick the best solution for you. And remember, if you ever need more help, we're here for you!

Keep being awesome! 😊🚀

 
