Create and update a CSV/JSON file in ADLS Gen2 with Event Hub in Databricks streaming

Aran_Oribu
New Contributor II

Hello,

This is my first post here, and I am a total beginner with Databricks and Spark.

Working on an IoT cloud project with Azure, I'm looking to set up continuous stream processing of data.

An architecture already exists using Stream Analytics, which continuously writes our data to the data lake in JSON and CSV format.

But data aggregation is limited with Stream Analytics, and the integration of machine learning needs to be considered for the future.

That's why I've been looking at Databricks for a week, but I'm stuck on several points and can't find answers.

I managed to connect Databricks to a simulated IoT Hub data feed and send the data to blob storage with the PySpark script below, which I think is incomplete. The problem is that each event creates a file, and the format used produces multiple parquet files.
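
For illustration, a minimal sketch of this kind of job, assuming the azure-eventhubs-spark connector, a mounted ADLS path, and a simplified telemetry schema (the connection string, paths, and columns are placeholders, not the original script):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# IoT Hub's built-in Event Hub-compatible endpoint (placeholder connection string)
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>;EntityPath=<eventhub-name>"
eh_conf = {
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Read the device telemetry as a stream ("spark" is the notebook's SparkSession)
raw = (spark.readStream
            .format("eventhubs")
            .options(**eh_conf)
            .load())

# The payload arrives as binary in the "body" column; cast it and parse the JSON
schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("ts", TimestampType()),
])
events = (raw
          .withColumn("body", F.col("body").cast("string"))
          .withColumn("json", F.from_json("body", schema))
          .select("json.*"))

# Write each micro-batch as parquet to a mounted ADLS Gen2 path (placeholder paths)
query = (events.writeStream
               .format("parquet")
               .option("path", "/mnt/datalake/bronze/telemetry")
               .option("checkpointLocation", "/mnt/datalake/bronze/_checkpoints/telemetry")
               .trigger(processingTime="1 minute")
               .start())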

My question is: is it possible to do with Databricks the same thing that Stream Analytics is currently doing?

I know there are a lot of parameters to specify when creating the write request, but none of them work; I get multiple errors, probably because I'm doing it wrong. If someone could give me some good-practice advice or point me to appropriate documentation, it would be really appreciated.


5 REPLIES

-werners-
Esteemed Contributor III

So the Event Hub creates files (JSON/CSV) on ADLS.

You can read those files into Databricks with the spark.read.csv/json methods. If you want to read many files in one go, you can use wildcards.

e.g. spark.read.json("/mnt/datalake/bronze/directory/*/*.json")
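
As a sketch, the CSV variant with a couple of common read options (the path is a placeholder, and the header/schema options are assumptions about the files):

df = (spark.read
           .option("header", "true")       # assuming the CSV files carry a header row
           .option("inferSchema", "true")
           .csv("/mnt/datalake/bronze/directory/*/*.csv"))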

Aran_Oribu
New Contributor II

Thanks for your answer, but I'm trying to write the data to a CSV/JSON file in the data lake.

I'm not trying to read data from the data lake; I receive the data directly from IoT Hub.

Or maybe I misunderstood your answer.

-werners-
Esteemed Contributor III

Sorry, I misunderstood.

The reason you see so many parquet files is that parquet is immutable; you cannot change an existing file (and parquet is what Delta Lake uses). So when a new event/record is sent, the stream processes it and appends it to the table, which means a new file as well.

If you continuously write data to a Delta table, it will over time accumulate a large number of files, especially if you add data in small batches. This can have an adverse effect on the efficiency of table reads, and it can also affect the performance of your file system. Ideally, a large number of small files should be rewritten into a smaller number of larger files on a regular basis. This is known as compaction.

https://docs.delta.io/latest/best-practices.html#compact-files
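
A sketch of the compaction pattern described on that page, rewriting a Delta table's many small files into fewer, larger ones (the table path and target file count are placeholders):

path = "/mnt/datalake/bronze/telemetry_delta"   # placeholder Delta table path
num_files = 16                                  # target number of larger files

(spark.read
      .format("delta")
      .load(path)
      .repartition(num_files)
      .write
      .option("dataChange", "false")   # layout changed, not the data itself
      .format("delta")
      .mode("overwrite")
      .save(path))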

Another option is to read the events less often, and go for a more batch-based approach.
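
For example, a sketch of running the write as a scheduled batch-style job instead of a 24/7 stream, reusing the parsed events stream and placeholder paths from the earlier sketch:

# Process everything that has arrived since the last run, then stop.
# Scheduling this as a job (e.g. hourly) yields fewer, larger files per run.
query = (events.writeStream
               .format("delta")
               .option("checkpointLocation", "/mnt/datalake/bronze/_checkpoints/telemetry_delta")
               .trigger(once=True)   # newer runtimes also offer availableNow=True
               .start("/mnt/datalake/bronze/telemetry_delta"))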

so "Delta Lake" is the best way to store streaming data? it's impossible to store other than in parquet file? 

The data I receive are in json, the objective would have been to store them continuously either in a json or in csv file until the maximum file size is reached before moving on to the next one like :

StreamingDataFolder/myfile1.json

StreamingDataFolder/myfile2.json

StreamingDataFolder/myfile3.json 

etc.

But I haven't dug into the Delta Lake option yet; I confess I have difficulty understanding this storage option.

Thanks for the link, I'll go look at it.

-werners-
Esteemed Contributor III (Accepted Solution)

You could use another format like CSV or JSON, but I avoid using them, as parquet (and Delta Lake) have nice features like a schema, partitioning, and metadata.

Instead of reading IoT Hub directly, you could first land the data from IoT Hub into a data lake or blob storage (I think using a storage endpoint, not sure), and then afterwards read this data with Databricks and process it.
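
A sketch of that JSON alternative on the write side, reusing the parsed events stream and placeholder paths from the earlier sketch; note that Structured Streaming never appends to an existing file, so each micro-batch still produces new files rather than rolling a single file up to a size limit:

query = (events.writeStream
               .format("json")
               .option("path", "/mnt/datalake/bronze/telemetry_json")
               .option("checkpointLocation", "/mnt/datalake/bronze/_checkpoints/telemetry_json")
               .trigger(processingTime="5 minutes")   # fewer, larger files per batch
               .start())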
