Create and update a CSV/JSON file in ADLS Gen2 with Event Hub in Databricks streaming

Aran_Oribu
New Contributor II

Hello,

This is my first post here, and I am a total beginner with Databricks and Spark.

Working on an IoT cloud project with Azure, I'm looking to set up continuous stream processing of data.

An architecture already exists using Stream Analytics, which continuously writes our data to the data lake in JSON and CSV format.

But data aggregation is limited with Stream Analytics, and the integration of machine learning needs to be considered for the future.

That's why I've been looking at Databricks for a week, but I'm stuck on several points and can't find answers.

I managed to connect Databricks to a simulated IoT Hub data feed and send the data to blob storage with the PySpark script below, which I think is incomplete. The problem is that each event creates a file, and the format used produces multiple parquet files.
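
For illustration, a minimal sketch of this kind of job, assuming the azure-eventhubs-spark connector, a mounted ADLS path, and a simplified telemetry schema (the connection string, paths, and columns are placeholders, not the original script):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# IoT Hub's built-in Event Hub-compatible endpoint (placeholder connection string)
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>;EntityPath=<eventhub-name>"
eh_conf = {
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Read the device telemetry as a stream ("spark" is the notebook's SparkSession)
raw = (spark.readStream
            .format("eventhubs")
            .options(**eh_conf)
            .load())

# The payload arrives as binary in the "body" column; cast it and parse the JSON
schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("ts", TimestampType()),
])
events = (raw
          .withColumn("body", F.col("body").cast("string"))
          .withColumn("json", F.from_json("body", schema))
          .select("json.*"))

# Write each micro-batch as parquet to a mounted ADLS Gen2 path (placeholder paths)
query = (events.writeStream
               .format("parquet")
               .option("path", "/mnt/datalake/bronze/telemetry")
               .option("checkpointLocation", "/mnt/datalake/bronze/_checkpoints/telemetry")
               .trigger(processingTime="1 minute")
               .start())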

My question is: is it possible to do with Databricks the same thing that Stream Analytics is currently doing?

I know there are a lot of parameters to specify when creating the write request, but none of them work; I get multiple errors, probably because I'm doing it wrong. If someone could give me some good-practice advice or point me to appropriate documentation, it would be really appreciated.


5 REPLIES

-werners-
Esteemed Contributor III

So the Event Hub creates files (JSON/CSV) on ADLS.

You can read those files into Databricks with the spark.read.csv/json methods. If you want to read many files in one go, you can use wildcards.

e.g. spark.read.json("/mnt/datalake/bronze/directory/*/*.json")
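
As a sketch, the CSV variant with a couple of common read options (the path is a placeholder, and the header/schema options are assumptions about the files):

df = (spark.read
           .option("header", "true")       # assuming the CSV files carry a header row
           .option("inferSchema", "true")
           .csv("/mnt/datalake/bronze/directory/*/*.csv"))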

Aran_Oribu
New Contributor II

Thanks for your answer, but I'm trying to write the data to a CSV/JSON file in the data lake.

I'm not trying to read data from the data lake; I receive the data directly from IoT Hub.

Or maybe I misunderstood your answer.

-werners-
Esteemed Contributor III

Sorry, I misunderstood.

The reason you see so many parquet files is that parquet is immutable; you cannot change an existing file (and parquet is what Delta Lake uses). So when a new event/record is sent, the stream processes it and appends it to the table, which means a new file as well.

If you continuously write data to a Delta table, it will over time accumulate a large number of files, especially if you add data in small batches. This can have an adverse effect on the efficiency of table reads, and it can also affect the performance of your file system. Ideally, a large number of small files should be rewritten into a smaller number of larger files on a regular basis. This is known as compaction.

https://docs.delta.io/latest/best-practices.html#compact-files
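
A sketch of the compaction pattern described on that page, rewriting a Delta table's many small files into fewer, larger ones (the table path and target file count are placeholders):

path = "/mnt/datalake/bronze/telemetry_delta"   # placeholder Delta table path
num_files = 16                                  # target number of larger files

(spark.read
      .format("delta")
      .load(path)
      .repartition(num_files)
      .write
      .option("dataChange", "false")   # layout changed, not the data itself
      .format("delta")
      .mode("overwrite")
      .save(path))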

Another option is to read the events less often, and go for a more batch-based approach.
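
For example, a sketch of running the write as a scheduled batch-style job instead of a 24/7 stream, reusing the parsed events stream and placeholder paths from the earlier sketch:

# Process everything that has arrived since the last run, then stop.
# Scheduling this as a job (e.g. hourly) yields fewer, larger files per run.
query = (events.writeStream
               .format("delta")
               .option("checkpointLocation", "/mnt/datalake/bronze/_checkpoints/telemetry_delta")
               .trigger(once=True)   # newer runtimes also offer availableNow=True
               .start("/mnt/datalake/bronze/telemetry_delta"))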

so "Delta Lake" is the best way to store streaming data? it's impossible to store other than in parquet file? 

The data I receive are in json, the objective would have been to store them continuously either in a json or in csv file until the maximum file size is reached before moving on to the next one like :

StreamingDataFolder/myfile1.json

StreamingDataFolder/myfile2.json

StreamingDataFolder/myfile3.json 

etc.

But I haven't dug into the Delta Lake option yet; I confess I have difficulty understanding this storage option.

Thanks for the link, I'll go look at it.

-werners-
Esteemed Contributor III (Accepted Solution)

You could use another format like CSV or JSON, but I avoid using them, as parquet (and Delta Lake) have nice features like a schema, partitioning, and metadata.

Instead of reading IoT Hub directly, you could first land the data from IoT Hub into a data lake or blob storage (I think using a storage endpoint, not sure), and then afterwards read this data with Databricks and process it.
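
A sketch of that JSON alternative on the write side, reusing the parsed events stream and placeholder paths from the earlier sketch; note that Structured Streaming never appends to an existing file, so each micro-batch still produces new files rather than rolling a single file up to a size limit:

query = (events.writeStream
               .format("json")
               .option("path", "/mnt/datalake/bronze/telemetry_json")
               .option("checkpointLocation", "/mnt/datalake/bronze/_checkpoints/telemetry_json")
               .trigger(processingTime="5 minutes")   # fewer, larger files per batch
               .start())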
