09-08-2022 03:43 AM
Hello,
This is my first post here and I am a total beginner with Databricks and Spark.
Working on an IoT cloud project with Azure, I'm looking to set up continuous stream processing of our data.
A current architecture already exists using Stream Analytics, which redirects our data to the data lake in JSON and CSV format, updated continuously.
But data aggregation is limited with Stream Analytics, and the integration of machine learning needs to be considered for the future.
That's why I've been looking at Databricks for a week, but I'm stuck on several points and can't find an answer.
I managed to connect Databricks to an IoT Hub data-sending simulation and to write the data to blob storage with a PySpark script (sketched below), which I think is incomplete. The problem is that each event creates a file, and the format used produces multiple Parquet files.
My question is: is it possible to do with Databricks the same thing that Stream Analytics is currently doing?
I know there are a lot of parameters to specify when creating the write request, but none of them work; I get multiple errors, probably because I'm doing it wrong. If someone could give me some good-practice advice or point me to appropriate documentation, it would be really appreciated.
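Roughly, the script looks like the sketch below (simplified; I'm assuming the azure-eventhubs-spark connector here, and the connection string and paths are placeholders):

# Minimal sketch (assumes a Databricks notebook where `spark` is predefined
# and the azure-eventhubs-spark connector is installed on the cluster).
# The connection string and paths below are placeholders.
from pyspark.sql.functions import col

conn_str = "<Event Hubs-compatible connection string of the IoT Hub>"
eh_conf = {
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)
}

# Read the event stream from the IoT Hub's Event Hubs-compatible endpoint.
raw = (spark.readStream
       .format("eventhubs")
       .options(**eh_conf)
       .load())

# The payload arrives as binary in the 'body' column; cast it to a string.
events = raw.select(col("body").cast("string").alias("body"))

# Write the stream to blob storage as Parquet; this is what produces
# one or more files per micro-batch.
query = (events.writeStream
         .format("parquet")
         .option("path", "/mnt/datalake/bronze/iot/")
         .option("checkpointLocation", "/mnt/datalake/checkpoints/iot/")
         .start())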
09-08-2022 03:48 AM
So the event hub creates files (JSON/CSV) on ADLS.
You can read those files into Databricks with the spark.read.csv/json methods. If you want to read many files in one go, you can use wildcards.
e.g. spark.read.json("/mnt/datalake/bronze/directory/*/*.json")
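For example, something like this (the schema fields here are just placeholders for whatever your telemetry contains; declaring a schema up front avoids the cost of schema inference over many files):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Placeholder schema - replace the fields with your actual telemetry columns.
schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("enqueuedTime", TimestampType()),
])

# Read every matching JSON file under the directory in one go.
df = (spark.read
      .schema(schema)
      .json("/mnt/datalake/bronze/directory/*/*.json"))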
09-08-2022 04:30 AM
Thanks for your answer, but I'm trying to write the data to a CSV/JSON file in the data lake.
I'm not trying to read the data from the data lake; I receive the data directly from IoT Hub.
Or maybe I misunderstood your answer.
09-08-2022 04:40 AM
Sorry, I misunderstood.
The reason you see this many Parquet files is that Parquet is immutable; you cannot change a file once it is written (and Parquet is what Delta Lake uses underneath). So when a new event/record is sent, the stream processes it and appends it to the table, which means a new file as well.
If you continuously write data to a Delta table, it will over time accumulate a large number of files, especially if you add data in small batches. This can have an adverse effect on the efficiency of table reads, and it can also affect the performance of your file system. Ideally, a large number of small files should be rewritten into a smaller number of larger files on a regular basis. This is known as compaction.
https://docs.delta.io/latest/best-practices.html#compact-files
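The linked page boils down to a rewrite like this (the path and target file count are placeholders):

# Compaction sketch following the Delta Lake best-practices pattern:
# rewrite many small files into a smaller number of larger ones.
path = "/mnt/datalake/delta/iot_events"   # placeholder table path
num_files = 16                            # placeholder target file count

(spark.read
 .format("delta")
 .load(path)
 .repartition(num_files)
 .write
 .option("dataChange", "false")   # layout-only rewrite, no data change
 .format("delta")
 .mode("overwrite")
 .save(path))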
Another option is to read the events less often, and go for a more batch-based approach.
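For the batch-oriented route, a processing-time trigger makes the micro-batches larger and less frequent; a rough sketch, assuming an existing streaming DataFrame called events and placeholder paths:

# Sketch: trigger a micro-batch only every 10 minutes instead of per event.
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/datalake/checkpoints/iot_delta/")
         .trigger(processingTime="10 minutes")
         .start("/mnt/datalake/delta/iot_events"))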
09-08-2022 04:55 AM
so "Delta Lake" is the best way to store streaming data? it's impossible to store other than in parquet file?
The data I receive are in json, the objective would have been to store them continuously either in a json or in csv file until the maximum file size is reached before moving on to the next one like :
StreamingDataFolder/myfile1.json
StreamingDataFolder/myfile2.json
StreamingDataFolder/myfile3.json
etc...
But I haven't dug into the Delta Lake option yet; I confess I have difficulty understanding this storage option.
Thanks for the link, I'll go take a look at it.
09-08-2022 05:02 AM
You could use another format like CSV or JSON, but I avoid using them, as Parquet (and Delta Lake) has nice features like a schema, partitioning, and metadata.
Instead of reading IoT Hub directly, you could first land the data from IoT Hub into a data lake or blob storage (I think using a storage endpoint with message routing, but I'm not sure), and then afterwards read this data with Databricks and process it.
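If you do want plain JSON files from the stream, the sink format can simply be swapped; a rough sketch, again assuming an existing streaming DataFrame called events and placeholder paths (note that Spark still writes one or more files per micro-batch, so the small-file issue applies here too):

# Sketch: write the stream out as JSON files instead of Parquet/Delta.
query = (events.writeStream
         .format("json")
         .option("path", "/mnt/datalake/bronze/iot_json/")
         .option("checkpointLocation", "/mnt/datalake/checkpoints/iot_json/")
         .trigger(processingTime="5 minutes")
         .start())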