Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Simulating Real-Time Streaming in Databricks Free Edition

ManojkMohan
Contributor III

Use Case:

  1. Kafka real-time streaming of network telemetry logs
  2. In a real use case, approximately 40 TB of data per day can be streamed in real time
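For context, the kind of telemetry event being simulated can be sketched in plain Python (a minimal sketch with a hypothetical schema; the field names mirror the ones generated in the Spark code below):

```python
import json
import random
import time

def telemetry_record(seq: int) -> dict:
    """One synthetic network-telemetry event (hypothetical schema)."""
    tower_id = seq % 50_000
    return {
        "tower_id": tower_id,
        "metric": random.random() * 100,
        "region": "North" if tower_id % 5 == 0 else "South",
        "ts": time.time(),
    }

# e.g. one "second" of simulated traffic at 1000 events/sec
batch = [telemetry_record(i) for i in range(1000)]
print(json.dumps(batch[0]))
```

In a full Kafka setup these records would be serialized and sent by a producer; here the rate source below stands in for that producer.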

Architecture:

(architecture diagram attached: ManojkMohan_0-1757780238326.png)

Issue encountered: when I try to simulate Kafka-like streaming within Databricks itself, I get an error, because this is the Free Edition and I am using serverless compute:

[INFINITE_STREAMING_TRIGGER_NOT_SUPPORTED] Trigger type ProcessingTime is not supported for this cluster type. Use a different trigger type e.g. AvailableNow, Once. SQLSTATE: 0A000

Code:

from pyspark.sql.functions import col, expr

# Generate 1000 rows per second
streamingDF = (spark.readStream
               .format("rate")
               .option("rowsPerSecond", 1000)
               .load())

# Add simulated IoT telemetry fields
telemetryDF = streamingDF.withColumn("tower_id", (col("value") % 50_000)) \
                         .withColumn("metric", expr("rand() * 100")) \
                         .withColumn("region", expr("CASE WHEN tower_id % 5 = 0 THEN 'North' ELSE 'South' END"))

# Write to Delta (simulated Bronze table)
query = (telemetryDF.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/telemetry_checkpoints")
         .option("path", "/tmp/telemetry_bronze")
         .start())

Any suggestions for a workaround?

1 REPLY

szymon_dybczak
Esteemed Contributor III

Hi @ManojkMohan ,

Unfortunately, there is no direct workaround here. The Free Edition supports only serverless compute, and serverless has several streaming limitations, one of which you just encountered: there is no support for default or time-based trigger intervals:

(screenshot of the serverless streaming limitations attached: szymon_dybczak_0-1757787425119.png)

You can still run your code, but you need to provide a supported trigger (for instance, availableNow):

query = (telemetryDF.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/telemetry_checkpoints")
         .option("path", "/tmp/telemetry_bronze")
         .trigger(availableNow=True)  # Process all available data now
         .start())

If you want to POC other streaming triggers, you either need access to a Databricks Premium workspace with classic compute, or you can download an OSS Apache Spark Docker container.
