Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by nolanlavender00 · New Contributor
  • 5291 Views
  • 2 replies
  • 1 kudos

Resolved! How to stop a Streaming Job based on time of the week

I have an always-on job cluster triggering Spark Streaming jobs. I would like to stop this streaming job once a week to run table maintenance. I was looking to leverage the foreachBatch function to check a condition and stop the job accordingly.

Latest Reply
mroy
Contributor
  • 1 kudos

You could also use the "Available-now micro-batch" trigger. It only processes one batch at a time, and you can do whatever you want in between batches (sleep, shut down, vacuum, etc.)
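
A minimal sketch of that suggestion (the Delta paths and checkpoint location below are placeholders, not from the thread): with the available-now trigger the query drains whatever data is currently in the source and then stops, leaving a window for table maintenance before the job is started again.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Available-now trigger (Spark 3.3+): process everything that is currently
# available, possibly as several micro-batches, then terminate the query.
query = (
    spark.readStream.format("delta").load("/data/source")          # placeholder source
    .writeStream.format("delta")
    .option("checkpointLocation", "/data/checkpoints/stream")      # placeholder checkpoint
    .trigger(availableNow=True)
    .start("/data/target")                                         # placeholder target
)
query.awaitTermination()

# The stream has stopped at this point, so maintenance can run safely, e.g.:
spark.sql("VACUUM delta.`/data/target`")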

by RajaLakshmanan · New Contributor
  • 3456 Views
  • 2 replies
  • 1 kudos

Resolved! Spark StreamingQuery not processing all data from source directory

Hi, I have set up a streaming process that consumes files from an HDFS staging directory and writes them into a target location. The input directory continuously gets files from another process. Let's say the file producer produces 5 million records and sends them to the HDFS sta...

Latest Reply
User16763506586
Contributor
  • 1 kudos

If it helps, you can try running a left-anti join on the source and the sink to identify the missing records, and then check whether those records match the schema provided.
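
A rough sketch of that check (the paths and the join key are assumptions, not from the thread): a left-anti join keeps only the source rows that have no match in the sink, i.e. the records that never reached the target.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source_df = spark.read.parquet("/staging/input")     # placeholder: HDFS staging directory
sink_df = spark.read.parquet("/target/output")       # placeholder: streaming sink location

# "record_id" is a hypothetical unique key; use whatever identifies a record in your data.
missing = source_df.join(sink_df, on="record_id", how="left_anti")
print(f"Records in source but not in sink: {missing.count()}")
missing.show(truncate=False)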

by ArindamHalder · New Contributor II
  • 2098 Views
  • 1 reply
  • 3 kudos

Resolved! Is there any performance result available for Delta Lake?

Specifically for writing and reading streaming data to HDFS or S3, etc. For an IoT-specific scenario, how does it perform on time-series transactional data? Can we consider a Delta table as a time-series table?

Latest Reply
mathan_pillai
Databricks Employee
  • 3 kudos

Hi @Arindam Halder, Delta Lake is more performant compared to a regular Parquet table. Please check the link below for some stats on the performance: https://docs.azuredatabricks.net/_static/notebooks/delta/optimize-python.html Yes, you can use it for time series...
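
As a loose illustration of the time-series point (the path and column names are made up for the example, not from the thread): partitioning the Delta table by date and Z-ordering by the event timestamp is a common layout that keeps time-range queries on IoT data fast.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the IoT event stream.
events_df = spark.createDataFrame(
    [("dev-1", "2024-01-01", "2024-01-01T00:00:00", 21.5)],
    ["device_id", "event_date", "event_ts", "value"],
)

(events_df.write
    .format("delta")
    .partitionBy("event_date")        # one partition per day for coarse time pruning
    .mode("append")
    .save("/delta/iot_events"))       # hypothetical path

# Co-locate rows by timestamp within each partition so time-range scans read less data.
spark.sql("OPTIMIZE delta.`/delta/iot_events` ZORDER BY (event_ts)")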

by MithuWagh · New Contributor
  • 7526 Views
  • 1 reply
  • 0 kudos

How to deal with a column name containing a . (dot) in a PySpark DataFrame?

We are streaming data from a Kafka source as JSON, but in some columns we are getting a . (dot) in the column names. Streaming JSON data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`")
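
A small self-contained sketch of that (the column name is illustrative, not from the original data): backticks keep the dot from being read as a struct accessor, and a rename avoids the escaping altogether.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "col0.1"])

df.select("`col0.1`").show()                    # backticks: the dot is part of the column name
df = df.withColumnRenamed("col0.1", "col0_1")   # optional: rename so downstream code needs no escaping
df.printSchema()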
