Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Maksym
by New Contributor III
  • 9338 Views
  • 5 replies
  • 7 kudos

Resolved! Databricks Autoloader is getting stuck and does not pass to the next batch

I have a simple job scheduled every 5 min. It just listens for cloudFiles on a storage account and writes them into a Delta table; extremely simple. The code is something like this: df = (spark.readStream.format("cloudFiles").option('cloudFil...

Latest Reply
lassebe
New Contributor II
  • 7 kudos

I had the same issue: files would randomly not be loaded. Setting `.option("cloudFiles.useIncrementalListing", False)` seemed to do the trick!
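For anyone hitting the same thing, here is roughly what that fix looks like in context — a minimal sketch only, with placeholder paths and source format:

```python
# Minimal Auto Loader stream with incremental listing disabled.
# All paths and the source format below are placeholders.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Force a full directory listing each batch instead of the
      # incremental listing that was getting stuck.
      .option("cloudFiles.useIncrementalListing", "false")
      .load("abfss://container@account.dfs.core.windows.net/landing"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/landing")
   .toTable("bronze.landing"))
```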

4 More Replies
herry
by New Contributor III
  • 3964 Views
  • 4 replies
  • 4 kudos

Resolved! Get the list of loaded files from Autoloader

Hello, we can use Autoloader to track which files have or haven't been loaded from an S3 bucket. My question about Autoloader: is there a way to read the Autoloader database to get the list of files that have been loaded? I can easily do this in AWS Glue j...
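For later readers: on newer Databricks runtimes there is a cloud_files_state table-valued function that reads exactly this bookkeeping out of the Auto Loader checkpoint. A hedged sketch, with a placeholder checkpoint path:

```python
# Query the files Auto Loader has recorded in its checkpoint.
# The checkpoint path is a placeholder; requires a runtime recent
# enough to ship cloud_files_state (DBR 10.x era, if memory serves).
loaded_files = spark.sql(
    "SELECT * FROM cloud_files_state('/mnt/checkpoints/landing')"
)
loaded_files.show(truncate=False)
```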

Latest Reply
Anonymous
Not applicable
  • 4 kudos

@Herry Ramli - Would you be happy to mark Hubert's answer as best so that other members can find the solution more easily? Thanks!

3 More Replies
Bhawna_bedi
by New Contributor II
  • 5713 Views
  • 5 replies
  • 5 kudos
Latest Reply
merca
Valued Contributor II
  • 5 kudos

If you are streaming to Delta, not much: the micro-batch will fail, and next time the stream will pick up from the last successful write (thanks to ACID guarantees). I don't know what happens with other formats if the stream is aborted mid micro-batch.
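To make that concrete, the guarantee comes from pairing a checkpoint with the Delta sink's atomic commits; a sketch, assuming df is some streaming DataFrame and the paths are placeholders:

```python
# If a micro-batch dies mid-write, the Delta commit never lands, so a
# restart replays that batch from the offsets stored in the checkpoint
# instead of duplicating or losing data.
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .start("/mnt/delta/events"))
```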

4 More Replies
hari
by Contributor
  • 4878 Views
  • 3 replies
  • 3 kudos

Multiple streaming sources to the same delta table

Is it possible to have two streaming sources doing a MERGE into the same Delta table, with each source setting a different set of fields? We are trying to create a single table which will be used by the service layer for queries. The table can be populat...

Latest Reply
hari
Contributor
  • 3 kudos

Hi @Zachary Higgins, thanks for the reply. Currently we are also using Trigger.once so that we can handle the merge stream dependencies properly, but I was wondering whether we can scale our pipeline to be streaming by changing the trigger duration in t...
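For anyone following along, the pattern being discussed is typically one foreachBatch MERGE per source into the shared table; a rough sketch with made-up table and column names:

```python
from delta.tables import DeltaTable

def upsert_source_a(batch_df, batch_id):
    # Source A only touches the columns it owns; names are hypothetical.
    target = DeltaTable.forName(spark, "serving.combined")
    (target.alias("t")
     .merge(batch_df.alias("s"), "t.id = s.id")
     .whenMatchedUpdate(set={"field_a": "s.field_a"})
     .whenNotMatchedInsertAll()
     .execute())

(source_a_df.writeStream
 .foreachBatch(upsert_source_a)
 .option("checkpointLocation", "/mnt/checkpoints/source_a")
 .trigger(once=True)  # the Trigger.once setup mentioned above
 .start())
```

Moving to a continuous trigger is mostly a matter of swapping the trigger, but two always-on MERGE writers against one table can then hit Delta's concurrent-modification conflicts, which is worth testing before scaling up.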

2 More Replies
Himanshi
by New Contributor III
  • 1863 Views
  • 1 reply
  • 6 kudos

How can we exclude existing files when moving a streaming job from one Databricks workspace to another that may not be compatible with the existing checkpoint state, so the stream can resume processing?

We do not want to process all the old files; we only want to process the latest files. Whenever we use a new checkpoint path in another Databricks workspace, the streaming job processes all the old files as well. Without the Autoloader feature, is there ...

Latest Reply
Shalabh007
Honored Contributor
  • 6 kudos

@Himanshi Patle, in Spark Structured Streaming there is an option, maxFileAge, with which you can control which files get processed based on their timestamp.
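A sketch of how that looks on a plain file source (format, schema, and path are placeholders; check the docs for how maxFileAge interacts with the first batch and latestFirst before relying on it):

```python
# Non-Auto Loader file stream that ignores files older than the given
# age. input_schema is a placeholder -- plain file sources need an
# explicit schema for streaming reads.
df = (spark.readStream
      .format("json")
      .schema(input_schema)
      .option("maxFileAge", "1d")
      .load("/mnt/raw/events"))
```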

User16783853906
by Contributor III
  • 3596 Views
  • 5 replies
  • 5 kudos

Resolved! Update code for a streaming job in Production

How do you update a streaming job in production with minimal or no downtime when there are significant code changes that may not be compatible with the existing checkpoint state used to resume stream processing?
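One pattern that comes up for this (assuming the source is a Delta table; version numbers and paths below are placeholders): deploy the new code with a fresh checkpoint, but pin the source to where the old query stopped so history is not reprocessed.

```python
# New code, new checkpoint: pin the Delta source to the last version the
# old job processed so the rewritten query does not replay history.
df = (spark.readStream
      .format("delta")
      .option("startingVersion", 1234)  # placeholder: where the old job stopped
      .load("/mnt/delta/source"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/job_v2")  # fresh checkpoint
   .start("/mnt/delta/target"))
```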

4 More Replies
Confused
by New Contributor III
  • 9778 Views
  • 7 replies
  • 2 kudos

Schema evolution issue

Hi all, I am loading some data using Auto Loader but am having trouble with schema evolution. A new column has been added to the data I am loading and I am getting the following error: StreamingQueryException: Encountered unknown field(s) during parsing:...

Latest Reply
rgrosskopf
New Contributor II
  • 2 kudos

I agree that hints are the way to go if you have the schema available, but the whole point of schema evolution is that you might not always know the schema in advance. I received a similar error with a similar streaming query configuration. The issue w...
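For reference, the Auto Loader options in play here look roughly like this (paths and the hint column are placeholders):

```python
# Auto Loader with schema tracking plus evolution. With addNewColumns the
# stream still stops once when it first sees a new column; on restart it
# picks the column up, so pair this with automatic job retries.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
      # Optional: pin just the types you know up front.
      .option("cloudFiles.schemaHints", "amount DECIMAL(18,2)")
      .load("/mnt/raw/events"))
```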

6 More Replies
Confused
by New Contributor III
  • 5417 Views
  • 3 replies
  • 3 kudos

Resolved! Dealing with updates to a delta table being used as a streaming source

Hi all, I have a requirement to perform updates on a Delta table that is the source for a streaming query. I would like to be able to update the table and have the stream continue to work, while also not ending up with duplicates. From my research it se...
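For later readers, the knob usually discussed for this is the Delta source's ignoreChanges option (newer runtimes also have skipChangeCommits); a sketch with placeholder paths, noting it does not deduplicate for you:

```python
# Keep the stream alive across UPDATE/MERGE/DELETE on the source table.
# ignoreChanges re-emits rows from rewritten files, so updated rows can
# arrive downstream again -- handle duplicates in the sink if needed.
df = (spark.readStream
      .format("delta")
      .option("ignoreChanges", "true")
      .load("/mnt/delta/source"))
```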

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hey @Mathew Walters, hope you are doing great. Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution? Otherwise, please let us know if you need more help. We'd love to hear from you. Thanks!

2 More Replies
dataslicer
by Contributor
  • 5790 Views
  • 7 replies
  • 2 kudos

Resolved! Exploring additional cost saving options for structured streaming 24x7x365 uptime workloads

I currently have multiple jobs (each running its own job cluster) for my Spark Structured Streaming pipelines, which are long-running 24x7x365 on DBR 9.x/10.x LTS. My SLAs are 24x7x365 with 1-minute latency. I have already accomplished the following co...

6 More Replies
Anonymous
by Not applicable
  • 6146 Views
  • 8 replies
  • 2 kudos

Resolved! Issue in creating workspace - Custom AWS Configuration

We tried to create a new workspace using "Custom AWS Configuration" with our own VPC (customer-managed VPC), but the workspace failed to launch. We are getting the error below and can't understand where the issue is. Workspace...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Mitesh Patel - As Atanu thinks the issue may be resolved, I wanted to check in with you as well. How goes it?

7 More Replies
GMO
by New Contributor III
  • 2832 Views
  • 4 replies
  • 1 kudos

Resolved! Trigger.AvailableOnce in Pyspark?

There's a new Trigger.AvailableOnce option in Runtime 10.1 that we need in order to process a large folder bit by bit using Autoloader, but I don't see how to engage this from PySpark. Is this accessible from Scala only, or is it available in PySpark? Thanks...

Latest Reply
pottsork
New Contributor II
  • 1 kudos

Any update on this issue? I can see that one can use .trigger(availableNow=True) in DBR 10.3 (on Azure Databricks)... Unfortunately I can't get it to work with Autoloader. Is this supported? Additionally, I can't find any answers when skimming through ...
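In case it helps, here is the shape that works on runtimes where the Python flag exists (paths are placeholders; on older DBR versions the trigger was Scala-only, which may be what you are hitting):

```python
# Drain everything currently in the folder in rate-limited batches,
# then stop -- like Trigger.Once, but incremental. Paths are placeholders.
(spark.readStream
 .format("cloudFiles")
 .option("cloudFiles.format", "parquet")
 .option("cloudFiles.schemaLocation", "/mnt/schemas/big_folder")
 .load("/mnt/raw/big_folder")
 .writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/big_folder")
 .trigger(availableNow=True)
 .start("/mnt/delta/big_folder"))
```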

3 More Replies
BorislavBlagoev
by Valued Contributor III
  • 5521 Views
  • 4 replies
  • 4 kudos

Resolved! Databricks writeStream checkpoint

I'm trying to execute this writeStream: data_frame.writeStream.format("delta") \ .option("checkpointLocation", checkpoint_path) \ .trigger(processingTime="1 second") \ .option("mergeSchema", "true") \ .o...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

You can remove that folder so it will be recreated automatically. Additionally, every new job run should have a new (or just empty) checkpoint location. You can add this to your code before running the stream: dbutils.fs.rm(checkpoint_path, True). Additionally you...
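Spelled out, the suggestion looks like this (checkpoint and target paths are placeholders; deleting the checkpoint means the next run reprocesses from scratch):

```python
# Recursively delete the old checkpoint so the stream starts fresh.
checkpoint_path = "/mnt/checkpoints/my_stream"  # placeholder
dbutils.fs.rm(checkpoint_path, True)

(data_frame.writeStream
 .format("delta")
 .option("checkpointLocation", checkpoint_path)
 .option("mergeSchema", "true")
 .trigger(processingTime="1 second")
 .start("/mnt/delta/target"))  # placeholder
```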

3 More Replies