by
Maksym
• New Contributor III
- 9338 Views
- 5 replies
- 7 kudos
I have a simple job scheduled every 5 minutes. Basically it listens to cloud files on a storage account and writes them into a delta table; extremely simple. The code is something like this:
df = (spark
  .readStream
  .format("cloudFiles")
  .option('cloudFil...
Latest Reply
I had the same issue: files would randomly not be loaded. Setting `.option("cloudFiles.useIncrementalListing", False)` seemed to do the trick!
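A minimal sketch of the workaround described above, with incremental listing disabled; the file format, storage paths, and table name are placeholders rather than values from the thread (`spark` is the SparkSession Databricks provides):

```python
# Auto Loader stream with incremental listing turned off, so discovery falls
# back to full directory listings instead of relying on lexical file ordering.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")                        # placeholder format
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      .option("cloudFiles.useIncrementalListing", "false")
      .load("abfss://landing@mystorageaccount.dfs.core.windows.net/events"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .toTable("bronze.events"))
```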
4 More Replies
by
herry
• New Contributor III
- 3964 Views
- 4 replies
- 4 kudos
Hello, we can use Autoloader to track whether files have been loaded from an S3 bucket or not. My question about Autoloader: is there a way to read the Autoloader database to get the list of files that have been loaded? I can easily do this in AWS Glue j...
Latest Reply
@Herry Ramli - Would you be happy to mark Hubert's answer as best so that other members can find the solution more easily? Thanks!
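A hedged sketch of one way to list what Auto Loader has ingested on newer runtimes: the cloud_files_state table-valued function (available on more recent DBR versions) exposes the stream's checkpoint state. The checkpoint path below is a placeholder:

```python
# Query the Auto Loader checkpoint for the files the stream has discovered.
loaded_files = spark.sql(
    "SELECT * FROM cloud_files_state('/mnt/checkpoints/my_autoloader_stream')"
)
loaded_files.select("path").show(truncate=False)
```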
3 More Replies
- 4878 Views
- 3 replies
- 3 kudos
Is it possible to have two streaming sources merging into the same delta table, with each source setting a different set of fields? We are trying to create a single table which will be used by the service layer for queries. The table can be populat...
Latest Reply
Hi @Zachary Higgins, thanks for the reply. Currently, we are also using Trigger.Once so that we can handle the merge stream dependencies properly, but was wondering whether we can scale our pipeline to be streaming by changing the Trigger duration in t...
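For reference, a sketch of the foreachBatch + MERGE pattern being discussed, with one of the two streams owning only its own column on the shared target; source_a_df, the table, and all column names are illustrative placeholders:

```python
from delta.tables import DeltaTable

def upsert_source_a(batch_df, batch_id):
    # Merge only this source's columns; the other stream owns its own columns.
    target = DeltaTable.forName(spark, "serving.unified_table")
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.id = s.id")
           .whenMatchedUpdate(set={"col_a": "s.col_a"})
           .whenNotMatchedInsert(values={"id": "s.id", "col_a": "s.col_a"})
           .execute())

(source_a_df.writeStream
    .foreachBatch(upsert_source_a)
    .option("checkpointLocation", "/mnt/checkpoints/source_a")
    .trigger(processingTime="1 minute")   # continuous micro-batches instead of Trigger.Once
    .start())
```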
2 More Replies
- 1863 Views
- 1 replies
- 6 kudos
We do not want to process all the old files; we only want to process the latest files. Whenever we use a new checkpoint path in another Databricks workspace, the streaming job processes all the old files as well. Without the Autoloader feature, is there ...
Latest Reply
@Himanshi Patle In Spark Streaming there is an option, maxFileAge, which you can use to control which files are processed based on their timestamp.
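A small sketch of that option with the plain file source (no Auto Loader); the path, format, and schema are placeholders:

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Plain file sources need an explicit schema; this one is illustrative.
event_schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
])

# maxFileAge: files older than this threshold (relative to the newest files
# seen) are ignored, so a fresh checkpoint does not replay the full history.
df = (spark.readStream
      .format("json")
      .schema(event_schema)
      .option("maxFileAge", "1d")
      .load("/mnt/landing/events/"))
```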
- 3596 Views
- 5 replies
- 5 kudos
How do you update a streaming job in production with minimal or no downtime when there are significant code changes that may not be compatible with the existing checkpoint state used to resume stream processing?
Latest Reply
Thanks for the information, I will try to figure out more. Keep sharing such informative posts. MA Health Connector
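The visible replies do not show a concrete recipe, so the following is only a sketch of one common approach, assuming the source is a Delta table: deploy the changed code with a fresh checkpoint and pin the starting point so already-processed data is not replayed. All names, paths, and the version number are placeholders:

```python
# Redeployed stream with a new checkpoint, started from where the old job left off.
df = (spark.readStream
      .format("delta")
      .option("startingVersion", 1842)   # hypothetical last version handled by the old job
      .table("bronze.events"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/events_v2")  # fresh checkpoint
   .toTable("silver.events"))
```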
4 More Replies
- 9778 Views
- 7 replies
- 2 kudos
Hi All, I am loading some data using Auto Loader but am having trouble with schema evolution. A new column has been added to the data I am loading and I am getting the following error: StreamingQueryException: Encountered unknown field(s) during parsing:...
Latest Reply
I agree that hints are the way to go if you have the schema available, but the whole point of schema evolution is that you might not always know the schema in advance. I received a similar error with a similar streaming query configuration. The issue w...
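For reference, a sketch of the Auto Loader schema-evolution settings under discussion; the paths and the hinted column are placeholders:

```python
# With the default addNewColumns mode the stream fails once when an unknown
# field appears and picks the new column up after a restart; a schema hint
# pins the type of a column you already know about.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/orders")
      .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
      .option("cloudFiles.schemaHints", "amount DECIMAL(18,2)")
      .load("/mnt/landing/orders/"))
```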
6 More Replies
by
Kapur
• New Contributor II
- 769 Views
- 0 replies
- 2 kudos
Does the Delta Lake framework's merge operation require a schema for Spark Structured Streaming processing?
- 5417 Views
- 3 replies
- 3 kudos
Hi All, I have a requirement to perform updates on a delta table that is the source for a streaming query. I would like to be able to update the table and have the stream continue to work while also not ending up with duplicates. From my research it se...
Latest Reply
Hey @Mathew Walters, hope you are doing great. Just wanted to check in to see if you were able to resolve your issue, and would you be happy to share the solution? Else please let us know if you need more help. We'd love to hear from you. Thanks!
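One pattern often suggested for this case, sketched under the assumption that re-emitted rows are deduplicated downstream; the table name is a placeholder:

```python
# ignoreChanges keeps the stream alive when the source Delta table is updated
# in place, but rewritten files are re-emitted, so the consumer must
# deduplicate (for example with a keyed MERGE).
df = (spark.readStream
      .format("delta")
      .option("ignoreChanges", "true")
      .table("bronze.customers"))
```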
2 More Replies
- 5790 Views
- 7 replies
- 2 kudos
I currently have multiple jobs (each running its own job cluster) for my Spark Structured Streaming pipelines that run 24x7x365 on DBR 9.x/10.x LTS. My SLAs are 24x7x365 with 1-minute latency. I have already accomplished the following co...
- 6146 Views
- 8 replies
- 2 kudos
We have tried to create a new workspace using "Custom AWS Configuration" with our own VPC (customer-managed VPC), but the workspace failed to launch. We are getting the error below and cannot understand where the issue lies. Workspace...
Latest Reply
@Mitesh Patel - As Atanu thinks the issue may be resolved, I wanted to check in with you, also. How goes it?
7 More Replies
by
GMO
• New Contributor III
- 2832 Views
- 4 replies
- 1 kudos
There’s a new Trigger.AvailableOnce option in runtime 10.1 that we need in order to process a large folder bit by bit using Autoloader, but I don’t see how to invoke it from PySpark. Is this accessible from Scala only, or is it available in PySpark? Thanks...
Latest Reply
Any update on this issue? I can see that one can use .trigger(availableNow=True) in DBR 10.3 (on Azure Databricks)... Unfortunately I can't get it to work with Autoloader. Is this supported? Additionally, I can't find any answers when skimming through ...
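A sketch of that trigger combined with Auto Loader, for runtimes recent enough to expose availableNow from PySpark; paths and names are placeholders:

```python
# availableNow processes everything currently in the folder (split across
# multiple micro-batches, honouring options like maxFilesPerTrigger) and stops.
(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/big_folder")
      .load("/mnt/landing/big_folder/")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/mnt/checkpoints/big_folder")
      .trigger(availableNow=True)
      .toTable("bronze.big_folder"))
```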
3 More Replies
- 5521 Views
- 4 replies
- 4 kudos
I'm trying to execute this writeStream:
data_frame.writeStream.format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(processingTime="1 second") \
    .option("mergeSchema", "true") \
    .o...
Latest Reply
You can remove that folder so it will be recreated automatically. Additionally, every new job run should have a new (or just empty) checkpoint location. You can add this to your code before running the stream: dbutils.fs.rm(checkpoint_path, True). Additionally you...
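Putting the reply together with the original snippet, a sketch; checkpoint_path, the target path, and data_frame are placeholders, and dbutils is the utility object Databricks provides in notebooks:

```python
checkpoint_path = "/mnt/checkpoints/my_stream"

# Clearing the checkpoint forces the stream to start from scratch on the next
# run; only do this when reprocessing the source is acceptable.
dbutils.fs.rm(checkpoint_path, True)

(data_frame.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true")
    .trigger(processingTime="1 second")
    .start("/mnt/delta/my_table"))
```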
3 More Replies