Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by explorer, New Contributor III
  • 3374 Views
  • 4 replies
  • 1 kudos

Resolved! Deleting records manually in a Databricks streaming table

Hi Team, let me know if there are any ways I can delete records manually from a Databricks streaming table without corrupting the table and data. Can we delete a few records (based on some condition) manually in a Databricks streaming table (having checkpoi...

Latest Reply by JunYang (New Contributor III)
  • 1 kudos

  If you use the applyChanges method in DLT for Change Data Capture (CDC), you can delete records manually without affecting the consistency of the table, as applyChanges respects manual deletions. You must configure your DLT pipeline to respect manu...
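For reference, a minimal sketch of what an apply_changes-based DLT pipeline can look like with the current DLT Python API. The source view name cdc_feed and the id/ts/operation columns are assumptions for illustration, not the poster's actual schema:

```python
import dlt
from pyspark.sql.functions import expr

# Target streaming table maintained by apply_changes (CDC).
dlt.create_streaming_table("target_table")

dlt.apply_changes(
    target="target_table",
    source="cdc_feed",       # assumed name of the CDC source view
    keys=["id"],             # assumed primary key column
    sequence_by="ts",        # assumed ordering column for out-of-order events
    apply_as_deletes=expr("operation = 'DELETE'"),  # rows flagged as deletes are removed
)
```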

3 More Replies
by sparkstreaming, New Contributor III
  • 4949 Views
  • 5 replies
  • 4 kudos

Resolved! Missing rows while processing records using foreachBatch in Spark Structured Streaming from Azure Event Hub

I am new to real-time scenarios and I need to create a Spark Structured Streaming job in Databricks. I am trying to apply some rule-based validations from backend configurations on each incoming JSON message. I need to do the following actions on th...
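A common shape for this kind of job is a foreachBatch validation sink. Below is a minimal sketch, assuming the stream has already been parsed from Event Hub JSON into a DataFrame stream_df with an amount column; the rule, table names, and checkpoint path are all hypothetical:

```python
from pyspark.sql.functions import col

def validate_and_write(batch_df, batch_id):
    # Inside foreachBatch the micro-batch is a plain DataFrame, so any
    # rule-based logic (joins against config tables, filters, etc.) works.
    valid = batch_df.filter(col("amount") >= 0)   # example rule (hypothetical)
    invalid = batch_df.filter(col("amount") < 0)
    valid.write.format("delta").mode("append").saveAsTable("valid_events")
    invalid.write.format("delta").mode("append").saveAsTable("invalid_events")

query = (
    stream_df.writeStream                 # stream_df: parsed Event Hub stream (assumed)
    .foreachBatch(validate_and_write)
    .option("checkpointLocation", "/mnt/checkpoints/validation")  # assumed path
    .start()
)
```

Note that writes inside foreachBatch should be idempotent: Spark may re-run a batch after a failure, and restarting with a stale checkpoint is a common cause of apparently missing rows.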

Latest Reply by Rishi045 (New Contributor III)
  • 4 kudos

Were you able to find any solution? If yes, could you please share it?

4 More Replies
by AzureDatabricks, New Contributor III
  • 7875 Views
  • 7 replies
  • 2 kudos

Resolved! Can we store 300 million records, and what is the preferred compute type and config?

How can we persist 300 million records? What is the best option to persist the data: the Databricks Hive metastore, Azure storage, or a Delta table? What are the limitations of Databricks Delta tables in terms of data volume? We have a use case where testers should be...

Latest Reply by -werners- (Esteemed Contributor III)
  • 2 kudos

You can certainly store 300 million records without any problem. The best option kinda depends on the use case. If you want to do a lot of online querying on the table, I suggest using Delta Lake, which is optimized (using Z-order, bloom filters, par...
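A minimal sketch of the layout techniques the reply mentions, using hypothetical table and column names:

```python
# Create a partitioned Delta table, then cluster its files with Z-ordering so
# selective queries can skip most data files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        id BIGINT, user_id BIGINT, event_date DATE, payload STRING
    ) USING DELTA
    PARTITIONED BY (event_date)
""")

# Co-locate rows with similar user_id values in the same files (data skipping).
spark.sql("OPTIMIZE events ZORDER BY (user_id)")
```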

6 More Replies
by AzureDatabricks, New Contributor III
  • 4251 Views
  • 8 replies
  • 4 kudos

Resolved! Need to see all the records in a Delta table. Exception: java.lang.OutOfMemoryError: GC overhead limit exceeded

truncate=False is not working on the Delta table: df_delta.show(df_delta.count(), False). Compute size: Single Node, Standard_F4S (8 GB memory, 4 cores). How much data can we persist in a Delta table as Parquet files, and how fast can we retrieve it?
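For context: show(df.count(), False) pulls every row through the driver, which is what exhausts an 8 GB single node. A minimal sketch of bounded alternatives, with a hypothetical table path and columns:

```python
df_delta = spark.read.format("delta").load("/mnt/delta/my_table")  # assumed path

# Inspect a bounded number of rows instead of all of them.
df_delta.show(100, truncate=False)

# If testers need ad hoc access to everything, expose a view and let them
# query slices with SQL rather than printing the full table at once.
df_delta.createOrReplaceTempView("my_table_v")
spark.sql(
    "SELECT * FROM my_table_v WHERE event_date = '2022-01-01'"  # assumed column
).show(truncate=False)
```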

Latest Reply by AzureDatabricks (New Contributor III)
  • 4 kudos

thank you !!!

7 More Replies
by Jreco, Contributor
  • 7772 Views
  • 14 replies
  • 3 kudos

Event Hub streaming: improve processing rate

Hi all, I'm working with Event Hubs and Databricks to process and enrich data in real time. Doing a "simple" test, I'm getting some weird values (input rate vs. processing rate) and I think I'm losing data. As you can see, there is a peak with 5k record...

[attached screenshots: input rate vs. processing rate graphs]
Latest Reply by jose_gonzalez (Moderator)
  • 3 kudos

Hi @Jhonatan Reyes, how many Event Hubs partitions are you reading from? Your micro-batch takes a few milliseconds to complete, which I think is a good time, but I would like to understand better what you are trying to improve here. Also, in this case ...
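For readers following along: with the azure-event-hubs-spark connector, the hub's partition count and the per-trigger event cap are the usual knobs for balancing input rate against processing rate. A minimal sketch, assuming a Databricks notebook where spark and sc are predefined; the connection string and numbers are placeholders:

```python
# Placeholder connection string; never hard-code real credentials.
conn = "Endpoint=sb://<namespace>.servicebus.windows.net/;EntityPath=<hub>;..."

eh_conf = {
    # The connector expects the connection string to be encrypted.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn),
    # Cap events read per micro-batch so input spikes are smoothed out
    # (option name per the connector docs; verify against your version).
    "maxEventsPerTrigger": 5000,
}

raw = (
    spark.readStream
    .format("eventhubs")
    .options(**eh_conf)
    .load()
)
# By default each Event Hubs partition maps to one Spark partition, so the
# hub's partition count bounds read parallelism.
```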

13 More Replies