Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

mikeagicman
by New Contributor
  • 917 Views
  • 1 reply
  • 0 kudos

Handling Unknown Fields in DLT Pipeline

HiI'm working on a DLT pipeline where I read JSON files stored in S3.I'm using the auto loader to identify the file schema and adding schema hints for some fields to specify their type.When running it against a single data file that contains addition...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @mikeagicman, When you encounter the error message 'terminated with exception: [UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_RECORD_WITH_FILE_PATH] Encountered unknown fields during parsing.', it means that the data file contains fields that are not defi...
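A minimal sketch of the pattern under discussion, assuming a hypothetical S3 path and hint columns: an Auto Loader source in a DLT table with schema hints, plus cloudFiles.schemaEvolutionMode set to addNewColumns so new fields are picked up on the next update instead of failing the pipeline permanently.

```python
# Hedged sketch: the path and hinted columns below are placeholders.
import dlt

@dlt.table
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Pin the types you care about; the remaining fields are inferred.
        .option("cloudFiles.schemaHints", "event_time TIMESTAMP, amount DOUBLE")
        # On new fields the stream stops once, then ingests them after the restart.
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .load("s3://my-bucket/raw/events/")
    )
```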

939772
by New Contributor III
  • 1155 Views
  • 1 reply
  • 0 kudos

Resolved! DLT refresh unexpectedly failing

We're hitting an error with a delta live table refresh since yesterday; nothing has changed in our system yet there appears to be a configuration error: { ... "timestamp": "2024-04-08T23:00:10.630Z", "message": "Update b60485 is FAILED.",...

Latest Reply
939772
New Contributor III
  • 0 kudos

Apparently the `custom_tags` of `ResourceClass` is now extraneous -- removing it from config corrected our problem.

brian_zavareh
by New Contributor III
  • 4228 Views
  • 5 replies
  • 4 kudos

Resolved! Optimizing Delta Live Table Ingestion Performance for Large JSON Datasets

I'm currently facing challenges with optimizing the performance of a Delta Live Table pipeline in Azure Databricks. The task involves ingesting over 10 TB of raw JSON log files from an Azure Data Lake Storage account into a bronze Delta Live Table la...

Data Engineering
autoloader
bigdata
delta-live-tables
json
Latest Reply
standup1
Contributor
  • 4 kudos

Hey @brian_zavareh, see this document; I hope it can help: https://learn.microsoft.com/en-us/azure/databricks/compute/cluster-config-best-practices Just keep in mind that there's some extra cost from the Azure VM side; check your Azure Cost Analysis for...
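Alongside the cluster-sizing advice above, one common lever for very large JSON backfills is bounding how much data each Auto Loader micro-batch ingests. A minimal sketch, assuming a hypothetical ADLS path and limits you would tune to your own data:

```python
# Hedged sketch: the storage path and trigger limits below are placeholders.
import dlt

@dlt.table
def bronze_logs():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.maxBytesPerTrigger", "10g")  # cap each micro-batch by bytes
        .option("cloudFiles.maxFilesPerTrigger", 1000)   # and by file count
        .load("abfss://raw@mystorageaccount.dfs.core.windows.net/logs/")
    )
```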

4 More Replies
standup1
by Contributor
  • 1760 Views
  • 2 replies
  • 0 kudos

Recover a deleted DLT pipeline

Hello, does anyone know how to recover a deleted DLT pipeline, or at least recover the deleted tables that were managed by the DLT pipeline? We have a pipeline that stopped working and was throwing all kinds of errors, so we decided to create a new one and de...

Latest Reply
standup1
Contributor
  • 0 kudos

Thank you, Kaniz. Just to confirm that I understood you correctly: if the pipeline is deleted (like in our case) without version control, backup configuration, etc. already implemented, there's no way to recover those tables, nor the pipeline. ...

1 More Reply
Shas_DataE
by New Contributor II
  • 1568 Views
  • 2 replies
  • 0 kudos

Alerts and Dashboard

Hi Team, in my Databricks workspace I have created an alert using a query scheduled to run on a daily basis, with the results populated to a dashboard. The results from the dashboard are notified via email, but I am seeing re...

Latest Reply
Ayushi_Suthar
Honored Contributor
  • 0 kudos

Hi @Shas_DataE, good day! Could you please check and confirm whether there are any special characters in the table column? At this moment, special characters are not compatible with Excel. If yes, then please drop the column that has that special character a...

1 More Reply
Kibour
by Contributor
  • 1888 Views
  • 3 replies
  • 1 kudos

Resolved! date_format 'LLLL' returns '1'

Hi all, in my notebook, when I run my cell with the following code: %sql select date_format(date '1970-01-01', "LLL"); I get '1', while I expect 'Jan' according to the doc https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html I would also expect t...

Latest Reply
Kibour
Contributor
  • 1 kudos

Hi @Kaniz_Fatma, turns out it was actually a Java 8 bug: IllegalArgumentException: Java 8 has a bug to support stand-alone form (3 or more 'L' or 'q' in the pattern string). Please use 'M' or 'Q' instead, or upgrade your Java version. For more details...
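For reference, the workaround mentioned here looks like this in a notebook cell: switching from the stand-alone 'LLL' form to 'MMM' avoids the Java 8 limitation on affected runtimes.

```python
# Returns 'Jan' instead of '1' on runtimes hit by the Java 8 stand-alone-form bug.
spark.sql("SELECT date_format(date '1970-01-01', 'MMM') AS month_name").show()
```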

2 More Replies
Kibour
by Contributor
  • 1674 Views
  • 1 reply
  • 0 kudos

Resolved! Trigger one workflow after completion of another workflow

Hi there, is it possible to trigger one workflow conditionally on the completion of another workflow? Typically, I would like my workflow W2 to start automatically once workflow W1 has successfully completed. Thanks in advance for your ins...

Latest Reply
Kibour
Contributor
  • 0 kudos

Found it: you build a new workflow where you connect W1 and W2 (each as a Run Job).
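One way to express that wrapper job programmatically is with the Databricks SDK for Python; a minimal sketch, assuming hypothetical job IDs for W1 and W2:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import RunJobTask, Task, TaskDependency

w = WorkspaceClient()
w.jobs.create(
    name="W1-then-W2",  # wrapper workflow
    tasks=[
        Task(task_key="run_w1", run_job_task=RunJobTask(job_id=111)),  # hypothetical ID for W1
        Task(
            task_key="run_w2",
            run_job_task=RunJobTask(job_id=222),                       # hypothetical ID for W2
            depends_on=[TaskDependency(task_key="run_w1")],            # W2 waits for W1 to succeed
        ),
    ],
)
```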

Braxx
by Contributor II
  • 7611 Views
  • 6 replies
  • 2 kudos

Resolved! issue with group by

I am trying to group a data frame by "PRODUCT" and "MARKET" and aggregate the remaining columns specified in col_list. There are many more columns in the list, but for simplification let's take the example below. Unfortunately I am getting the error: "TypeError:...

Latest Reply
Ralphma
New Contributor II
  • 2 kudos

The error you're encountering, "TypeError: unhashable type: 'Column'," is likely due to the way you're defining exprs. In Python, sets use curly braces {}, but they require their items to be hashable. Since the result of sum(x).alias(x) is not hashab...
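A minimal sketch of that fix, assuming a hypothetical DataFrame df and col_list: build exprs as a list (so hashability no longer matters) and unpack it into agg().

```python
from pyspark.sql.functions import sum as _sum

col_list = ["SALES", "UNITS"]                  # placeholder aggregation columns
exprs = [_sum(c).alias(c) for c in col_list]   # a list comprehension, not a {...} set
result = df.groupBy("PRODUCT", "MARKET").agg(*exprs)
```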

5 More Replies
ADBQueries
by New Contributor
  • 1769 Views
  • 2 replies
  • 0 kudos

DBEAVER Connection to Sql Warehouse in Databricks

I'm trying to connect to a SQL warehouse in Azure Databricks with the DBeaver application. I'm creating a JDBC connection string as mentioned here: https://docs.databricks.com/en/integrations/jdbc/authentication.html Here is a sample connection link I have c...

Latest Reply
Ayushi_Suthar
Honored Contributor
  • 0 kudos

Hi @ADBQueries , Good Day!  Could you please try running the code again to generate another access token and, once generated, check it on this page, https://jwt.ms, to confirm that the token has not expired? Also, if not done yet, please review the f...
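If pasting the token into jwt.ms is inconvenient, the expiry claim can also be read locally. A minimal sketch that only decodes the JWT payload (no signature verification), assuming the token is already in a variable:

```python
import base64
import json
import time

def token_is_expired(access_token: str) -> bool:
    # The middle JWT segment is base64url-encoded JSON containing the 'exp' claim.
    payload_b64 = access_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore any stripped padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"] < time.time()
```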

1 More Reply
acagatayyilmaz
by New Contributor
  • 1691 Views
  • 1 reply
  • 0 kudos

How to find consumed DBU

Hi All, I'm trying to understand my Databricks consumption in order to purchase a reservation. However, I couldn't find the consumed DBU in either the Azure Portal or the Databricks workspace. I'm also exporting and processing Azure Cost data daily. When I check the reso...

Latest Reply
Ayushi_Suthar
Honored Contributor
  • 0 kudos

Hi @acagatayyilmaz , Hope you are doing well!  You can refer to the Billable usage system table to find the records of consumed DBU. You can go through the below document to understand more about the System tables:  https://learn.microsoft.com/en-us/...
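As a starting point against the billable usage table, a minimal sketch, assuming system tables are enabled in the workspace and using the documented system.billing.usage columns:

```python
# Hedged sketch: aggregates DBU consumption per day and SKU from the billing system table.
daily_dbu = spark.sql("""
    SELECT usage_date,
           sku_name,
           SUM(usage_quantity) AS dbus_consumed
    FROM system.billing.usage
    GROUP BY usage_date, sku_name
    ORDER BY usage_date DESC
""")
display(daily_dbu)
```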

vanepet
by New Contributor II
  • 14150 Views
  • 5 replies
  • 2 kudos

Is it possible to use multiprocessing or threads to submit multiple queries to a database from Databricks in parallel?

We are trying to improve our overall runtime by running queries in parallel using either multiprocessing or threads. What I am seeing, though, is that when the function that runs this code is run on a separate process, it doesn't return a DataFrame with...

Latest Reply
BapsDBS
New Contributor II
  • 2 kudos

Thanks for the links mentioned above, but both of them use raw Python to achieve parallelism. Does this mean Spark (read: PySpark) has no provisions of its own for parallel execution of functions or even notebooks? We used a wrapper notebook with ThreadP...
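For context, the driver-side pattern the thread converges on looks roughly like this. A minimal sketch with hypothetical table names, where each thread submits its own independent Spark job:

```python
from concurrent.futures import ThreadPoolExecutor

queries = {
    "orders": "SELECT COUNT(*) AS n FROM samples.tpch.orders",      # placeholder queries
    "lineitem": "SELECT COUNT(*) AS n FROM samples.tpch.lineitem",
}

def run_query(sql_text):
    # spark.sql can be called from multiple driver threads; each call becomes its own Spark job.
    return spark.sql(sql_text).collect()

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {name: pool.submit(run_query, q) for name, q in queries.items()}
    results = {name: fut.result() for name, fut in futures.items()}
```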

4 More Replies
RIDBX
by New Contributor II
  • 1751 Views
  • 2 replies
  • 0 kudos

What is the best way to handle a huge gzipped file dropped to S3?

What is the best way to handle a huge gzipped file dropped to S3? I find some interesting suggestions for posted questions. Thanks for reviewing my threads. Here is the situation we have. We are getting dat...

Data Engineering
bulkload
S3
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @RIDBX, one approach is to avoid using DataFrames and instead use RDDs (Resilient Distributed Datasets) for repartitioning. Read the gzipped files as RDDs, repartition them into smaller partitions, and save them in a splittable format (e.g., Snapp...
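A minimal sketch of that approach, assuming hypothetical S3 paths: because a .gz file is not splittable, Spark reads it into a single partition, so the repartition step is what restores parallelism before writing out a splittable compressed format.

```python
raw = sc.textFile("s3://my-bucket/raw/huge_file.json.gz")   # gzip lands in one partition
repartitioned = raw.repartition(200)                        # spread rows across the cluster
repartitioned.saveAsTextFile(
    "s3://my-bucket/staging/huge_file_snappy",
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec",
)
```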

1 More Reply
zerodarkzone
by New Contributor III
  • 1338 Views
  • 2 replies
  • 1 kudos

Cannot create vnet peering on Azure Databricks

Hi, I'm trying to create a VNET peering to SAP HANA using the default VNET created by Databricks, but it is not possible. I'm getting the following error: The virtual network peering "PeeringSAP" could not be added to "workers-vnet". Error: ...

Data Engineering
Azure Databricks
peering
vnet
Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @zerodarkzone,  Ensure that the user has the necessary permissions to manage network resources. Specifically, they should have the permission to perform the action "Microsoft.Network/virtualNetworks/virtualNetworkPeerings/write" within the scope o...

1 More Reply
jx1226
by New Contributor II
  • 1765 Views
  • 2 replies
  • 0 kudos

Connect Workspace EnableNoPublicIP=No and VnetInject=No to storage account with Private Endpoint.

We know that Databricks with VNET injection (our own VNET) allows us to connect to blob storage / ADLS Gen2 over private endpoints and peering. This is what we typically do. We have a client who created Databricks with EnableNoPublicIP=No (secure clust...

Latest Reply
User16539034020
Contributor II
  • 0 kudos

Hello, thanks for contacting Databricks Support. You need to enable EnableNoPublicIP; otherwise, you will get the error message "cannot be deployed on subnet containing Basic SKU Public IP addresses or Basic SKU Load Balancer. NIC". It was usually t...

1 More Reply
Spenyo
by New Contributor II
  • 944 Views
  • 1 reply
  • 1 kudos

Resolved! Delta table size not shrinking after Vacuum

Hi team. Once a day we overwrite the last X months of data in our tables, so every day it generates a larger amount of history. We don't use time travel, so we don't need it. What we have done: SET spark.databricks.delta.retentionDurationCheck.enabled = false ALT...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @Spenyo, consider increasing the retention duration if you need to retain historical data for longer periods. If you’re not using time travel, you can set a retention interval of at least 7 days to strike a balance between history retention and st...
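A minimal sketch of shortening the retention windows and then vacuuming, assuming a hypothetical table name and that time travel really is not needed beyond 7 days:

```python
spark.sql("""
    ALTER TABLE my_catalog.my_schema.my_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 7 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")
# RETAIN 168 HOURS == 7 days; data files older than this are physically deleted.
spark.sql("VACUUM my_catalog.my_schema.my_table RETAIN 168 HOURS")
```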

