Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

my_super_name
by New Contributor II
  • 2747 Views
  • 2 replies
  • 2 kudos

Auto Loader Schema Hint Behavior: Addressing Nested Field Errors

Hello, I'm using Auto Loader to stream a table of data and have added schema hints to specify field values. I've observed that when my initial data file is missing fields specified in the schema hint, Auto Loader correctly identifies this and ad...

Latest Reply
Mathias_Peters
Contributor II
  • 2 kudos

Hi, we are having similar issues with schema hints formulated in fully qualified DDL, e.g. "a STRUCT<b INT>" etc. Did you find a solution? Also, did you specify the schema hint using the dot-notation, e.g. "a.b INT" before ingesting any data or after...
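For reference, a minimal sketch of the two hint styles being compared, assuming a JSON Auto Loader stream (the paths and field names here are hypothetical):

```python
# Hedged sketch: dot-notation vs. DDL-style schema hints for a nested field.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Dot-notation hint for the nested field a.b:
      .option("cloudFiles.schemaHints", "a.b INT")
      # Alternatively, the fully qualified DDL form: "a STRUCT<b INT>"
      .option("cloudFiles.schemaLocation", "/tmp/_schemas/events")  # hypothetical path
      .load("/tmp/events"))                                         # hypothetical path
```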

1 More Replies
anshi_t_k
by New Contributor III
  • 1342 Views
  • 4 replies
  • 0 kudos

Practice question for data engineer exam

A data engineer, User A, has promoted a pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both user...

Latest Reply
rakeshdey
New Contributor II
  • 0 kudos

The answer should be B: when you retrieve job run information, creator_user_email is always populated as the 'Run As' identity in the workflow, i.e. the credential used to trigger the job. If you fetch the workflow info through the REST API, then answer A is correct.
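For illustration, a hedged sketch of pulling both pieces of information with the Databricks Python SDK (the run_id is hypothetical, and field availability can vary by API version):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
run = w.jobs.get_run(run_id=1234)      # GET /api/2.1/jobs/runs/get
print(run.creator_user_name)           # who created/triggered the run
job = w.jobs.get(job_id=run.job_id)    # GET /api/2.1/jobs/get
print(job.run_as_user_name)            # the job's "Run As" identity
```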

3 More Replies
Karthik_2
by New Contributor
  • 992 Views
  • 1 reply
  • 0 kudos

Query on SQL Warehouse Concurrency in Azure Databricks

Hi, we are planning to migrate the backend of our web application, currently hosted on App Service with an Azure SQL Database, to Azure Databricks as the data source. For this, we intend to use the SQL Warehouse in Databricks to execute queries and in...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

Hello Karthik, many thanks for your question. Databricks SQL Warehouses use dynamic concurrency to handle varying demands. Unlike static-capacity warehouses, Databricks SQL adjusts compute resources in real time to manage concurrent loads and maximiz...

tseader
by New Contributor III
  • 2508 Views
  • 3 replies
  • 1 kudos

Resolved! Python SDK clusters.create_and_wait - Sourcing from cluster-create JSON

I am attempting to create a compute cluster using the Python SDK while sourcing a cluster-create configuration JSON file, which is how it's done for the databricks-cli and what Databricks provides through the GUI. Reading in the JSON as a dict fails...

Latest Reply
tseader
New Contributor III
  • 1 kudos

@Retired_mod The structure of the `cluster-create.json` is perfectly fine. The issue, as stated above, is that the SDK does not allow nested structures from the JSON file to be used directly; instead they need to be cast to spec...
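A sketch of the workaround described above, assuming the databricks-sdk package (the file name and JSON keys are illustrative):

```python
import json

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale

w = WorkspaceClient()
with open("cluster-create.json") as f:   # hypothetical file
    cfg = json.load(f)

# Top-level scalars pass straight through; nested sections such as autoscale
# must be cast to their typed dataclasses rather than left as raw dicts.
w.clusters.create_and_wait(
    cluster_name=cfg["cluster_name"],
    spark_version=cfg["spark_version"],
    node_type_id=cfg["node_type_id"],
    autoscale=AutoScale(**cfg["autoscale"]),
)
```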

2 More Replies
praful
by New Contributor II
  • 2968 Views
  • 5 replies
  • 1 kudos

Recover Lost Notebook

Hi Team, I was using Databricks Community Edition for learning purposes. I had an account https://community.cloud.databricks.com/?o=6822095545287159 where I stored all my learning notebooks. Unfortunately, this account suddenly stopped working, and I ...

Latest Reply
Walter_C
Databricks Employee
  • 1 kudos

The workspace ID you have shared seems to belong to a workspace which is still in a running state. If you have lost login access to this workspace, the team you have reached over email will be able to assist. I will add the following doc for s...

4 More Replies
minhhung0507
by Valued Contributor
  • 1086 Views
  • 7 replies
  • 4 kudos

Resolved! How to reduce cost of "Regional Standard Class A Operations"

Hi Databricks experts, we're experiencing unexpectedly high costs from Regional Standard Class A Operations in GCS while running a Databricks pipeline. The costs seem related to frequent metadata queries, possibly tied to Delta table operations. In las...

Latest Reply
VZLA
Databricks Employee
  • 4 kudos

@minhhung0507 it's hard to say without having more direct insight, but generally speaking, many streaming jobs with very frequent intervals will likely contribute; 300 jobs triggered continuously will also contribute depending on the use case of these j...
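As a hedged illustration of that point, lowering trigger frequency is one common way to cut the per-micro-batch list/get calls that count as Class A operations (the table and checkpoint paths are hypothetical):

```python
df = spark.readStream.table("raw.events")            # hypothetical source
(df.writeStream
   .trigger(processingTime="10 minutes")             # fewer micro-batches, fewer GCS calls
   # .trigger(availableNow=True)                     # or: drain available data, then stop
   .option("checkpointLocation", "gs://my-bucket/chk/events")
   .toTable("bronze.events"))
```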

6 More Replies
143260
by New Contributor
  • 10737 Views
  • 2 replies
  • 1 kudos

Convert SQL Query to Dataframe

Hello, being relatively new to the Databricks world, I'm hoping someone can show me how to take a SQL query and put the results into a dataframe. As part of a data validation project, I'd like to cross join two dataframes.

Latest Reply
Antoine_B
Contributor
  • 1 kudos

From a PySpark notebook, you could do: df = spark.sql("SELECT * FROM my_table WHERE ..."). Then you can use this df and cross join it to another DataFrame. If you are new to Databricks, I suggest you follow some of the self-paced lessons in Databric...
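Expanded into a runnable sketch (the table names are hypothetical):

```python
# Run SQL into a DataFrame, then cross join it for the validation step.
df = spark.sql("SELECT * FROM my_table WHERE load_date = '2024-01-01'")
other = spark.table("reference_table")
validated = df.crossJoin(other)   # Cartesian product of the two DataFrames
display(validated)
```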

1 More Replies
zed
by New Contributor III
  • 1254 Views
  • 6 replies
  • 0 kudos

Resolved! ConcurrentAppendException in Feature Engineering write_table

I am using the Feature Engineering client when writing to a time series feature table. I have created two Databricks jobs with the code below. I am running them with different run_dates (e.g. '2016-01-07' and '2016-01-08'). When they run concurrently,...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@zed Clustering by your date column can indeed help avoid the ConcurrentAppendException without incurring the strict partitioning constraints that a “time series feature table” normally disallows. Unlike partitioning, CLUSTER BY does not create physi...
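A sketch of that suggestion (table and column names are hypothetical): clustering the feature table by the run date keeps the two jobs' concurrent appends from conflicting.

```python
spark.sql("""
    ALTER TABLE fs.user_features
    CLUSTER BY (run_date)
""")
spark.sql("OPTIMIZE fs.user_features")   # recluster existing files
```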

5 More Replies
Einsatz
by New Contributor II
  • 1392 Views
  • 4 replies
  • 2 kudos

Resolved! Photon-enabled UC cluster has less executor memory (1/4th) compared to a normal cluster.

I have a Unity Catalog enabled cluster with node type Standard_DS4_v2 (28 GB Memory, 8 Cores). When the "Use Photon Acceleration" option is disabled, spark.executor.memory is 18409m. But if I enable Photon Acceleration, it shows spark.executor.memory as 46...

Latest Reply
Walter_C
Databricks Employee
  • 2 kudos

The memory allocated to the Photon engine is not fixed; it is based on a percentage of the node’s total memory. To calculate the value of spark.executor.memory based on a specific node type, you can use the following formula: container_size = (vm_si...
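A quick way to observe the difference from a notebook; the values quoted in the thread (e.g. 18409m) are specific to Standard_DS4_v2:

```python
# Run on two otherwise-identical clusters, Photon off vs. on.
print(spark.conf.get("spark.executor.memory"))
# Photon reserves part of the container for its native engine, so the JVM
# executor heap reported here is smaller on the Photon cluster.
```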

3 More Replies
TejeshS
by New Contributor III
  • 1596 Views
  • 1 reply
  • 1 kudos

How to identify which columns we need to consider for liquid clustering from a table of 200+ columns

In Databricks, when working with a table that has a large number of columns (e.g., 200), it can be challenging to determine which columns are most important for liquid clustering. Objective: The goal is to determine which columns to select based on th...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

Hi @TejeshS, Thanks for your post! To determine which columns are most important for liquid clustering in a table with a large number of columns, you should focus on the columns that are most frequently used in query filters and those that can signif...
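Once candidate columns are identified from query filters, a hedged sketch of applying and verifying them (the names are hypothetical):

```python
spark.sql("ALTER TABLE sales.wide_table CLUSTER BY (order_date, customer_id)")
(spark.sql("DESCRIBE DETAIL sales.wide_table")
      .select("clusteringColumns")      # confirm the keys took effect
      .show(truncate=False))
```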

guiferviz
by New Contributor III
  • 2256 Views
  • 7 replies
  • 3 kudos

Resolved! How to Determine if Materialized View is Performing Full or Incremental Refresh?

I'm currently testing materialized views and I need some help understanding the refresh behavior. Specifically, I want to know if my materialized view is querying the full table (performing a full refresh) or just doing an incremental refresh. From so...

Latest Reply
TejeshS
New Contributor III
  • 3 kudos

To validate the status of your materialized view (MV) refresh, run a DESCRIBE EXTENDED command and check the row corresponding to the "last refresh status type". RECOMPUTE indicates a full load execution was completed. NO_OPERATION means no operation w...
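A sketch of that check (the MV name is hypothetical, and the exact row label may vary by release):

```python
rows = spark.sql("DESCRIBE EXTENDED main.my_schema.my_mv").collect()
for r in rows:
    # Surface the refresh-related rows, e.g. the last refresh status type.
    if "refresh" in (r.col_name or "").lower():
        print(r.col_name, "->", r.data_type)
```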

6 More Replies
PiotrM
by New Contributor III
  • 662 Views
  • 2 replies
  • 0 kudos

Canceling long-running queries on UC-enabled all-purpose clusters

Hey, as in the subject. Is it possible to set a timeout for long-running queries on all-purpose clusters that are UC enabled? I know there is such a setting for SQL Warehouses and Workflows, but I was unable to find one for all-purpose clusters. The issu...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@PiotrM thanks for your question! Adding to @Alberto_Umana's comment, could you please clarify what you mean by "I tried things like spark.task.reaper.killTimeout, but it seems like UC clusters won't accept it"? Is it throwing an error or is it ...

1 More Replies
berserkersap
by Contributor
  • 6571 Views
  • 4 replies
  • 1 kudos

Speed Up JDBC Write from Databricks Notebook to MS SQL Server

Hello everyone, I have a use case where I need to write a Delta table from Databricks to a SQL Server table using PySpark / Python / Spark SQL. The Delta table I am writing contains around 3 million records and the SQL Server table is neither partitione...

Data Engineering
JDBC
MS SQL Server
pyspark
Table Write
Latest Reply
VZLA
Databricks Employee
  • 1 kudos

@berserkersap have you had time to identify where the bottleneck is? E.g. sequential writes, network latency/throughput, or maybe a connection pool in the target much smaller than the number of connection threads in the source?
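For context, a hedged sketch of the usual levers once the bottleneck is known (the connection details and secret scope are placeholders):

```python
delta_df = spark.table("my_catalog.my_schema.source_delta")   # ~3M rows

(delta_df
   .repartition(8)                      # 8 parallel JDBC connections
   .write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
   .option("dbtable", "dbo.target_table")
   .option("user", dbutils.secrets.get("jdbc-scope", "user"))
   .option("password", dbutils.secrets.get("jdbc-scope", "password"))
   .option("batchsize", 10000)          # rows per network round trip
   .mode("append")
   .save())
```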

3 More Replies
guangyi
by Contributor III
  • 795 Views
  • 2 replies
  • 0 kudos

How to identify the mandatory fields of the create clusters API

After several attempts I found some mandatory fields for the cluster creation API: num_workers, spark_version, node_type_id. I'm not finding these fields directly against the API, but via the job cluster definition in the asset bundle YAML file. I ask the Chat...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@guangyi thanks for your question! I understand your concerns. Looking through the docs I could only find a few with the "required" metadata tag, while most seem to be implicitly assumed, e.g.: singleNode with num_workers 0, and similar requirements....
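To make the implicit contract concrete, a sketch of a minimal create call using only the fields found above (the name, runtime version, and node type are illustrative):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
info = w.clusters.create_and_wait(
    cluster_name="probe-minimal",        # hypothetical
    spark_version="15.4.x-scala2.12",    # assumed available in the workspace
    node_type_id="Standard_DS3_v2",
    num_workers=1,                       # or 0 with single-node settings
)
print(info.cluster_id)
```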

1 More Replies
vivek_cloudde
by New Contributor III
  • 2300 Views
  • 8 replies
  • 2 kudos

Resolved! Issue while creating an on-demand cluster in Azure Databricks using PySpark

Hello, I am trying to create an on-demand cluster in Azure Databricks using the code below, and I am getting the error message {"error_code":"INVALID_PARAMETER_VALUE","message":"Exactly 1 of virtual_cluster_size, num_workers or autoscale must be specified."...

Latest Reply
VZLA
Databricks Employee
  • 2 kudos

@vivek_cloudde I still find it interesting that for all these different misconfigurations or wrong cluster definitions you got the same error message, but anyway, happy to hear it worked! If it helps, next time and to make things simpler, ...
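For future readers, a sketch of the constraint behind that error message (the payloads are illustrative):

```python
# Exactly one of num_workers or autoscale may be set on clusters/create.
base = {"spark_version": "15.4.x-scala2.12", "node_type_id": "Standard_DS3_v2"}
ok_fixed = {**base, "num_workers": 2}
ok_auto  = {**base, "autoscale": {"min_workers": 1, "max_workers": 4}}
bad      = {**base, "num_workers": 2,
            "autoscale": {"min_workers": 1, "max_workers": 4}}
# POSTing `bad` yields INVALID_PARAMETER_VALUE: "Exactly 1 of
# virtual_cluster_size, num_workers or autoscale must be specified."
```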

7 More Replies
