Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Harsh1
by New Contributor II
  • 959 Views
  • 2 replies
  • 1 kudos

Query on DBFS migration

We are performing a DBFS migration. The legacy workspace has a folder 'user' in the DBFS root that holds 5.8 TB of data. We ran AWS CLI sync/cp from Legacy to Target, and then ran the same from the Target bucket to the Target DBFS. While implemen...

Latest Reply
Harsh1
New Contributor II
  • 1 kudos

Thanks for the quick response. Regarding the suggested AWS data sync approach: we have tried data sync in multiple ways, but it creates folders in the S3 bucket itself, not on DBFS, and our task is to copy from the bucket to DBFS. It seems that it only supports ...

1 More Replies
_Orc
by New Contributor
  • 12815 Views
  • 6 replies
  • 3 kudos

Resolved! Precision and scale is getting changed in the dataframe while casting to decimal

When I run the query below in Databricks SQL, the precision and scale of the decimal column change. Select typeof(COALESCE(Cast(3.45 as decimal(15,6)),0)); Output: decimal(16,6). Expected output: decimal(15,6). Any reason why the Precision and scale i...

Latest Reply
berserkersap
Contributor
  • 3 kudos

You can use typeof(COALESCE(Cast(3.45 as decimal(15,6)),0.0)); (instead of 0)
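
The difference between 0 and 0.0 follows from how two decimal types are widened to a common type. Below is a minimal plain-Python sketch of that widening rule, assuming (as in Spark SQL) that the integer literal 0 is treated as decimal(10,0) and the literal 0.0 as decimal(1,1). The function name wider_decimal is hypothetical and only illustrates the arithmetic; it is not a Spark API.

```python
def wider_decimal(p1, s1, p2, s2, max_precision=38):
    """Return (precision, scale) wide enough to hold both decimal types."""
    scale = max(s1, s2)                 # keep the larger number of fractional digits
    int_digits = max(p1 - s1, p2 - s2)  # keep the larger number of integer digits
    precision = min(int_digits + scale, max_precision)
    return precision, min(scale, max_precision)


# COALESCE(decimal(15,6), 0): the int literal needs 10 integer digits
print(wider_decimal(15, 6, 10, 0))  # (16, 6)

# COALESCE(decimal(15,6), 0.0): the literal needs no integer digits
print(wider_decimal(15, 6, 1, 1))   # (15, 6)
```

Under these assumptions, the integer literal forces one extra integer digit, which is why the result becomes decimal(16,6), while 0.0 fits inside decimal(15,6).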

5 More Replies
shan_chandra
by Esteemed Contributor
  • 3837 Views
  • 1 reply
  • 1 kudos

Resolved! Insert query fails with error "The query is not executed because it tries to launch ***** tasks in a single stage, while maximum allowed tasks one query can launch is 100000"

Py4JJavaError: An error occurred while calling o236.sql. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:201) at org.apache.spark.sql.execution.datasources.I...

Latest Reply
shan_chandra
Esteemed Contributor
  • 1 kudos

Could you please increase the config below (at the cluster level) to a higher value, or set it to zero: spark.databricks.queryWatchdog.maxQueryTasks 0. Setting this Spark config alleviates the issue.

Raymond_Garcia
by Contributor II
  • 2633 Views
  • 3 replies
  • 5 kudos

Resolved! Manipulate Column that is an array of objects

I have a column that is an array of objects; let's call it ARRAY. I would like to query and manipulate the element objects without using the explode function. As an example, for each element in that column I would like to create a path. .wit...

Latest Reply
Raymond_Garcia
Contributor II
  • 5 kudos

Hello, I am working with Scala, and I used something similar: def play(col: Column): Column = { concat_ws("", lit(imagePath), lit("/"), col("field1"), lit("/"), col("field2"), lit(".ext")) } val variable = spark.lot_of_stuff. .withColumn("...
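
The idea in this reply, building one path string per array element while keeping the array intact, can be illustrated outside Spark. This is a minimal plain-Python sketch, assuming each element is a record with field1 and field2 (names taken from the snippet) and that the image path and the ".ext" suffix are placeholders. build_paths is a hypothetical helper, not part of any Spark API.

```python
def build_paths(elements, image_path, ext=".ext"):
    """Map each element of one array value to a path string, preserving order and length."""
    return [f"{image_path}/{e['field1']}/{e['field2']}{ext}" for e in elements]


# One row's array value (made-up sample data)
row_array = [
    {"field1": "a", "field2": "1"},
    {"field1": "b", "field2": "2"},
]
print(build_paths(row_array, "/images"))  # ['/images/a/1.ext', '/images/b/2.ext']
```

In Spark itself, a per-element mapping like this is typically expressed with the higher-order function transform rather than explode, so each row keeps a single array value.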

2 More Replies
TS
by New Contributor III
  • 2700 Views
  • 3 replies
  • 3 kudos

Resolved! Turn spark.sql query into scala function

Hello, I'm learning Scala / Spark and trying to understand what's wrong with my function. I have a spark.sql query, stored in a variable: val uViewName = spark.sql(""" SELECT v.Data_View_Name FROM apoHierarchy AS h INNER JOIN apoView AS v ON h.View_N...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

Try adding .first()(0). It returns only the value from the first row and first column; currently you are returning a Dataset: var uViewName = spark.sql(s""" SELECT v.Data_View_Name FROM apoHierarchy AS h INNER JOIN apoView AS v ON h.View_Name = v.Context_View_N...

2 More Replies
alejandrofm
by Valued Contributor
  • 6982 Views
  • 11 replies
  • 1 kudos

Resolved! How can I view the query history, duration, etc for all users

Hi! I have some jobs that stay idle for some time when getting data from an S3 mount on DBFS. These are all SQL queries on Delta. How can I find out where the bottleneck is (duration, queue) to diagnose the slow Spark performance that I think is on the proc...

Latest Reply
alejandrofm
Valued Contributor
  • 1 kudos

We found out we were regenerating the symlink manifest for all the partitions in this case. And for some reason it was executed twice, at the start and end of the job: delta_table.generate('symlink_format_manifest'). We configured the table with: ALTER TABLE ...

10 More Replies
prasadvaze
by Valued Contributor II
  • 13433 Views
  • 14 replies
  • 12 kudos

Resolved! How to query delta lake using SQL desktop tools like SSMS or DBVisualizer

Is there a way to use SQL desktop tools? I ask because neither Delta OSS nor Databricks provides a desktop client (similar to Azure Data Studio) to browse and query Delta Lake objects. I currently use Databricks SQL, a web UI in the Databricks workspace, but se...

Latest Reply
prasadvaze
Valued Contributor II
  • 12 kudos

DSR is Delta Standalone Reader. See more here: https://docs.delta.io/latest/delta-standalone.html It's a crate (and also now a py library) that allows you to connect to delta tables without using spark (e.g. directly from python and not using pyspa...

13 More Replies
LukaszJ
by Contributor III
  • 7838 Views
  • 5 replies
  • 0 kudos

Resolved! Send UPDATE from Databricks to Azure SQL DataBase

Hello. I want to know how to do an UPDATE on Azure SQL Database from Azure Databricks using PySpark. I know how to run a SELECT query and turn it into a DataFrame, but how do I send some data back (as an UPDATE on rows)? I want to use built-in PySpark instead...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

This is discussed on Stack Overflow. As you can see, there is a way for Azure Synapse, but for a plain SQL database you will have to use some kind of driver such as ODBC/JDBC.

4 More Replies
Ian
by New Contributor III
  • 3256 Views
  • 6 replies
  • 0 kudos

Resolved! Databricks-Connect and Change Data Feed query error

I have installed Databricks-Connect (9.1 LTS). I am able to send queries to the cluster. However, when the query includes a call to the 'table_changes' function that is a part of Change Data Feed, I get the following error:AnalysisException("could ...

Latest Reply
Ian
New Contributor III
  • 0 kudos

Hi @Kaniz Fatma, the table_changes function is an internal Databricks function used in Change Data Feed (CDF). Please refer to the article below. It discusses the table_changes function. https://docs.databricks.com/delta/delta-change-data-feed.html

5 More Replies
Soma
by Valued Contributor
  • 1455 Views
  • 4 replies
  • 2 kudos

Resolved! Query RestAPI end point in Databricks Standard Workspace

Is there an option to query a Delta table using a Standard Workspace as an endpoint instead of JDBC?

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@somanath Sankaran - Would you be happy to mark @Hubert Dudek's answer as best if it resolved the problem? That helps other members who are searching for answers find the solution more quickly.

3 More Replies
omsas
by New Contributor
  • 1930 Views
  • 2 replies
  • 0 kudos

How to add Columns for Automatic Fill on Pandas Python

1. I have data x. I would like to create a new column whose values are 1, 2 or 3.
2. The name of the column is SHIFT. This SHIFT column will be filled automatically if the TIME_CREATED column meets the conditions.
3. the conditi...

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

You can do something like this in pandas. Note there could be a more performant way to do this too. import pandas as pd import numpy as np df = pd.DataFrame({'a':[1,2,3,4]}) df.head() > a > 0 1 > 1 2 > 2 3 > 3 4 conditions = [(df['a'] <=2...
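
The condition-based fill described in the question can also be sketched without any library. This is a minimal example using only the standard library, assuming purely hypothetical shift boundaries (06:00 to 13:59 is shift 1, 14:00 to 21:59 is shift 2, everything else is shift 3), since the real conditions are cut off in the post. shift_for is a made-up helper name.

```python
from datetime import datetime


def shift_for(ts):
    """Return shift 1, 2 or 3 from the hour of a timestamp (hypothetical boundaries)."""
    hour = ts.hour
    if 6 <= hour < 14:
        return 1
    if 14 <= hour < 22:
        return 2
    return 3  # night hours: 22:00 through 05:59


# Made-up sample rows with a TIME_CREATED value each
rows = [
    {"TIME_CREATED": datetime(2022, 1, 1, 7, 30)},
    {"TIME_CREATED": datetime(2022, 1, 1, 15, 0)},
    {"TIME_CREATED": datetime(2022, 1, 1, 23, 45)},
]

# Fill the SHIFT column from TIME_CREATED
for row in rows:
    row["SHIFT"] = shift_for(row["TIME_CREATED"])

print([row["SHIFT"] for row in rows])  # [1, 2, 3]
```

With a pandas DataFrame, the same function could be applied to the TIME_CREATED column through Series.apply, or the three conditions could be passed to numpy's select as the reply suggests.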

1 More Replies
Sam
by New Contributor III
  • 2528 Views
  • 2 replies
  • 1 kudos

Resolved! Query Pushdown in Snowflake

Hi, I am wondering what documentation exists on Query Pushdown in Snowflake. I noticed that a single function (monotonically_increasing_id()) prevented the entire query from being pushed down to Snowflake during an ETL process. Is Pushdown coming from the S...

Latest Reply
siddhathPanchal
New Contributor III
  • 1 kudos

Hi Sam, The Spark Connector applies predicate and query pushdown by capturing and analyzing the Spark logical plans for SQL operations. When the data source is Snowflake, the operations are translated into a SQL query and then executed in Snowflake to...

1 More Replies
Anonymous
by Not applicable
  • 633 Views
  • 1 reply
  • 0 kudos

Photon usage

How do I know how much of a query/job used Photon?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

If you are using Photon on Databricks SQL:
1. Click the Query History icon on the sidebar.
2. Click the line containing the query you'd like to analyze.
3. On the Query Details pop-up, click Execution Details.
4. Look at the Task Time in Photon metric at the bottom.
