Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

hkmodi
by New Contributor II
  • 430 Views
  • 3 replies
  • 0 kudos

Perform row_number() filter in autoloader

I have created an Auto Loader job that reads JSON data from S3 (files with no extension) using cloudFiles.format set to text. This job is supposed to run every 4 hours and read all the new data that has arrived. But before writing into a delta table...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @hkmodi, basically, as @daniel_sahal said, the bronze layer should reflect the source system. The silver layer is dedicated to deduplication/cleaning/enrichment of the dataset. If you still need to deduplicate at the bronze layer, you have 2 options: use me...

2 More Replies
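For context, a minimal sketch of the deduplication pattern discussed in this thread, keeping only the latest row per key with row_number() before (or after) landing in the silver layer. The table names and the file_key/event_ts columns are assumptions, not from the original post:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Read the bronze table that Auto Loader appends to (name is a placeholder)
    bronze_df = spark.read.table("bronze.events_raw")

    # Keep only the most recent record per business key (columns are placeholders)
    w = Window.partitionBy("file_key").orderBy(F.col("event_ts").desc())

    deduped_df = (
        bronze_df
        .withColumn("rn", F.row_number().over(w))
        .filter("rn = 1")
        .drop("rn")
    )

    deduped_df.write.mode("overwrite").saveAsTable("silver.events")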
vibhakar
by New Contributor
  • 4622 Views
  • 3 replies
  • 1 kudos

Not able to mount ADLS Gen2 in Databricks

py4j.security.Py4JSecurityException: Method public com.databricks.backend.daemon.dbutils.DBUtilsCore$Result com.databricks.backend.daemon.dbutils.DBUtilsCore.mount(java.lang.String,java.lang.String,java.lang.String,java.lang.String,java.util.Map) is ...

Latest Reply
cpradeep
New Contributor III
  • 1 kudos

Hi, have you sorted this issue? Can you please let me know the solution?

2 More Replies
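For reference, a sketch of the standard dbutils.fs.mount call for ADLS Gen2 with a service principal; all angle-bracket values are placeholders. Note that the Py4JSecurityException above typically appears when the cluster blocks mount() (for example with table access control or credential passthrough enabled), in which case this call is not permitted regardless of the options used:

    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/<mount-name>",
        extra_configs=configs,
    )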
fabien_arnaud
by New Contributor II
  • 983 Views
  • 6 replies
  • 0 kudos

Data shifted when a pyspark dataframe column only contains a comma

I have a dataframe containing several columns, one of which contains, for one specific record, just a comma and nothing else. When displaying the dataframe with the command display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838')), the data is dis...

Latest Reply
MilesMartinez
New Contributor II
  • 0 kudos

Thank you so much for the solution.

5 More Replies
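If the dataframe is read from a delimited text file (an assumption, since the thread does not say where df_input comes from), a column shift like this usually comes from quoting/escaping options on read. A sketch with a hypothetical path:

    # Make the reader respect quoted fields so a lone comma inside a value
    # is not treated as a field delimiter.
    df_input = (
        spark.read
        .option("header", "true")
        .option("quote", '"')
        .option("escape", '"')
        .option("multiLine", "true")
        .csv("/mnt/raw/vendors/")
    )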
oakhill
by New Contributor III
  • 229 Views
  • 1 reply
  • 0 kudos

How to optimize queries on a 150B table? ZORDER, LC or partitioning?

Hi! I am struggling to understand how to properly manage my table to make queries efficient. My table has columns date_time_utc, car_id, car_owner, etc. date_time_utc, car_id and position are usually the ZORDER or Liquid Clustering columns. Selecting max...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

1. According to Databricks, yes. But as always, I recommend performing benchmarks yourself. There are a lot of blog posts saying that it's not always the case. Yesterday I was at a data community event and the presenter did several benchmarks and ...

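As a rough illustration of the two options being compared, run via spark.sql; the clustering columns come from the question, while the table name car_positions is a placeholder:

    # Liquid Clustering: declare clustering keys once, then let OPTIMIZE maintain the layout.
    spark.sql("ALTER TABLE car_positions CLUSTER BY (date_time_utc, car_id)")
    spark.sql("OPTIMIZE car_positions")

    # Z-ORDER alternative (not combinable with liquid clustering on the same table):
    spark.sql("OPTIMIZE car_positions ZORDER BY (date_time_utc, car_id)")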
AlvaroCM
by New Contributor III
  • 640 Views
  • 2 replies
  • 0 kudos

Resolved! DLT error at validation

Hello, I'm creating a DLT pipeline with Databricks on AWS. After creating an external location for my bucket, I encountered the following error: DataPlaneException: [DLT ERROR CODE: CLUSTER_LAUNCH_FAILURE.CLIENT_ERROR] Failed to launch pipeline cluster...

Latest Reply
AlvaroCM
New Contributor III
  • 0 kudos

Hi! The error was related to the roles and permissions created when the workspace was set up. I reloaded the setup script in a new workspace, and it worked without problems. Hope it helps anyone in the future. Thanks!

1 More Replies
AntonDBUser
by New Contributor III
  • 448 Views
  • 1 reply
  • 1 kudos

Lakehouse Federation with OAuth connection to Snowflake

Hi! We have a lot of use cases where we need to load data from Snowflake into Databricks, where users are using both R and Python for further analysis and machine learning. For this we have been using Lakehouse Federation combined with basic auth, but are...

Latest Reply
AntonDBUser
New Contributor III
  • 1 kudos

For anyone interested: we solved this by building an OAuth integration to Snowflake ourselves using Entra ID: https://community.snowflake.com/s/article/External-oAuth-Token-Generation-using-Azure-AD. We also created some simple Python and R packages tha...

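A rough sketch of the approach described in the reply: acquire a client-credentials token from Entra ID and pass it to the Snowflake Python connector. The tenant, client, scope, account and warehouse values are placeholders, and the exact OAuth setup depends on the External OAuth configuration from the linked article:

    import requests
    import snowflake.connector

    # Placeholder Entra ID app registration values
    token_resp = requests.post(
        "https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token",
        data={
            "grant_type": "client_credentials",
            "client_id": "<client-id>",
            "client_secret": "<client-secret>",
            "scope": "<snowflake-application-id-uri>/.default",
        },
    )
    access_token = token_resp.json()["access_token"]

    # Pass the external OAuth token straight to the Snowflake connector
    conn = snowflake.connector.connect(
        account="<account-identifier>",
        user="<user>",
        authenticator="oauth",
        token=access_token,
        warehouse="<warehouse>",
    )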
JonHMDavis
by New Contributor II
  • 4873 Views
  • 5 replies
  • 2 kudos

Graphframes not importing on Databricks 9.1 LTS ML

Is Graphframes for python meant to be installed by default on Databricks 9.1 LTS ML? Previously I was running the attached python command on 7.3 LTS ML with no issue, however now I am getting "no module named graphframes" when trying to import the pa...

Latest Reply
malz
New Contributor II
  • 2 kudos

Hi @MuthuLakshmi, as per the documentation, graphframes comes preinstalled in the Databricks Runtime for Machine Learning, but when trying to import the Python module of graphframes I am getting a "no module found" error. from graphframes i...

4 More Replies
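If the module really is missing on the runtime, installing the Python bindings as a notebook-scoped library and building a small graph is a quick check. A sketch with made-up vertex/edge data (the JVM GraphFrames package still has to be available on the cluster):

    %pip install graphframes

    from graphframes import GraphFrame

    # Toy vertex and edge DataFrames just to confirm the import and JVM package work
    v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
    e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

    g = GraphFrame(v, e)
    g.inDegrees.show()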
naveenreddy1
by New Contributor II
  • 18310 Views
  • 4 replies
  • 0 kudos

Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. Driver stacktrace

We are using a Databricks 3-node cluster with 32 GB memory. It works fine, but sometimes it throws the error: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.

Latest Reply
RodrigoDe_Freit
New Contributor II
  • 0 kudos

If your job fails, follow this: according to https://docs.databricks.com/jobs.html#jar-job-tips: "Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run will be canceled and ma...

3 More Replies
him
by New Contributor III
  • 17371 Views
  • 10 replies
  • 7 kudos

I am getting the below error while making a GET request to a job in Databricks after successfully running it

"error_code": "INVALID_PARAMETER_VALUE",  "message": "Retrieving the output of runs with multiple tasks is not supported. Please retrieve the output of each individual task run instead."}

Latest Reply
SANKET
New Contributor II
  • 7 kudos

Use https://<databricks-instance>/api/2.1/jobs/runs/get?run_id=xxxx. "get-output" gives the details of a single task run ID, which is associated with a task and not the job.

9 More Replies
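A sketch of the pattern the reply describes: fetch the multi-task run with runs/get, then call runs/get-output for each task's own run_id. The host, token and run ID are placeholders:

    import requests

    HOST = "https://<databricks-instance>"
    HEADERS = {"Authorization": "Bearer <personal-access-token>"}

    # The parent run of a multi-task job
    run = requests.get(
        f"{HOST}/api/2.1/jobs/runs/get",
        headers=HEADERS,
        params={"run_id": 12345},
    ).json()

    # Each task has its own run_id; get-output only works on those, not the parent run
    for task in run.get("tasks", []):
        output = requests.get(
            f"{HOST}/api/2.1/jobs/runs/get-output",
            headers=HEADERS,
            params={"run_id": task["run_id"]},
        ).json()
        print(task["task_key"], output.get("notebook_output"))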
ArturOA
by New Contributor III
  • 1339 Views
  • 7 replies
  • 0 kudos

Attaching to Serverless from Azure Data Factory via Service Principal

Hi, we have issues trying to run Databricks notebooks orchestrated with Azure Data Factory. We have been doing this for a while now without any issues when we use Job Clusters, Existing General Purpose Clusters, or Cluster Pools. We use an Azure Data ...

Latest Reply
h_h_ak
Contributor
  • 0 kudos

Does the service principal have access and permissions for the notebook?

6 More Replies
HamidHamid_Mora
by New Contributor II
  • 3222 Views
  • 4 replies
  • 3 kudos

ganglia is unavailable on DBR 13.0

We created a library in Databricks to ingest Ganglia metrics for all jobs into our Delta tables. However, endpoint 8652 is no longer available on DBR 13.0. Is there any other endpoint available? We need to log all metrics for all executed jobs, not on...

Latest Reply
h_h_ak
Contributor
  • 3 kudos

You should have a look here: https://community.databricks.com/t5/data-engineering/azure-databricks-metrics-to-prometheus/td-p/71569

3 More Replies
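The linked thread discusses replacing the old Ganglia endpoint with Spark's built-in Prometheus support. Purely as a sketch, these are the standard Apache Spark settings that expose metrics over HTTP (shown as a Python dict for readability; they are set as cluster Spark configuration, and whether they cover everything Ganglia provided on DBR 13 needs verifying):

    spark_conf = {
        # Exposes executor metrics at /metrics/executors/prometheus on the driver's Spark UI
        "spark.ui.prometheus.enabled": "true",
        # Exposes metrics from the Spark metrics system via the Prometheus servlet sink
        "spark.metrics.conf.*.sink.prometheusServlet.class":
            "org.apache.spark.metrics.sink.PrometheusServlet",
        "spark.metrics.conf.*.sink.prometheusServlet.path": "/metrics/prometheus",
    }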
amanda3
by New Contributor II
  • 459 Views
  • 3 replies
  • 0 kudos

Flattening JSON while also keeping embedded types

I'm attempting to create DLT tables from a source table that includes a "data" column that is a JSON string. I'm doing something like this: sales_schema = StructType([ StructField("customer_id", IntegerType(), True), StructField("order_numbers",...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

To ensure that the "value" field retains its integer type, you can explicitly cast it after parsing the JSON. from pyspark.sql.functions import col, from_json, expr from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, LongTy...

2 More Replies
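A minimal sketch of the approach in the reply: parse the JSON string with an explicit schema so numeric fields keep their types. The column and field names follow the snippet in the question; the source dataframe name is assumed:

    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, LongType

    sales_schema = StructType([
        StructField("customer_id", IntegerType(), True),
        StructField("order_numbers", ArrayType(LongType()), True),
    ])

    parsed_df = (
        df.withColumn("parsed", from_json(col("data"), sales_schema))
          .select("parsed.customer_id", "parsed.order_numbers")
    )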
xhudik
by New Contributor III
  • 406 Views
  • 1 reply
  • 1 kudos

Resolved! does stream.stop() generates "ERROR: Query termination received for []" automatically?

Whenever code contains stream.stop(), I get an error like this in STDERR (in the cluster logs): ERROR: Query termination received for [id=b7e14d07-f8ad-4ae6-99de-8a7cbba89d86, runId=5c01fd71-2d93-48ca-a53c-5f46fab726ff]. No other message, even if I try to try-cat...

Latest Reply
MuthuLakshmi
Databricks Employee
  • 1 kudos

@xhudik Does stream.stop() generate "ERROR: Query termination received for []" automatically? Yes, this is generated whenever there is a stream.stop(), and it is written to stderr. Is "ERROR: Query termination received for []" dangerous, or is it just info that the stream was closed?...

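For context, a small sketch of the sequence that produces the message: stopping a structured streaming query triggers the termination log entry, which is informational rather than a failure. The rate/noop source and sink are just placeholders to have a running stream:

    # Start a throwaway stream (rate source, no-op sink)
    query = (
        spark.readStream.format("rate").option("rowsPerSecond", 1).load()
        .writeStream.format("noop").start()
    )

    # ... later: stop() ends the query, and the driver logs
    # "ERROR: Query termination received for [id=..., runId=...]" to stderr.
    query.stop()
    query.awaitTermination()  # returns once the query has fully shut down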
roberta_cereda
by New Contributor
  • 432 Views
  • 1 reply
  • 0 kudos

Describe history operationMetrics['materializeSourceTimeMs']

Hi, during some checks on MERGE execution, I was running the DESCRIBE HISTORY command and in the operationMetrics column I noticed this information: operationMetrics['materializeSourceTimeMs']. I haven't found that metric in the documentation, so I...

Latest Reply
MuthuLakshmi
Databricks Employee
  • 0 kudos

@roberta_cereda  If it’s specific to “materializeSourceTimeMs” then it’s “time taken to materialize source (or determine it's not needed)”

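To inspect the metric yourself, DESCRIBE HISTORY returns operationMetrics as a map column that can be queried directly; a sketch with a placeholder table name:

    history_df = spark.sql("DESCRIBE HISTORY my_schema.my_table")

    (history_df
        .where("operation = 'MERGE'")
        .selectExpr(
            "version",
            "operationMetrics['materializeSourceTimeMs'] AS materialize_source_time_ms",
        )
        .show())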
pranav_k1
by New Contributor III
  • 716 Views
  • 3 replies
  • 1 kudos

Resolved! Error while loading mosaic in notebook - TimeoutException: Futures timed out after [80 seconds]

I am working on reading spatial data with Mosaic and GDAL. Previously I used databricks-mosaic 0.3.9 with a Databricks 12.2 LTS cluster, installed with the following command: %pip install databricks-mosaic==0.3.9 --quiet. Now it's giving a timeout er...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

Hi @pranav_k1, thanks for confirming it worked for you now! I see that the usual %pip install databricks-mosaic cannot install due to the fact that it has thus far allowed geopandas to essentially install the latest... As of geopandas==0.14.4, the vers...

2 More Replies
