Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

hkmodi
by New Contributor II
  • 2572 Views
  • 3 replies
  • 0 kudos

Perform row_number() filter in autoloader

I have created an Auto Loader job that reads JSON data from S3 (files with no extension) using cloudFiles.format set to text. This job is supposed to run every 4 hours and read all the new data that has arrived. But before writing into a delta table...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @hkmodi, basically, as @daniel_sahal said, the bronze layer should reflect the source system. The silver layer is dedicated to deduplication/cleaning/enrichment of the dataset. If you still need to deduplicate at the bronze layer, you have 2 options: use me...

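A minimal sketch in the spirit of the first option: deduplicate each micro-batch with row_number() inside foreachBatch before the bronze write. The JSON schema, key column, ordering column, paths, and table name below are hypothetical placeholders, not taken from the thread.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def dedupe_and_write(batch_df, batch_id):
    # parse the raw JSON text; schema, key, and ordering columns are assumptions
    parsed = (batch_df
        .select(F.from_json("value", "event_id STRING, ingest_ts TIMESTAMP, payload STRING").alias("j"))
        .select("j.*"))
    # keep only the newest record per key within this micro-batch
    w = Window.partitionBy("event_id").orderBy(F.col("ingest_ts").desc())
    (parsed
        .withColumn("rn", F.row_number().over(w))
        .filter("rn = 1")
        .drop("rn")
        .write.mode("append")
        .saveAsTable("bronze.events"))

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "text")                          # extensionless JSON read as text
    .load("s3://my-bucket/landing/")                              # hypothetical source path
    .writeStream
    .foreachBatch(dedupe_and_write)
    .option("checkpointLocation", "s3://my-bucket/_chk/events")   # hypothetical
    .trigger(availableNow=True)                                   # fits a run-every-4-hours schedule
    .start())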
2 More Replies
vibhakar
by New Contributor
  • 5894 Views
  • 3 replies
  • 1 kudos

Not able to mount ADLS Gen2 in Databricks

py4j.security.Py4JSecurityException: Method public com.databricks.backend.daemon.dbutils.DBUtilsCore$Result com.databricks.backend.daemon.dbutils.DBUtilsCore.mount(java.lang.String,java.lang.String,java.lang.String,java.lang.String,java.util.Map) is ...

Latest Reply
cpradeep
New Contributor III
  • 1 kudos

Hi, have you sorted this issue? Can you please let me know the solution?

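For reference, a minimal sketch of the standard OAuth mount call that hits this security check; the tenant, application, secret-scope, and storage names are placeholders. This Py4JSecurityException is commonly reported when dbutils.fs.mount is invoked on clusters with table access control or credential passthrough enabled, where the method is blocked by design.

# hypothetical service-principal values; keep the secret in a secret scope
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "sp-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://mycontainer@myaccount.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs)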
2 More Replies
fabien_arnaud
by New Contributor II
  • 2359 Views
  • 6 replies
  • 0 kudos

Data shifted when a pyspark dataframe column only contains a comma

I have a dataframe containing several columns, one of which contains, for one specific record, just a comma and nothing else. When displaying the dataframe with the command display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838')), the data is dis...

Latest Reply
MilesMartinez
New Contributor II
  • 0 kudos

Thank you so much for the solution.

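The accepted fix is not quoted above, so here is a minimal sketch of the usual cause and remedy, assuming the dataframe is loaded from CSV: without explicit quote/escape options, a field consisting of a bare comma can shift every column after it. The path and options are illustrative.

from pyspark.sql.functions import col

df_input = (spark.read
    .option("header", True)
    .option("quote", '"')       # treat quoted fields as single values
    .option("escape", '"')      # handle quotes embedded in fields
    .option("multiLine", True)  # tolerate line breaks inside quoted fields
    .csv("/mnt/raw/vendors/"))  # hypothetical source path

display(df_input.where(col("erp_vendor_cd") == 'B6SA-VEN0008838'))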
5 More Replies
oakhill
by New Contributor III
  • 633 Views
  • 1 reply
  • 0 kudos

How to optimize queries on a 150B table? ZORDER, LC or partitioning?

Hi! I am struggling to understand how to properly manage my table to make queries effective. My table has columns date_time_utc, car_id, car_owner, etc. date_time_utc, car_id and position are usually the ZORDER or Liquid Clustering columns. Selecting max...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

1. According to Databricks, yes. But as always, I recommend performing benchmarks yourself. There are a lot of blog posts saying that it's not always the case. Yesterday, I was at a data community event and the presenter did several benchmarks and ...

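A minimal sketch of the two layouts being compared, expressed against a hypothetical cars table using the columns named in the post; which one wins at 150B rows should be benchmarked, as the reply suggests.

# Option A: liquid clustering (mutually exclusive with ZORDER/partitioning)
spark.sql("ALTER TABLE cars CLUSTER BY (date_time_utc, car_id)")
spark.sql("OPTIMIZE cars")  # rewrites files according to the clustering keys

# Option B: classic Z-ordering on an unpartitioned (or coarsely partitioned) table
spark.sql("OPTIMIZE cars ZORDER BY (date_time_utc, car_id)")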
AlvaroCM
by New Contributor III
  • 1223 Views
  • 2 replies
  • 0 kudos

Resolved! DLT error at validation

Hello, I'm creating a DLT pipeline with Databricks on AWS. After creating an external location for my bucket, I encountered the following error: DataPlaneException: [DLT ERROR CODE: CLUSTER_LAUNCH_FAILURE.CLIENT_ERROR] Failed to launch pipeline cluster...

Latest Reply
AlvaroCM
New Contributor III
  • 0 kudos

Hi! The error was related to the roles and permissions created when the workspace was set up. I reloaded the setup script in a new workspace, and it worked without problems. Hope it helps anyone in the future. Thanks!

1 More Replies
AntonDBUser
by New Contributor III
  • 1314 Views
  • 1 reply
  • 2 kudos

Lakehouse Federation with OAuth connection to Snowflake

Hi! We have a lot of use cases where we need to load data from Snowflake into Databricks, where users are using both R and Python for further analysis and machine learning. For this we have been using Lakehouse Federation combined with basic auth, but are...

Latest Reply
AntonDBUser
New Contributor III
  • 2 kudos

For anyone interested: We solved this by building an OAuth integration to Snowflake ourselves using Entra ID: https://community.snowflake.com/s/article/External-oAuth-Token-Generation-using-Azure-AD We also created some simple Python and R packages tha...

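A minimal sketch of the client-credentials token fetch the linked article describes, using the msal package; the tenant, client, and scope values are hypothetical, and the resulting token would be handed to whatever Snowflake client or connection you use.

import msal

app = msal.ConfidentialClientApplication(
    client_id="<entra-app-client-id>",
    client_credential="<entra-app-secret>",
    authority="https://login.microsoftonline.com/<tenant-id>")

# the scope is the Snowflake External OAuth resource registered in Entra ID (assumed name)
result = app.acquire_token_for_client(
    scopes=["api://<snowflake-oauth-resource>/.default"])

access_token = result["access_token"]  # pass to Snowflake with authenticator="oauth"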
JonHMDavis
by New Contributor II
  • 6311 Views
  • 5 replies
  • 2 kudos

Graphframes not importing on Databricks 9.1 LTS ML

Is GraphFrames for Python meant to be installed by default on Databricks 9.1 LTS ML? Previously I was running the attached Python command on 7.3 LTS ML with no issue; however, now I am getting "no module named graphframes" when trying to import the pa...

Latest Reply
malz
New Contributor II
  • 2 kudos

Hi @MuthuLakshmi, as per the documentation, graphframes comes preinstalled in Databricks Runtime for Machine Learning, but when trying to import the Python module of graphframes, I am getting a "no module found" error. from graphframes i...

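A minimal sketch of a workaround when the import fails on an ML runtime: install the PyPI wrapper in the notebook scope and retry the import. This assumes the GraphFrames JVM library is also present on the cluster (it ships with the ML runtimes); the vertex/edge frames below are a hypothetical smoke test.

%pip install graphframes --quiet

from graphframes import GraphFrame

# quick smoke test with tiny, made-up vertex and edge frames
v = spark.createDataFrame([("a",), ("b",)], ["id"])
e = spark.createDataFrame([("a", "b")], ["src", "dst"])
g = GraphFrame(v, e)
print(g.edges.count())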
4 More Replies
naveenreddy1
by New Contributor II
  • 19576 Views
  • 4 replies
  • 0 kudos

Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. Driver stacktrace

We are using a Databricks 3-node cluster with 32 GB memory. It works fine, but sometimes it automatically throws the error: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.

Latest Reply
RodrigoDe_Freit
New Contributor II
  • 0 kudos

If your job fails, follow this: According to https://docs.databricks.com/jobs.html#jar-job-tips: "Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run will be canceled and ma...

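One hedged, minimal way to keep job output under that 20MB cap is to raise Spark's own log threshold on the driver; this is generic Spark hygiene, not a fix specific to this thread's root cause.

# cut INFO-level chatter so stdout stays small
sc.setLogLevel("WARN")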
3 More Replies
ArturOA
by New Contributor III
  • 4594 Views
  • 7 replies
  • 0 kudos

Attaching to Serverless from Azure Data Factory via Service Principal

Hi, we have issues trying to run Databricks notebooks orchestrated with Azure Data Factory. We have been doing this for a while now without any issues when we use Job Clusters, Existing General Purpose Clusters, or Cluster Pools. We use an Azure Data ...

Latest Reply
h_h_ak
Contributor
  • 0 kudos

Does the service principal have access and permissions for the notebook?

6 More Replies
HamidHamid_Mora
by New Contributor II
  • 4440 Views
  • 4 replies
  • 3 kudos

ganglia is unavailable on DBR 13.0

We created a library in Databricks to ingest Ganglia metrics for all jobs into our delta tables. However, endpoint 8652 is no longer available on DBR 13.0. Is there any other endpoint available? We need to log all metrics for all executed jobs, not on...

Latest Reply
h_h_ak
Contributor
  • 3 kudos

You should have a look here: https://community.databricks.com/t5/data-engineering/azure-databricks-metrics-to-prometheus/td-p/71569

3 More Replies
amanda3
by New Contributor II
  • 1093 Views
  • 3 replies
  • 0 kudos

Flattening JSON while also keeping embedded types

I'm attempting to create DLT tables from a source table that includes a "data" column that is a JSON string. I'm doing something like this: sales_schema = StructType([ StructField("customer_id", IntegerType(), True), StructField("order_numbers",...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

To ensure that the "value" field retains its integer type, you can explicitly cast it after parsing the JSON. from pyspark.sql.functions import col, from_json, expr from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, LongTy...

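The reply's code is cut off above, so here is a minimal self-contained sketch of the same idea: parse the JSON string with an explicit schema so numeric fields keep their types instead of degrading to strings. The field names follow the post; the source dataframe df is assumed.

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, ArrayType, LongType)

sales_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("order_numbers", ArrayType(LongType()), True),
])

parsed = df.withColumn("parsed", from_json(col("data"), sales_schema))

# flatten while keeping the embedded types intact
flat = parsed.select(
    col("parsed.customer_id").alias("customer_id"),
    col("parsed.order_numbers").alias("order_numbers"))
flat.printSchema()  # customer_id: int, order_numbers: array<bigint>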
2 More Replies
xhudik
by New Contributor III
  • 1187 Views
  • 1 reply
  • 1 kudos

Resolved! does stream.stop() generates "ERROR: Query termination received for []" automatically?

Whenever code contains stream.stop(), I get an error like this in STDERR (in the cluster logs): ERROR: Query termination received for [id=b7e14d07-f8ad-4ae6-99de-8a7cbba89d86, runId=5c01fd71-2d93-48ca-a53c-5f46fab726ff] No other message, even if I try to try-cat...

Latest Reply
MuthuLakshmi
Databricks Employee
  • 1 kudos

@xhudik Does stream.stop() generate "ERROR: Query termination received for []" automatically? Yes, this is generated in stderr when there is a stream.stop(). Is ERROR: Query termination received for [] dangerous, or is it just info that the stream was closed?...

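For context, a minimal sketch of a graceful shutdown; since the ERROR line is emitted by the stream's own termination path, wrapping stop() in try/except will not suppress it. The rate source and noop sink are used here only to make the example runnable.

query = (spark.readStream
    .format("rate")            # built-in test source
    .load()
    .writeStream
    .format("noop")            # discard output; fine for demos
    .start())

query.processAllAvailable()    # let in-flight micro-batches finish
query.stop()                   # this is what produces the stderr ERROR line
query.awaitTermination()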
roberta_cereda
by New Contributor
  • 946 Views
  • 1 reply
  • 0 kudos

Describe history operationMetrics['materializeSourceTimeMs']

Hi, during some checks on MERGE execution, I was running the DESCRIBE HISTORY command, and in the operationMetrics column I noticed this information: operationMetrics['materializeSourceTimeMs']. I haven't found that metric in the documentation, so I...

Latest Reply
MuthuLakshmi
Databricks Employee
  • 0 kudos

@roberta_cereda If it's specific to "materializeSourceTimeMs", then it's the "time taken to materialize source (or determine it's not needed)".

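For anyone who wants to inspect the metric themselves, a minimal sketch of pulling it out of the table history; the table name is a placeholder.

hist = spark.sql("DESCRIBE HISTORY main.sales.target_table")  # hypothetical table

(hist
    .where("operation = 'MERGE'")
    .selectExpr(
        "version",
        "timestamp",
        "operationMetrics['materializeSourceTimeMs'] AS materialize_source_ms")
    .show(truncate=False))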
pranav_k1
by New Contributor III
  • 2024 Views
  • 3 replies
  • 1 kudos

Resolved! Error while loading mosaic in notebook - TimeoutException: Futures timed out after [80 seconds]

I am working on reading spatial data with Mosaic and GDAL. Previously I used databricks-mosaic 0.3.9 with a Databricks 12.2 LTS cluster, with the following command: %pip install databricks-mosaic==0.3.9 --quiet Now it's giving a timeout er...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

Hi @pranav_k1, thanks for confirming it worked for you now! I see that the usual %pip install databricks-mosaic cannot install due to the fact that it has thus far allowed geopandas to essentially install the latest... As of geopandas==0.14.4, the vers...

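The reply is truncated above, but the implied workaround is to pin geopandas below the release it calls out before installing Mosaic. A minimal sketch; the exact pin is an assumption to verify against the full reply and your runtime.

# pin geopandas below 0.14.4 (the version the reply flags), then install mosaic
%pip install "geopandas<0.14.4" databricks-mosaic==0.3.9 --quiet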
2 More Replies
DmitriyLamzin
by New Contributor II
  • 4639 Views
  • 2 replies
  • 1 kudos

applyInPandas function hangs in runtime 13.3 LTS ML and above

Hello, recently I've tried to upgrade my runtime env to 13.3 LTS ML and found that it breaks my workload during applyInPandas. My job started to hang during the applyInPandas execution. A thread dump shows that it hangs on direct memory allocation: ...

Labels: Data Engineering, pandas udf
Latest Reply
Marcin_Milewski
New Contributor II
  • 1 kudos

Hi @Debayan, the link just redirects to the same thread? Is there any update on this issue? We see a similar issue with jobs hanging when using mapInPandas.

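For readers trying to reproduce, a minimal sketch of the applyInPandas pattern under discussion; the group key, column, and normalization logic are illustrative, not from the thread.

import pandas as pd
from pyspark.sql import functions as F

df = (spark.range(1_000_000)
    .withColumn("device_id", F.col("id") % 100)
    .withColumn("value", F.rand()))

def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    # per-group standardization; runs as a pandas UDF on the executors
    pdf["value"] = (pdf["value"] - pdf["value"].mean()) / pdf["value"].std()
    return pdf

result = df.groupBy("device_id").applyInPandas(normalize, schema=df.schema)
result.count()  # force execution; on an affected runtime this is where it hangs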
1 More Replies
