Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ws4100e
by New Contributor III
  • 2894 Views
  • 8 replies
  • 0 kudos

DLT pipelines with UC

I'm trying to run a (very simple) DLT pipeline in which the resulting materialized table is published to a UC schema with a managed storage location defined (within an existing EXTERNAL LOCATION). According to the documentation: Publishing to schemas that speci...

Latest Reply
DataGeek_JT
New Contributor II
  • 0 kudos

Did this get resolved? I am getting the same issue.

7 More Replies
Phani1
by Valued Contributor
  • 304 Views
  • 1 reply
  • 0 kudos

Databricks Platform Cleanup and baseline activities.

Hi Team, Kindly share the best practices for managing Databricks Platform Cleanup and baseline activities.

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Phani1, Here are some best practices for managing Databricks Platform Cleanup and baseline activities: Platform Administration: Regularly monitor and manage your Databricks platform to ensure optimal performance. Compute Creation: Choose the ri...

dataslicer
by Contributor
  • 649 Views
  • 2 replies
  • 0 kudos

How to export/clone Databricks Notebook without results via web UI?

When a Databricks Notebook exceeds the size limit, it suggests to `clone/export without results`. This is exactly what I want to do, but the current web UI does not provide the ability to bypass/skip the results in either the `clone` or `export` context...

Latest Reply
dataslicer
Contributor
  • 0 kudos

Thank you @Yeshwanth for the response. I am looking for a way to do this without clearing the current outputs. This is necessary because I want to preserve the existing outputs and fork off another notebook instance to run with a few parameter changes and come...

1 More Reply
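
A note for readers looking for a workaround: the web UI does not expose this today, but the Workspace Export REST API can export a notebook in SOURCE format, which contains only the code (no cell results) and leaves the original notebook and its outputs untouched. A minimal sketch, assuming a personal access token and workspace URL in environment variables and a hypothetical notebook path:

import base64
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # assumption: e.g. https://adb-xxxx.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # assumption: personal access token

resp = requests.get(
    f"{host}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/Users/me@example.com/my_notebook",  # hypothetical notebook path
            "format": "SOURCE"},                          # SOURCE = code only, no results
)
resp.raise_for_status()

# The notebook content comes back base64-encoded
with open("my_notebook.py", "wb") as f:
    f.write(base64.b64decode(resp.json()["content"]))
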
Ramana
by Contributor
  • 1278 Views
  • 3 replies
  • 0 kudos

SHOW GROUPS is not giving groups available at the account level

I am trying to capture all the Databricks groups and their mapping to user/AD group(s). I tried to do this by using show groups, show users, and show grants, following the examples mentioned in the article below, but the show groups command only fetc...

Latest Reply
Ramana
Contributor
  • 0 kudos

Yes, I can use the REST API, but I am looking for a SQL or programmatic way to do this rather than making the API calls, building the complex-datatype DataFrame, and then saving it as a table. Thanks, Ramana

2 More Replies
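
For anyone who does fall back to the REST route mentioned above, a minimal sketch of flattening group memberships into a table via the workspace-level SCIM Groups endpoint; host/token handling and the target table name are assumptions, and spark is the notebook-provided session:

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # assumption: workspace URL
token = os.environ["DATABRICKS_TOKEN"]  # assumption: personal access token

resp = requests.get(
    f"{host}/api/2.0/preview/scim/v2/Groups",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

# Flatten group -> member pairs into rows
rows = [
    (g["displayName"], m.get("display"))
    for g in resp.json().get("Resources", [])
    for m in g.get("members", [])
]

df = spark.createDataFrame(rows, "group_name STRING, member_name STRING")
df.write.mode("overwrite").saveAsTable("admin.group_memberships")  # hypothetical target table
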
kseyser
by New Contributor II
  • 689 Views
  • 2 replies
  • 1 kudos

Predicting compute required to run Spark jobs

I'm working on a project to predict the compute (cores) required to run Spark jobs. Has anyone worked on this or something similar before? How did you get started?

Latest Reply
Yeshwanth
Honored Contributor
  • 1 kudos

@kseyser good day. This documentation might help you in your use case: https://docs.databricks.com/en/compute/cluster-config-best-practices.html#compute-sizing-considerations Kind regards, Yesh

1 More Reply
Lea
by New Contributor II
  • 5913 Views
  • 1 reply
  • 2 kudos

Resolved! Advice for generic file processing for ingestion of multiple data formats

Hello, we are using Delta Live Tables to ingest data from multiple business groups, each with different input file formats and parsing requirements. The input files are ingested from Azure Blob Storage. Right now, we are only servicing three busines...

Latest Reply
raphaelblg
Honored Contributor
  • 2 kudos

Hello @Lea, I'd like to inform you that our platform does not currently provide a built-in feature for ingesting multiple or interchangeable file formats. However, we highly value your input and encourage you to share your ideas through Databricks' ...

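
One pattern the thread hints at, sketched here under assumed paths and formats, is to drive the DLT table definitions from a small config so each business group gets its own bronze table with its own Auto Loader format:

import dlt

# Hypothetical config: one entry per business group
sources = [
    {"name": "group_a_bronze", "path": "abfss://raw@acct.dfs.core.windows.net/group_a/", "format": "csv"},
    {"name": "group_b_bronze", "path": "abfss://raw@acct.dfs.core.windows.net/group_b/", "format": "json"},
]

def make_bronze(cfg):
    # Wrapping the decorated function avoids late-binding issues inside the loop
    @dlt.table(name=cfg["name"])
    def bronze():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", cfg["format"])
            .option("cloudFiles.inferColumnTypes", "true")
            .load(cfg["path"])
        )

for cfg in sources:
    make_bronze(cfg)

Per-group parsing rules can live in the same config (for example an extra select or withColumn step per entry), so onboarding a new business group becomes a config change rather than new pipeline code.
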
thiagoawstest
by Contributor
  • 8476 Views
  • 3 replies
  • 2 kudos

Resolved! Migration Azure to AWS

Hello, today I use Azure Databricks and I want to migrate my workspaces to AWS Databricks. What is the best practice, and which path should I follow? I didn't find anything in the documentation. Thanks.

Latest Reply
thiagoawstest
Contributor
  • 2 kudos

Hello, as I already have a working Databricks environment on Azure, would the best way be to use the databricks-migrate tool?

2 More Replies
orangepepino
by New Contributor II
  • 8322 Views
  • 2 replies
  • 1 kudos

SFTP connection using private key on Azure Databricks

I need to connect to a server to retrieve some files using Spark and a private SSH key. However, to manage the private key safely I need to store it as a secret in Azure Key Vault, which means I don't have the key as a file to pass down in the keyFil...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @orangepepino, Instead of specifying the keyFilePath, you can pass the private key as a PEM string directly. This approach avoids the need for a physical key file. Since you’re already using Azure Key Vault, consider storing the private key as a s...

1 More Reply
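
To illustrate the "key as a string instead of a file" idea, here is a minimal sketch that reads the PEM from a Databricks secret and opens the SFTP session with paramiko on the driver, rather than the Spark SFTP connector discussed in the thread; the scope, key, host, user, and an RSA-format key are all assumptions:

import io
import paramiko

# Assumption: the PEM private key is stored in a secret scope backed by Azure Key Vault
pem_str = dbutils.secrets.get(scope="my-keyvault-scope", key="sftp-private-key")
pkey = paramiko.RSAKey.from_private_key(io.StringIO(pem_str))

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(hostname="sftp.example.com", username="svc_user", pkey=pkey)  # hypothetical host/user

sftp = client.open_sftp()
sftp.get("/remote/path/data.csv", "/tmp/data.csv")  # download to the driver's local disk
sftp.close()
client.close()

# Move the file somewhere Spark can read it from all nodes, then load it
dbutils.fs.cp("file:/tmp/data.csv", "dbfs:/tmp/data.csv")
df = spark.read.csv("dbfs:/tmp/data.csv", header=True)
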
deng_dev
by New Contributor III
  • 649 Views
  • 3 replies
  • 0 kudos

Autoloader ignore one folder in path

Hi everyone! I am trying to set up Autoloader to read JSON files with a specific name from all subfolders under the path except one. Could someone advise how this can be achieved? For example, I need to read from .../*/specific_name.json, but ignore test f...

Latest Reply
standup1
New Contributor III
  • 0 kudos

I think you can use REGEXP to achieve this. This might not be the best way, but it should get the job done. It's all about filtering that file in the df from getting loaded. Try something like this: df.select("*", "_metadata").select("*", "_metadata.file...

2 More Replies
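
A minimal sketch of the filtering idea from the reply above, applied to an Auto Loader stream via the _metadata.file_path column; the base path and the folder to exclude are placeholders, and _metadata assumes a runtime where the column is available:

from pyspark.sql.functions import col

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("abfss://container@acct.dfs.core.windows.net/base/*/specific_name.json")  # placeholder base path
    .withColumn("source_file", col("_metadata.file_path"))
    .filter(~col("source_file").contains("/excluded_folder/"))  # hypothetical folder to skip
)
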
Devsql
by New Contributor III
  • 876 Views
  • 3 replies
  • 2 kudos

How to find whether a given Parquet file got imported into the Bronze Layer?

Hi Team, recently we created a new Databricks project/solution (based on the Medallion architecture) with Bronze-Silver-Gold layer tables. We have created a Delta Live Tables pipeline for the Bronze layer implementation. Source files are Parqu...

Data Engineering
Azure Databricks
Bronze Job
Delta Live Table
Delta Live Table Pipeline
Latest Reply
raphaelblg
Honored Contributor
  • 2 kudos

Hello @Devsql, It appears that you are creating DLT bronze tables using a standard spark.read operation. This may explain why the DLT table doesn't include "new files" during a REFRESH operation. For incremental ingestion of bronze layer data into y...

2 More Replies
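
Following the reply above, a minimal sketch of a bronze DLT table that ingests incrementally with Auto Loader instead of spark.read and keeps the source file path per row, so you can check later whether a given Parquet file was imported; the table name and source path are assumptions:

import dlt
from pyspark.sql.functions import col

@dlt.table(name="bronze_orders")  # hypothetical table name
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("abfss://landing@acct.dfs.core.windows.net/orders/")  # placeholder source path
        .withColumn("source_file", col("_metadata.file_path"))     # record which file each row came from
    )

A simple SELECT DISTINCT source_file on the bronze table then shows which Parquet files have been picked up.
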
shreya_20202
by New Contributor II
  • 904 Views
  • 1 reply
  • 1 kudos

Copy file structure including files from one storage to another incrementally using PySpark

I have a storage account dexflex and two containers, source and destination. The source container has directories and files as below: results search 03 Module19111.json Module19126.json 04 Module11291...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @shreya_20202, It looks like you’re trying to incrementally copy data from the source container to the destination container in Azure Databricks. To achieve this, you’ll need to compare the files in the source and destination directories and co...

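
A minimal sketch of the compare-and-copy approach described in the reply, using the notebook-provided dbutils.fs; the container URLs follow the names from the question but should be treated as placeholders:

src_root = "abfss://source@dexflex.dfs.core.windows.net/results/"       # placeholder
dst_root = "abfss://destination@dexflex.dfs.core.windows.net/results/"  # placeholder

def list_files(root):
    # Recursively collect file paths under root
    out = []
    for info in dbutils.fs.ls(root):
        if info.isDir():
            out.extend(list_files(info.path))
        else:
            out.append(info.path)
    return out

# Relative paths already present in the destination
existing = {p.replace(dst_root, "") for p in list_files(dst_root)}

# Copy only the files that are missing, preserving the folder structure
for src_path in list_files(src_root):
    rel = src_path.replace(src_root, "")
    if rel not in existing:
        dbutils.fs.cp(src_path, dst_root + rel)
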
youssefmrini
by Honored Contributor III
  • 728 Views
  • 0 replies
  • 2 kudos

Delta Lake Liquid Clustering

Support for liquid clustering is now generally available with Databricks Runtime 15.2 and above. Getting started with Delta Lake liquid clustering: https://lnkd.in/eaCZyhbF #DeltaLake #Databricks

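
For reference, a minimal sketch of creating and optimizing a liquid clustered table, run through spark.sql from a notebook; the table and clustering columns are made up:

# CLUSTER BY (rather than PARTITIONED BY or ZORDER) turns on liquid clustering
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_events (
        event_id   BIGINT,
        event_date DATE,
        region     STRING
    )
    CLUSTER BY (event_date, region)
""")

# OPTIMIZE clusters newly written data according to the clustering keys
spark.sql("OPTIMIZE sales_events")
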
thecodecache
by New Contributor II
  • 673 Views
  • 3 replies
  • 0 kudos

Transpile a SQL Script into PySpark DataFrame API equivalent code

Input SQL Script (assume any dialect): SELECT b.se10, b.se3, b.se_aggrtr_indctr, b.key_swipe_ind FROM (SELECT se10, se3, se_aggrtr_indctr, ROW_NUMBER() OVER (PARTITION BY SE10 ...

Latest Reply
thecodecache
New Contributor II
  • 0 kudos

Hi @Kaniz_Fatma, thanks for your response. I'm looking for a utility or an automated way of translating any generic SQL into PySpark DataFrame code. So, the translate function should look like below: def translate(input_sql): # translate/convert it ...

2 More Replies
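
This does not emit DataFrame API code, but a partially automated route is to transpile the source dialect into Spark SQL with the open-source sqlglot package and run the result through spark.sql, after which you are back in the DataFrame API anyway; the dialect names below are assumptions:

import sqlglot

def translate(input_sql: str, source_dialect: str = "tsql") -> str:
    # Transpile from the source dialect into Spark SQL (not DataFrame API code)
    return sqlglot.transpile(input_sql, read=source_dialect, write="spark")[0]

spark_sql = translate("SELECT TOP 5 se10, se3 FROM swipes")  # hypothetical input
df = spark.sql(spark_sql)  # df can now be manipulated with the DataFrame API
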
pjv
by New Contributor III
  • 869 Views
  • 2 replies
  • 0 kudos

Asynchronous API calls from Databricks Workflow job

Hi all, I have many API calls to run in a Python Databricks notebook, which I then run regularly as a Databricks Workflow job. When I test the following code on an all-purpose cluster interactively, i.e. not via a job, it runs perfectly fine. However, when I ...

Latest Reply
pjv
New Contributor III
  • 0 kudos

I actually got it to work, though I do see that if I run two jobs of the same code in parallel, the async execution time slows down. Does the number of workers of the cluster on which the parallel jobs are run affect the execution time of async calls of...

1 More Reply
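
For context, a minimal sketch of the usual asyncio/aiohttp pattern for concurrent API calls from a notebook (the URLs are placeholders). These calls run entirely on the driver, so the number of workers in the cluster does not change their execution time:

import asyncio
import aiohttp

urls = ["https://api.example.com/item/1", "https://api.example.com/item/2"]  # placeholders

async def fetch(session, url):
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# In a notebook cell that already has a running event loop, use `await fetch_all(urls)` instead
results = asyncio.run(fetch_all(urls))
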
Harsh-dataB
by New Contributor II
  • 541 Views
  • 2 replies
  • 1 kudos

Cluster termination using Python script returns exit code 1

I have used cluster termination logic for terminating a cluster. The issue is that the cluster is not terminating gracefully and returns an exit code of 1. The cluster is completing all the Spark jobs, but it stays in a long-running state, hence I create...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @Harsh-dataB, First, review your cluster termination logic. Make sure it accounts for all necessary cleanup tasks and allows sufficient time for Spark jobs to complete. If you’re using custom scripts or logic, ensure that it gracefully handles a...

1 More Reply
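
A minimal sketch of terminating a cluster explicitly through the Clusters REST API once the work is finished; host, token, and cluster ID handling are assumptions. Note that if the script runs on the cluster it is terminating, the process is killed mid-run, which may be where the exit code 1 comes from, so it is safer to call this from outside the cluster or to rely on auto-termination:

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # assumption: workspace URL
token = os.environ["DATABRICKS_TOKEN"]  # assumption: personal access token
cluster_id = os.environ["CLUSTER_ID"]   # assumption: id of the cluster to terminate

# POST /api/2.0/clusters/delete terminates the cluster (it is not permanently deleted)
resp = requests.post(
    f"{host}/api/2.0/clusters/delete",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": cluster_id},
)
resp.raise_for_status()
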