Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by sanjay (Valued Contributor II)
  • 11022 Views
  • 3 replies
  • 1 kudos

Resolved! PySpark dropDuplicates performance issue

Hi, I am trying to delete duplicate records found by key, but it's very slow. It's a continuously running pipeline, so the data is not that huge, but it still takes time to execute this command: df = df.dropDuplicates(["fileName"]). Is there any better approach to d...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @sanjay, when it comes to handling duplicate data in a PySpark DataFrame, there are more effective techniques than relying on a blanket dropDuplicates(). Let's dive into some alternatives: Utilizing dropDuplicates() with Column Su...
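
For readers landing here from search, a minimal sketch of the subset-based and window-based approaches this reply alludes to, assuming a DataFrame df keyed by fileName and a hypothetical ingest_time ordering column:

```python
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Subset-based dedup: an arbitrary row per fileName survives.
deduped = df.dropDuplicates(["fileName"])

# Window-based dedup: makes the survivor explicit (here, the newest row).
# ingest_time is a hypothetical ordering column.
w = Window.partitionBy("fileName").orderBy(col("ingest_time").desc())
deduped = (
    df.withColumn("rn", row_number().over(w))
      .filter(col("rn") == 1)
      .drop("rn")
)

# In a continuously running (streaming) pipeline, bound the dedup state
# with a watermark so it does not grow without limit:
# df.withWatermark("ingest_time", "1 hour").dropDuplicates(["fileName", "ingest_time"])
```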

2 More Replies
by oishimbo (New Contributor)
  • 4563 Views
  • 2 replies
  • 0 kudos

Databricks time travel - how to get ALL changes ever made to a table

Hi time travel gurus, I am investigating creating a reporting solution with AsOf functionality. Users will be able to create a report based on the current data or on the data as of some time ago. Due to the nature of our data, this AsOf feature is qu...
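
For context, a minimal sketch of Delta time travel reads, assuming a Delta table named events (a hypothetical name):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List every version ever committed to the table.
history = spark.sql("DESCRIBE HISTORY events")

# Read the table as of a specific version or timestamp.
v5 = spark.read.option("versionAsOf", 5).table("events")
asof = spark.read.option("timestampAsOf", "2024-01-01 00:00:00").table("events")
```

Note that time travel only reaches back as far as the table's retention and VACUUM settings allow, which matters for an "all changes ever" requirement.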

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hey there! Thanks a bunch for being part of our awesome community!  We love having you around and appreciate all your questions. Take a moment to check out the responses – you'll find some great info. Your input is valuable, so pick the best solution...

1 More Replies
by Miasu (New Contributor II)
  • 959 Views
  • 1 reply
  • 0 kudos

Unable to analyze external table | FileAlreadyExistsException

Hello experts, there's a CSV file, "nyc_taxi.csv", saved under users/myfolder on DBFS, and I used this file to create 2 tables: 1. nyc_taxi: created using the UI, and it appeared as a managed table saved under dbfs:/user/hive/warehouse/mydatabase.db/nyc...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Miasu, when you executed the ANALYZE TABLE command on the nyc_taxi2 table, a FileAlreadyExistsException appeared because the target path already exists and cannot be reused for the operation. To find a resolution, let's delve into some pot...
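
A minimal sketch of the kind of check that usually helps here, run in a Databricks notebook; the table name comes from the post, while the conflicting path below is hypothetical:

```python
# Inspect where the table actually points before re-running ANALYZE.
spark.sql("DESCRIBE EXTENDED nyc_taxi2").show(truncate=False)

# If a stale directory is the conflict, list it to confirm before
# removing anything (path is hypothetical).
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/mydatabase.db/nyc_taxi2"))

# Then retry the statistics collection.
spark.sql("ANALYZE TABLE nyc_taxi2 COMPUTE STATISTICS")
```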

by chrisf_sts (New Contributor II)
  • 7367 Views
  • 2 replies
  • 1 kudos

Resolved! After moving a mounted S3 bucket under Unity Catalog control, Python file paths no longer work

I had been using a mounted external S3 bucket with JSON files up until a few days ago, when my company moved all file mounts under Unity Catalog control. Suddenly I can no longer run a command like: with open("/mnt/my_files/my_json...
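
Under Unity Catalog, local-style /mnt/... paths typically move to Volumes paths, which still work with plain Python file APIs. A hedged sketch with hypothetical catalog, schema, and volume names:

```python
import json

# UC Volumes replace /mnt mounts but remain POSIX-style paths,
# so ordinary Python file I/O keeps working (names are hypothetical).
with open("/Volumes/my_catalog/my_schema/my_files/my_json.json") as f:
    data = json.load(f)
```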

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

1 More Replies
by Boyan (New Contributor II)
  • 1325 Views
  • 2 replies
  • 0 kudos

Running unit tests and hyperopt causes a broadcast variable exception

Hello, we are using hyperopt to train a model with a relatively large training dataset. We've experienced some performance issues and, following the suggestions in this notebook, we broadcasted the dataset. To verify that broadcasting the dataset resolved the ...
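
For readers hitting the same issue, a minimal sketch of the broadcast pattern with hyperopt, assuming a pandas DataFrame train_pdf and a hypothetical train_and_score helper:

```python
from hyperopt import STATUS_OK, fmin, hp, tpe

# Broadcast the training data once so each trial reads the same copy
# instead of re-shipping it with every task.
bc_train = spark.sparkContext.broadcast(train_pdf)

def objective(params):
    data = bc_train.value                      # materialized on the worker
    loss = train_and_score(data, params)       # hypothetical helper
    return {"loss": loss, "status": STATUS_OK}

best = fmin(
    fn=objective,
    space={"lr": hp.loguniform("lr", -5, 0)},
    algo=tpe.suggest,
    max_evals=20,
)
```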

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Boyan, here are a few links for you. Use Databricks Repos and the Repos API: Databricks Repos allow cloning whole git repositories in Databricks, and with the help of the Repos API you can automate this process by first cloning a git repository and then ...

1 More Replies
by tinai_long (New Contributor III)
  • 7550 Views
  • 10 replies
  • 4 kudos

Resolved! How to refresh a single table in Delta Live Tables?

Suppose I have a Delta Live Tables framework with 2 tables: Table 1 ingests from a json source, Table 2 reads from Table 1 and runs some transformation. In other words, the data flow is json source -> Table 1 -> Table 2. Now if I find some bugs in the...

Latest Reply
cpayne_vax
New Contributor III
  • 4 kudos

Answering my own question: nowadays (February 2024) this can all be done via the UI. When viewing your DLT pipeline there is a "Select tables for refresh" button in the header. If you click this, you can select individual tables, and then in the botto...
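
For automation, the same selective refresh appears to be exposed through the pipeline updates REST endpoint; a hedged sketch only, with placeholder host, token, and pipeline ID, refreshing the thread's "Table 2" by name:

```python
import requests

host = "https://<workspace-host>"     # placeholder
token = "<personal-access-token>"     # placeholder
pipeline_id = "<pipeline-id>"         # placeholder

# Start an update that refreshes only the selected tables.
resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={"refresh_selection": ["table_2"]},
)
resp.raise_for_status()
print(resp.json())
```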

9 More Replies
by brickster_2018 (Esteemed Contributor)
  • 10731 Views
  • 3 replies
  • 6 kudos

Resolved! How to add custom logging in Databricks

I want to add custom logs that are redirected into the Spark driver logs. Can I use the existing logger classes to get my application logs or progress messages into the Spark driver logs?
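
A minimal sketch of two common approaches in a Databricks notebook; the logger name is arbitrary, and the log4j route is an assumption based on the 1.x-style API being bridged on newer runtimes:

```python
import logging

# Option 1: plain Python logging; output lands in the driver's
# stdout/stderr logs.
py_logger = logging.getLogger("my_app")
py_logger.setLevel(logging.INFO)
py_logger.addHandler(logging.StreamHandler())
py_logger.info("progress message from Python logging")

# Option 2: write through the JVM's log4j so messages interleave
# with Spark's own driver log4j output.
j_logger = spark._jvm.org.apache.log4j.LogManager.getLogger("my_app")
j_logger.info("progress message via log4j")
```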

Latest Reply
Kaizen
Valued Contributor
  • 6 kudos

1) Is it possible to save all the custom logging to its own file? Currently it is being logged with all the other cluster logs (see image). 2) Also, it seems like a lot of blank files are being created for this. Is this a bug? This include...

2 More Replies
by sha (New Contributor)
  • 1103 Views
  • 1 reply
  • 0 kudos

Importing data from S3 to an Azure Databricks cluster with Unity Catalog in Shared mode

Environment details: Databricks on Azure, 13.3 LTS, Unity Catalog, Shared cluster mode. Currently, in the environment I'm in, we run imports from S3 with code like spark.read.option('inferSchema', 'true').json(s3_path). When running on a cluster in Sha...

Latest Reply
BR_DatabricksAI
Contributor
  • 0 kudos

Hello Sha, we usually see such errors while working in shared cluster mode. Assuming this is your dev environment, the simplest way to avoid the error is to use a different cluster. However, as an alternative solution, in case you would like to keep the shared cluster, the...
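
In case it helps other readers: the Unity Catalog-native alternative on a shared cluster is usually to route the S3 access through an external location grant rather than cluster-level credentials. A hedged sketch, with a hypothetical bucket, path, and schema:

```python
from pyspark.sql.types import StringType, StructField, StructType

# Once READ FILES is granted on an external location covering this
# bucket, the path can be read directly on a shared cluster
# (bucket/path are hypothetical).
s3_path = "s3://my-bucket/imports/"

# An explicit schema also avoids the extra inference pass over S3.
schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
])
df = spark.read.schema(schema).json(s3_path)
```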

by Dhruv-22 (New Contributor III)
  • 3139 Views
  • 4 replies
  • 0 kudos

CREATE TABLE does not overwrite location whereas CREATE OR REPLACE TABLE does

I am working on Azure Databricks, with Databricks Runtime version being - 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12). I am facing the following issue. Suppose I have a view named v1 and a database f1_processed created from the following comman...

Latest Reply
Ayushi_Suthar
Honored Contributor
  • 0 kudos

Hi @Dhruv-22, based on the information you shared above, the CREATE OR REPLACE and CREATE commands in Databricks do have different behaviours, particularly when it comes to handling tables with specific target locations. The "CREATE OR REPLACE"...
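
A small sketch of the asymmetry being described, with hypothetical table names and location, and comments reflecting the behaviour as this thread reports it:

```python
# As reported in the thread: CREATE TABLE at a location that already
# holds data registers the table over the existing files; it does not
# overwrite them (names and location are hypothetical).
spark.sql("""
    CREATE TABLE f1_processed.results
    USING DELTA
    LOCATION 'abfss://processed@account.dfs.core.windows.net/f1/results'
    AS SELECT * FROM v1
""")

# CREATE OR REPLACE TABLE replaces the table definition and its contents.
spark.sql("""
    CREATE OR REPLACE TABLE f1_processed.results
    USING DELTA
    LOCATION 'abfss://processed@account.dfs.core.windows.net/f1/results'
    AS SELECT * FROM v1
""")
```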

3 More Replies
by DApt (New Contributor II)
  • 8245 Views
  • 1 reply
  • 2 kudos

REDACTED_POSSIBLE_SECRET_ACCESS_KEY as part of column value result from aes_encrypt

Hi, I've encountered an error using base64/aes_encrypt: as a result, the saved string contains 'REDACTED_POSSIBLE_SECRET_ACCESS_KEY' at the end, destroying the original data and rendering it undecryptable. Is there a way to avoid this replacement in...

Latest Reply
DataEnthusiast1
New Contributor II
  • 2 kudos

I had the same issue, and my usage was similar to the OP's: base64(aes_encrypt(<clear_text>, unbase64(secret(<scope>, <key>)))). Databricks support suggested not calling secret() within the insert/update operation that writes to the table. After updating the py...
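
A hedged sketch of that suggestion, fetching the key outside the SQL statement; the scope, key, and table names are hypothetical, and the parameterized spark.sql form assumes a runtime with Spark 3.4+:

```python
# Fetch the key once in Python rather than calling secret() inside the
# INSERT that writes to the table (scope/key names are hypothetical).
key_b64 = dbutils.secrets.get(scope="my_scope", key="aes_key")

spark.sql(
    """
    INSERT INTO target_table
    SELECT base64(aes_encrypt(clear_text, unbase64(:k))) AS encrypted
    FROM source_table
    """,
    args={"k": key_b64},
)
```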

by Dhruv-22 (New Contributor III)
  • 2927 Views
  • 4 replies
  • 1 kudos

Resolved! REPLACE TABLE AS SELECT is not working with parquet whereas it works fine for delta

I am working on Azure Databricks, with Databricks Runtime version being - 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12). I am facing the following issue. Suppose I have a view named v1 and a database f1_processed created from the following comman...

Latest Reply
Ayushi_Suthar
Honored Contributor
  • 1 kudos

Hi @Dhruv-22, we understand that you are facing this error when using REPLACE TABLE AS SELECT on a Parquet table; at this moment, the REPLACE TABLE AS SELECT operation you're trying to perform is not supported for Parquet tables. Accord...

3 More Replies
by Kroy (Contributor)
  • 1299 Views
  • 4 replies
  • 0 kudos

Near real-time solution for data from a core system that gets updated

We are trying to build a solution where customer data stored in an RDBMS (SQL Server) is moved to a Delta Lake in a medallion architecture, and we want this to be near real time using a DLT pipeline. The problem is that the source tab...
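
For the change-data half of this, a minimal sketch of DLT's apply_changes flow, assuming a bronze CDC feed is already landing from SQL Server; all names here are hypothetical:

```python
import dlt
from pyspark.sql.functions import col

# Silver table that stays in sync with the mutating source rows.
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze_cdc",        # hypothetical CDC feed from SQL Server
    keys=["customer_id"],
    sequence_by=col("change_timestamp"),  # ordering column in the feed
    stored_as_scd_type=1,                 # keep only the latest version of each row
)
```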

Latest Reply
Kroy
Contributor
  • 0 kudos

I came across this matrix while reading about DLT. What do "read from complete" and "write to incremental" mean?

3 More Replies
by billfoster (New Contributor II)
  • 14205 Views
  • 9 replies
  • 4 kudos

How can I learn Databricks?

I am currently enrolled in a data engineering boot camp. We go over various technologies: Azure, PySpark, Airflow, Hadoop, NoSQL, SQL, Python. But not something like Databricks. I am in contact with lots of recent graduates who landed a job. Almo...

Latest Reply
Ali23
New Contributor II
  • 4 kudos

I'd be glad to help you on your journey to learning Databricks! Whether you're a beginner or aiming to advance your skills, here's a comprehensive guide: Foundations: Solid understanding of core concepts: Begin with foundational knowledge in big data,...

8 More Replies

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.
