Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

AChang
by New Contributor III
  • 2917 Views
  • 1 reply
  • 0 kudos

Best Cluster Setup for intensive transformation workload

I have a PySpark dataframe: 61k rows, 3 columns, one of which is a string column with a max length of 4k. I'm doing about 100 different regexp_replace functions on this dataframe, so it's very resource intensive. I'm trying to write this to a delta ...

Data Engineering
cluster
ETL
regex
Latest Reply
Leonardo
New Contributor III
  • 0 kudos

It seems that you're applying a lot of transformations, but they're basic operations, so I'd follow the best-practices documentation and find a way to create a compute-optimized cluster. Ref.: https://docs.databricks.com/en/clusters/cluster-config-best...
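Worth adding: a long chain of regexp_replace calls composes into a single projection, so the cluster spends its time on regex evaluation rather than shuffles. A minimal sketch of that pattern (column name and patterns are hypothetical placeholders):

# Chain many regexp_replace calls into one expression over one column.
from functools import reduce
from pyspark.sql import functions as F

df = spark.createDataFrame([("a foo  b",)], ["text"])  # stand-in for the 61k-row dataframe
patterns = [(r"\bfoo\b", "bar"), (r"\s+", " ")]        # ... ~100 pairs in practice

def clean(col):
    return reduce(lambda c, p: F.regexp_replace(c, p[0], p[1]), patterns, col)

df = df.withColumn("text_clean", clean(F.col("text")))
df.write.format("delta").mode("overwrite").saveAsTable("cleaned_table")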

AryaMa
by New Contributor III
  • 39510 Views
  • 13 replies
  • 8 kudos

Resolved! Reading data from URL using Spark

Reading data from a URL using Spark, Community Edition; got a path-related error, any suggestions please? url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv" from pyspark import SparkFiles spark.sparkContext.addFil...

Latest Reply
padang
New Contributor II
  • 8 kudos

Sorry, bringing this back up... from pyspark import SparkFiles url = "http://raw.githubusercontent.com/ltregan/ds-data/main/authors.csv" spark.sparkContext.addFile(url) df = spark.read.csv("file://"+SparkFiles.get("authors.csv"), header=True, inferSc...
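For anyone landing here, a completed version of that snippet (hedged: inferSchema=True is assumed to be the truncated argument, and addFile only downloads to the driver's local disk, so the file:// path is reliable on single-node/Community Edition clusters):

# spark is the notebook-provided SparkSession; addFile fetches the file to the driver.
from pyspark import SparkFiles

url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv("file://" + SparkFiles.get("adult.csv"), header=True, inferSchema=True)
df.show(5)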

12 More Replies
Abhishek7781
by New Contributor II
  • 4421 Views
  • 1 reply
  • 0 kudos

Unable to run a dbt project through Databricks Workflows

I'm trying to run a dbt project which reads data from ADLS and writes back to ADLS using a Databricks Workflow. When I run the same project from my local machine (using python virtual environment from Visual Studio Code), it's running perfectly fine ...

Latest Reply
Abhishek7781
New Contributor II
  • 0 kudos

I tried installing an older version (2.1.0) of databricks-sql-connector (instead of 2.7.0), and surprisingly a new error message appeared. I don't know how to fix this now.
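For others hitting connector version conflicts in Workflows, one way to pin the version is at the top of the notebook the job runs (a sketch assuming a notebook-based task; the versions are the ones from this thread):

# Databricks notebook cell: pin the connector before anything imports it.
%pip install databricks-sql-connector==2.1.0

If the job uses a dedicated dbt task type instead, the same pin can usually go on the job cluster's library list rather than in a notebook cell.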

AnaLippross
by New Contributor
  • 11343 Views
  • 1 reply
  • 1 kudos

Schema issues with External Tables

Hi everyone! We have started using Unity Catalog in our project and I am seeing weird behavior with the schemas of external tables imported to Databricks. In Data Explorer, when I expand some tables, I see that the schema of those specific tables is w...

Data Engineering
External Tables
Unity Catalog
Latest Reply
youssefmrini
Databricks Employee
  • 1 kudos

It seems like you are encountering an issue with the schema mapping when importing external tables to Unity Catalog in Databricks. To troubleshoot this: based on the information you've provided, it sounds like the issue you're experiencing could be rel...

xavier20
by New Contributor
  • 18097 Views
  • 2 replies
  • 1 kudos

SQL Execution API Code 400

I am trying to execute the following command to test the API but am getting response 400: import json import os from urllib.parse import urljoin, urlencode import pyarrow import requests # NOTE set debuglevel = 1 (or higher) for http debug logging from http.client...

Latest Reply
youssefmrini
Databricks Employee
  • 1 kudos

A 400 status code response indicates that the server was unable to process the request due to a client error, e.g., incorrect syntax or invalid parameters. Based on the code you provided, it appears that you are trying to execute a SQL query against your...
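For reference, a minimal request against the SQL Statement Execution API that tends to isolate the 400 (endpoint and fields per the public API docs; host, token, and warehouse ID are placeholders):

# A 400 response body from this endpoint usually names the offending field.
import requests

host = "https://<workspace-host>"        # placeholder
token = "<personal-access-token>"        # placeholder
resp = requests.post(
    f"{host}/api/2.0/sql/statements/",
    headers={"Authorization": f"Bearer {token}"},
    json={"warehouse_id": "<warehouse-id>", "statement": "SELECT 1", "wait_timeout": "30s"},
)
print(resp.status_code, resp.json())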

1 More Replies
PrebenOlsen
by New Contributor III
  • 3952 Views
  • 3 replies
  • 1 kudos

Can't start, delete, unpin or edit cluster: User is not part of org

Hi! I'm getting this error message: DatabricksError: User XXX is not part of org: YYY. Config: host=https://adb-ZZZ.azuredatabricks.net, auth_type=runtime. I am in the admins group, but I cannot alter this in any way. I've tried using the databricks-SDK using: fr...

Latest Reply
youssefmrini
Databricks Employee
  • 1 kudos

To resolve this issue, I would recommend taking the following steps: Verify that you have the correct access and permissions: check with your Databricks organization admin to ensure that your user account has the appropriate access level and permission...
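One more thing worth trying from a notebook: pass explicit credentials instead of relying on auth_type=runtime, so the SDK targets the workspace you intend (a hedged sketch; the token is a placeholder):

# Verify which user the SDK authenticates as against the intended workspace.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(
    host="https://adb-ZZZ.azuredatabricks.net",  # the workspace from the error
    token="<personal-access-token>",             # placeholder
)
print(w.current_user.me().user_name)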

2 More Replies
Christine
by Contributor II
  • 37654 Views
  • 4 replies
  • 1 kudos

pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error

Not sure if this is the right place to ask this question, so let me know if it is not. I am trying to read an xls file which contains #REF values in Databricks with pyspark.pandas. When I try to read the file with "pyspark.pandas.read_excel(file_pat...

Latest Reply
youssefmrini
Databricks Employee
  • 1 kudos

It sounds like you're trying to open an Excel file that has some invalid references, which is causing an error when you try to read it with pyspark.pandas.read_excel(). One way to handle invalid references is to use the openpyxl engine instead of xlr...
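A minimal sketch of that engine swap (the path is a placeholder, and note that openpyxl reads .xlsx workbooks, so an .xls file may first need saving in that format):

# Read the workbook with the openpyxl engine instead of xlrd.
import pyspark.pandas as ps

df = ps.read_excel("/mnt/data/report.xlsx", engine="openpyxl")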

3 More Replies
Thor
by New Contributor III
  • 32239 Views
  • 3 replies
  • 6 kudos

How to remove duplicates in a Delta table?

I made multiple inserts (by mistake) into a Delta table and I now have exact duplicates. I feel like it's impossible to delete them if you don't have an IDENTITY column to distinguish rows (the primary key is RLOC+LOAD_DATE): it sounds odd to me not to...

snap
delete
identity
Latest Reply
Ken_H
New Contributor II
  • 6 kudos

There are several great ways to handle this: https://stackoverflow.com/questions/61674476/how-to-drop-duplicates-in-delta-table This was my preference: with cte as (select col1, col2, col3, etc, row_number() over (partition by col1, col2, col3, etc order by co...
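One way to finish that row_number() pattern from a notebook (a hedged sketch: the table name and ordering column are hypothetical, and Delta's snapshot isolation is assumed to permit the read-then-overwrite on the same table):

# Keep exactly one row per (RLOC, LOAD_DATE) and rewrite the table.
deduped = (
    spark.sql("""
        SELECT *, row_number() OVER (PARTITION BY RLOC, LOAD_DATE ORDER BY RLOC) AS rn
        FROM my_delta_table
    """)
    .where("rn = 1")
    .drop("rn")
)
deduped.write.format("delta").mode("overwrite").saveAsTable("my_delta_table")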

2 More Replies
raghunathr
by New Contributor III
  • 18323 Views
  • 2 replies
  • 4 kudos

Resolved! Benefits of Databricks Views vs Tables

Do we get any explicit benefits from Databricks views when the view is going to be a simple select of a table? Does using views instead of tables improve performance? What about granting access to views vs. tables?

Latest Reply
youssefmrini
Databricks Employee
  • 4 kudos

There can be several benefits to using Databricks views, even when the view is a simple select of a table: Improved query readability and maintainability: by encapsulating queries in views, you can simplify complex queries, making them more readable an...
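To make the access-control angle concrete, a small hypothetical example (all names are placeholders; granting on the view lets consumers query it without rights on the underlying table — adjust the GRANT syntax to your catalog setup):

# Create a restricted view over a base table, then grant only the view.
spark.sql("""
    CREATE OR REPLACE VIEW sales_eu AS
    SELECT order_id, amount FROM sales WHERE region = 'EU'
""")
spark.sql("GRANT SELECT ON VIEW sales_eu TO `analysts`")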

1 More Replies
ashish577
by New Contributor III
  • 3852 Views
  • 3 replies
  • 1 kudos

Any way to access a Unity Catalog location through Python/dbutils

I have a table created in Unity Catalog that was dropped; the files are not deleted due to the 30-day soft delete. Is there any way to copy the files to a different location? When I try to use dbutils.fs.cp I get a location-overlap error with Unity Cata...

Latest Reply
youssefmrini
Databricks Employee
  • 1 kudos

You can use the dbutils.fs.mv command to move the files from the deleted table to a new location. Here's an example of how to do it: # Define the paths source_path = "dbfs:/mnt/<unity-catalog-location>/<database-name>/<table-name>" target_path =...
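A completed version of that sketch (the paths stay placeholders, as in the reply; the destination is hypothetical, recurse=True copies the whole table directory, and dbutils is the notebook-provided utility object):

# Copy the soft-deleted table's files to a recovery location.
source_path = "dbfs:/mnt/<unity-catalog-location>/<database-name>/<table-name>"
target_path = "dbfs:/mnt/recovery/<table-name>"  # hypothetical destination
dbutils.fs.cp(source_path, target_path, recurse=True)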

2 More Replies
sarnendude
by New Contributor II
  • 4930 Views
  • 3 replies
  • 2 kudos

Unable to enable Databricks Assistant

Databricks Assistant is currently in Public Preview. As per the documentation below, I clicked the 'Account Console' link to log in and enable Databricks Assistant, but I am not getting the "Settings" option on the left side of the admin console. Once I log in using Azu...

Data Engineering
databricksassistant
Latest Reply
youssefmrini
Databricks Employee
  • 2 kudos

To enable Databricks Assistant, you need to navigate to the Admin Console in your Databricks workspace and follow these steps: Log in to your Databricks workspace using an account with workspace admin privileges. Click on the "Admin Console" icon in th...

2 More Replies
User16783853501
by Databricks Employee
  • 4688 Views
  • 2 replies
  • 2 kudos

Using Delta Time Travel, what is the scalability limit for the feature, and at what point does time travel become infeasible?

Using Delta Time Travel, what is the scalability limit for the feature, and at what point does time travel become infeasible?

Latest Reply
youssefmrini
Databricks Employee
  • 2 kudos

The scalability limit for using Delta Time Travel depends on several factors, including the size of your Delta tables, the frequency of changes to the tables, and the retention periods for the Delta versions. In general, Delta Time Travel can become i...
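For concreteness, a hedged sketch of the retention knobs that bound this (the table name is hypothetical; both properties are documented Delta table properties, and longer retention means more log and data files to keep around):

# Control how much history is retained, and therefore how far back you can travel.
spark.sql("""
    ALTER TABLE my_delta_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 30 days'
    )
""")
# Query a past version:
old = spark.sql("SELECT * FROM my_delta_table VERSION AS OF 5")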

1 More Replies
ravi28
by New Contributor III
  • 28607 Views
  • 7 replies
  • 8 kudos

How to set up job notifications using a Microsoft Teams webhook?

A couple of things I tried: 1. I created a webhook connector in Microsoft Teams and copied it into Notification destinations via Admin page -> New destination -> from the dropdown I selected Microsoft Teams -> added the webhook URL and saved it. Outcome: I don't get the ...

Latest Reply
youssefmrini
Databricks Employee
  • 8 kudos

You can set up job notifications for Databricks jobs using Microsoft Teams webhooks by following these steps: Set up a Microsoft Teams webhook: go to the channel where you want to receive notifications in Microsoft Teams. Click on the "..." icon next to...
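Once the destination exists, the last step is referencing it from the job itself; a hedged fragment of the job settings shape used by the Jobs API 2.1 (the destination ID is a placeholder copied from the admin console):

# Fragment of a job's settings wiring notifications to the Teams destination.
job_settings_fragment = {
    "webhook_notifications": {
        "on_failure": [{"id": "<notification-destination-id>"}],
        "on_success": [{"id": "<notification-destination-id>"}],
    }
}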

6 More Replies
bzh
by New Contributor
  • 6251 Views
  • 3 replies
  • 3 kudos

Large data ingestion issue using Auto Loader

The goal of this project is to ingest 1000+ files (100 MB per file) from S3 into Databricks. Since these will be incremental changes, we are using Auto Loader for continued ingestion and transformation using a cluster (i3.xlarge). The current process i...

Latest Reply
youssefmrini
Databricks Employee
  • 3 kudos

There are several possible ways to improve the performance of your Spark streaming job for ingesting a large volume of S3 files. Here are a few suggestions: Tune the spark.sql.shuffle.partitions config parameter: by default, the number of shuffle part...
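A hedged sketch combining the two most common knobs (paths and table names are placeholders; cloudFiles.maxFilesPerTrigger is a documented Auto Loader option for bounding micro-batch size):

# Cap work per micro-batch and right-size shuffles for the cluster.
spark.conf.set("spark.sql.shuffle.partitions", "64")  # roughly match total cores

(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 50)  # ~100 MB files -> bounded batches
    .load("s3://my-bucket/incoming/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/ingest")
    .toTable("bronze.events"))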

2 More Replies
elifa
by New Contributor II
  • 4094 Views
  • 3 replies
  • 1 kudos

DLT cloudfiles trigger interval not working

I have the following streaming table definition using the cloudFiles format and the pipelines.trigger.interval setting to reduce file discovery costs, but the query is triggering every 12 seconds instead of every 5 minutes. Is there another configuration I am ...

Data Engineering
autloader
cloudFiles
dlt
trigger
Latest Reply
Tharun-Kumar
Databricks Employee
  • 1 kudos

@elifa Could you check for this message in the log file? INFO EnzymePlanner: Planning for flow: s3_data. According to the config pipelines.trigger.interval, the planning should happen once every 5 minutes.
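For comparison, a minimal sketch of where the setting is expected to live in a DLT Python pipeline (the table name and path are placeholders; spark_conf on @dlt.table is the documented way to scope it to one flow):

import dlt

# Scope the trigger interval to this one streaming table.
@dlt.table(spark_conf={"pipelines.trigger.interval": "5 minutes"})
def s3_data():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("s3://my-bucket/raw/"))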

2 More Replies