Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

c038644
by New Contributor II
  • 1982 Views
  • 3 replies
  • 3 kudos

Use of venv pack

Hi, I'm very new so this probably sounds stupid... I'm following the blog on How to Manage Python Dependencies in PySpark: https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html ...but when I try, the packing works fin...

Latest Reply
Debayan
Databricks Employee
  • 3 kudos

Can you try using an absolute path instead of a relative path? For reference: https://stackoverflow.com/questions/38661464/filenotfounderror-winerror-3
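For reference, a minimal sketch of the flow from that blog post, with the archive written to an absolute path (the filename and locations are placeholders, not the poster's actual paths):

    import os
    import venv_pack  # pip install venv-pack
    from pyspark.sql import SparkSession

    # Pack the active virtual environment to an ABSOLUTE path; relative
    # paths are a common cause of FileNotFoundError / WinError 3 here.
    venv_pack.pack(output='/tmp/pyspark_venv.tar.gz')  # placeholder path

    # Ship the archive to the executors; '#environment' is the unpack dir.
    os.environ['PYSPARK_PYTHON'] = './environment/bin/python'
    spark = SparkSession.builder.config(
        'spark.archives', '/tmp/pyspark_venv.tar.gz#environment'
    ).getOrCreate()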

2 More Replies
AnandR
by New Contributor
  • 978 Views
  • 1 reply
  • 1 kudos

I have 2 roles created for my Databricks account on AWS. Want to know which role Databricks will use for AWS resources (e.g. cluster creation)

I have 1 role with the AWS root account and 1 role with an AWS non-root account. How do I tell Databricks to use a specific role for cluster creation? Please guide me here; any documentation will also suffice. Thanks.

Latest Reply
AmanSehgal
Honored Contributor III
  • 1 kudos

Go to Settings > Admin Console. Under the Instance Profiles tab you can add an instance profile, which is a container for an IAM role. Using this you can let the EC2 instances know which S3 buckets they can access. Under the Users tab you can manage users who have access...
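To make the role choice explicit per cluster, the Clusters API also accepts an instance profile ARN in aws_attributes. A hedged sketch (workspace URL, token, and ARN are placeholders; the profile must already be registered under Admin Console > Instance Profiles):

    import requests

    HOST = 'https://<your-workspace>.cloud.databricks.com'  # placeholder
    TOKEN = '<personal-access-token>'                       # placeholder

    resp = requests.post(
        f'{HOST}/api/2.0/clusters/create',
        headers={'Authorization': f'Bearer {TOKEN}'},
        json={
            'cluster_name': 'etl-cluster',
            'spark_version': '11.3.x-scala2.12',
            'node_type_id': 'i3.xlarge',
            'num_workers': 2,
            # Pin the cluster to a specific IAM role via its instance profile.
            'aws_attributes': {
                'instance_profile_arn':
                    'arn:aws:iam::<account-id>:instance-profile/<profile-name>'
            },
        },
    )
    resp.raise_for_status()
    print(resp.json()['cluster_id'])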

TT1
by New Contributor III
  • 2185 Views
  • 2 replies
  • 8 kudos
Latest Reply
AmanSehgal
Honored Contributor III
  • 8 kudos

Notebooks are auto-saved, and you can track changes by clicking Revision History in the top-right corner of the notebook. You can also link a Git repo to your notebook to track changes.

1 More Reply
zyang
by Contributor
  • 1635 Views
  • 1 reply
  • 4 kudos

pyspark delta table schema evolution

I am using schema evolution on a Delta table, and the code is written in a Databricks notebook. df.write.format("delta").mode("append").option("mergeSchema", "true").partitionBy("date").save(path) But I ...

Latest Reply
Noopur_Nigam
Databricks Employee
  • 4 kudos

Hi @z yang, please provide the df creation code as well, so we can understand the complete exception and scenario.
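For anyone hitting the same question, a self-contained sketch of the pattern from the post (the table path is hypothetical; `spark` is predefined in a Databricks notebook). The second append adds a column, which mergeSchema lets through instead of failing with an AnalysisException:

    path = '/tmp/delta/events'  # hypothetical path

    # First write establishes the schema (id, date).
    df1 = spark.createDataFrame([(1, '2023-01-01')], ['id', 'date'])
    df1.write.format('delta').mode('append').partitionBy('date').save(path)

    # Second write adds a new column; mergeSchema evolves the table schema.
    df2 = spark.createDataFrame([(2, '2023-01-02', 'click')],
                                ['id', 'date', 'event'])
    (df2.write
        .format('delta')
        .mode('append')
        .option('mergeSchema', 'true')
        .partitionBy('date')
        .save(path))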

j02424
by New Contributor
  • 2861 Views
  • 1 reply
  • 4 kudos

Best practice to delete /dbfs/tmp ?

What is best practice regarding the tmp folder? We have a very large amount of data in that folder and are not sure whether to delete it, back it up, etc.

Latest Reply
Debayan
Databricks Employee
  • 4 kudos

/dbfs/tmp can contain a lot of files, including temporary system files used for intermediate calculations, as well as subdirectories that can contain packages from user-defined installations. It is always better to back up the files first.
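A hedged sketch of that back-up-first approach from a notebook (`dbutils` is predefined there; the backup location is a placeholder):

    backup_root = 'dbfs:/mnt/backups/tmp-archive'  # placeholder location

    # Inspect what is actually in dbfs:/tmp before touching anything.
    for f in dbutils.fs.ls('dbfs:/tmp'):
        print(f.path, f.size)

    # Recursive copy; delete only once the backup is verified.
    dbutils.fs.cp('dbfs:/tmp', backup_root, recurse=True)
    # dbutils.fs.rm('dbfs:/tmp/<subdir>', recurse=True)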

Akshith_Rajesh
by New Contributor III
  • 4520 Views
  • 3 replies
  • 6 kudos

Unable to write Data frame to Azure Synapse Table

When I am trying to insert records into the Azure Synapse table using JDBC, it throws the error below: com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'COMPANY_ADDRESS_STATE' has a data type that cannot participate ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 6 kudos

Columns that use any of the following data types cannot be included in a columnstore index: nvarchar(max), varchar(max), and varbinary(max) (applies to SQL Server 2016 and prior versions, and nonclustered columnstore indexes), so the issue is on the Azu...
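One workaround, assuming the write goes through the Azure Synapse connector: bound the string columns with maxStrLength so they map to NVARCHAR(n) instead of NVARCHAR(MAX), or create the table as a heap. A sketch with placeholder connection values:

    (df.write
        .format('com.databricks.spark.sqldw')
        .option('url', 'jdbc:sqlserver://<server>.database.windows.net:1433;'
                       'database=<db>')                     # placeholder
        .option('tempDir', 'abfss://<container>@<account>'
                           '.dfs.core.windows.net/tmp')     # placeholder
        .option('forwardSparkAzureStorageCredentials', 'true')
        .option('dbTable', 'dbo.company')                   # placeholder
        .option('maxStrLength', '4000')   # NVARCHAR(4000), columnstore-safe
        # .option('tableOptions', 'heap') # alternative: skip the columnstore
        .mode('append')
        .save())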

2 More Replies
him
by New Contributor III
  • 1181 Views
  • 1 reply
  • 3 kudos
Latest Reply
Debayan
Databricks Employee
  • 3 kudos

You can try referring to the example below: https://docs.databricks.com/dev-tools/api/latest/examples.html#upload-a-big-file-into-dbfs
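The gist of that docs example, as a hedged Python sketch (host and token are placeholders): the file is streamed up in base64 chunks through the dbfs/create, dbfs/add-block, and dbfs/close endpoints.

    import base64
    import requests

    HOST = 'https://<your-workspace>.cloud.databricks.com'  # placeholder
    HEADERS = {'Authorization': 'Bearer <token>'}           # placeholder

    def dbfs_upload(local_path, dbfs_path, chunk=1024 * 1024):
        # Open a streaming handle, append 1 MB base64 blocks, then close.
        r = requests.post(f'{HOST}/api/2.0/dbfs/create', headers=HEADERS,
                          json={'path': dbfs_path, 'overwrite': True})
        handle = r.json()['handle']
        with open(local_path, 'rb') as f:
            while block := f.read(chunk):
                requests.post(f'{HOST}/api/2.0/dbfs/add-block', headers=HEADERS,
                              json={'handle': handle,
                                    'data': base64.b64encode(block).decode()})
        requests.post(f'{HOST}/api/2.0/dbfs/close', headers=HEADERS,
                      json={'handle': handle})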

Bharath_1610
by New Contributor
  • 1716 Views
  • 2 replies
  • 1 kudos

Resolved! Check Existence of table

Hi Team, how do we check the existence of a table in an ADF container using a SQL query in Databricks? Thanks in advance.

Latest Reply
Noopur_Nigam
Databricks Employee
  • 1 kudos

Hi, please elaborate on the issue so we can help you resolve it.
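If the question is simply how to test whether a table exists from a notebook, two common options (database and table names are placeholders; tableExists needs PySpark 3.3+ / a recent runtime):

    # Catalog API.
    exists = spark.catalog.tableExists('mydb.mytable')

    # Pure SQL alternative: an empty result means the table is absent.
    exists_sql = spark.sql(
        "SHOW TABLES IN mydb LIKE 'mytable'"
    ).count() > 0

    print(exists, exists_sql)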

1 More Reply
Mr__E
by Contributor II
  • 1285 Views
  • 1 reply
  • 3 kudos

Sync prod WS DBs to dev WS DBs

We have a couple of sources we'd already set up to stream to prod using a 3p system. Is there a way to sync this directly to our dev workspace to build pipelines? E.g. directly connecting to a cluster in prod and pulling with a job cluster, dumping to S3 and u...

Latest Reply
Debayan
Databricks Employee
  • 3 kudos

DBFS can be used in many ways. Please refer below:
  • Allows you to interact with object storage using directory and file semantics instead of cloud-specific API commands.
  • Allows you to mount cloud object storage locations so that you can map storage cre...
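As one concrete variant of the mount option, a hedged sketch for making the prod landing bucket visible from the dev workspace (bucket and mount point are placeholders; credentials are omitted, and with an instance profile attached extra configs may not be needed):

    # Mount the prod bucket in the dev workspace, then read it like a path.
    dbutils.fs.mount(
        source='s3a://prod-stream-landing',   # placeholder bucket
        mount_point='/mnt/prod-landing',
    )
    display(dbutils.fs.ls('/mnt/prod-landing'))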

parthibsg
by New Contributor II
  • 1331 Views
  • 1 reply
  • 2 kudos

When to use Dataframes API over Spark SQL

Hello Experts, I am new to Databricks. Building data pipelines, I have both batch and streaming data. Should I use the DataFrames API to read CSV files, convert to Parquet format, and then do the transformations? Or write to a table using CSV and then use Spark SQL...

Latest Reply
Debayan
Databricks Employee
  • 2 kudos

Hi Rathinam, it would be better to understand the pipeline more in this situation. Writing to a table using CSV and then using Spark SQL will be faster in a few cases than the other approach.
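Worth noting that both styles compile down to the same query plan, so the choice is largely about readability. A sketch with a hypothetical input file:

    from pyspark.sql import functions as F

    df = (spark.read.option('header', 'true').option('inferSchema', 'true')
          .csv('/tmp/input.csv'))  # hypothetical path

    # DataFrame API version.
    out_df = df.filter(F.col('amount') > 100).groupBy('country').count()

    # Equivalent Spark SQL version over the same data.
    df.createOrReplaceTempView('sales')
    out_sql = spark.sql('''
        SELECT country, count(*) AS count
        FROM sales WHERE amount > 100 GROUP BY country
    ''')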

lokeshr
by New Contributor
  • 1170 Views
  • 2 replies
  • 1 kudos

Clarity on usage STREAM while defining DLT tables

Hi, I am currently trying to learn Databricks and going through tutorials and learning materials. I came across this link: https://databricks.com/discover/pages/getting-started-with-delta-live-tables While I get most of what is described on the page, I fin...

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Hi @Lokesh Raju, just a friendly follow-up. Did Tomasz's response help you resolve your question? If it did, please mark it as best.
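For later readers: the Python counterpart of SQL's STREAM() is dlt.read_stream(). dlt.read() recomputes the whole table on each update, while read_stream processes only new records incrementally. A hedged sketch (paths are placeholders):

    import dlt

    @dlt.table
    def raw_events():
        return (spark.readStream.format('cloudFiles')
                .option('cloudFiles.format', 'json')
                .load('/mnt/landing/events'))  # placeholder path

    @dlt.table
    def clean_events():
        # Incremental read of the table above, like STREAM(LIVE.raw_events).
        return dlt.read_stream('raw_events').where('event_id IS NOT NULL')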

1 More Reply
yatharthmahesh
by New Contributor III
  • 2662 Views
  • 3 replies
  • 6 kudos

ENABLE CHANGE DATA FEED FOR EXISTING DELTA-TABLE

I have a Delta table already created, and now I want to enable the change data feed. I read that I have to set the delta.enableChangeDataFeed property to true. However, this cannot be done using the Scala API. I tried using this but it didn't work. I am ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 6 kudos

'delta.enableChangeDataFeed' has to be without quotes: spark.sql("ALTER TABLE delta_training.onaudience_dpm SET TBLPROPERTIES (delta.enableChangeDataFeed = true)").show()
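Once the property is set, the feed can be read back. A hedged sketch using the table name from the thread (the starting version is illustrative, and CDF only covers commits made after it was enabled):

    spark.sql("""
        ALTER TABLE delta_training.onaudience_dpm
        SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)

    changes = (spark.read.format('delta')
               .option('readChangeFeed', 'true')
               .option('startingVersion', 5)   # illustrative version
               .table('delta_training.onaudience_dpm'))
    changes.select('_change_type', '_commit_version').show()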

2 More Replies
KumarShiv
by New Contributor III
  • 1846 Views
  • 2 replies
  • 2 kudos

Resolved! Databricks Spark SQL function "PERCENTILE_DISC()" output not accurate.

I am trying to get the percentile values on different splits, but I found that the result of the Databricks PERCENTILE_DISC() function is not accurate. I have run the same query on MS SQL but get a different result set. Here are both result sets for PySpark ...

Latest Reply
artsheiko
Databricks Employee
  • 2 kudos

The reason might be that in MS SQL, PERCENTILE_DISC is nondeterministic.
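A small check that makes the semantics visible, runnable on a recent runtime (Spark 3.3+ supports WITHIN GROUP): PERCENTILE_DISC returns an actual value from the column, the smallest one whose cumulative distribution reaches the fraction, and tie ordering is engine-specific, so exact matches with MS SQL are not guaranteed.

    spark.createDataFrame(
        [('a', 10), ('a', 20), ('a', 30), ('a', 40)], ['grp', 'v']
    ).createOrReplaceTempView('t')

    spark.sql("""
        SELECT grp, percentile_disc(0.5) WITHIN GROUP (ORDER BY v) AS p50
        FROM t GROUP BY grp
    """).show()
    # p50 = 20: the first value whose cumulative distribution >= 0.5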

1 More Reply
Trung
by Contributor
  • 3135 Views
  • 5 replies
  • 5 kudos

Job fail due to Access Denied

Please help me solve the problem that my Databricks account cannot start the job, whether triggered manually or on a schedule, although I can run the script without error.

Latest Reply
Vivian_Wilfred
Databricks Employee
  • 5 kudos

Hi @trung nguyen, please check if you have the necessary instance profile attached to the job cluster. You are definitely missing something related to IAM.

4 More Replies
Anonymous
by Not applicable
  • 1447 Views
  • 4 replies
  • 4 kudos

Invalid shard address

I'm running pyspark through databricks-connect and getting an error saying: ```ERROR SparkClientManager: Fail to get the SparkClient java.util.concurrent.ExecutionException: com.databricks.service.SparkServiceConnectionException: Invalid shard address:`...

Latest Reply
Prabakar
Databricks Employee
  • 4 kudos

Hi @Marco Wong, was this working before and failing now? Are you behind a VPN or firewall? If so, can you check by disabling it? Enable traces in Wireshark and collect a dump to check whether there is traffic going to the workspace. Check if you can get curl wor...
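A quick reachability check before packet traces, as a hedged sketch: hit a simple REST endpoint with the same host that databricks-connect has configured as its shard address (host and token are placeholders).

    import requests

    # Must match the shard address in your databricks-connect config,
    # including the https:// scheme.
    HOST = 'https://<your-workspace>.cloud.databricks.com'  # placeholder

    r = requests.get(f'{HOST}/api/2.0/clusters/list',
                     headers={'Authorization': 'Bearer <token>'},
                     timeout=10)
    print(r.status_code)  # 200 means the workspace answers with this token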

3 More Replies

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group