Data Engineering

Forum Posts

by User16826992666 • Valued Contributor

06-25-2021 2:10:37 PM

1762 Views
1 replies
1 kudos

Trying to write my dataframe out as a tab separated .txt file but getting an error

When I try to save my file I get org.apache.spark.sql.AnalysisException: Text data source supports only a single column, and you have 2 columns.; Is there any way to save a dataframe with more than one column to a .txt file?

Latest Reply

sajith_appukutt
Honored Contributor II

06-25-2021 2:45:01 PM

1 kudos

Would pyspark.sql.DataFrameWriter.csv work? You could specify the separator (sep) as tab: df.write.csv(os.path.join(tempfile.mkdtemp(), 'data'), sep='\t')
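
A minimal sketch of that suggestion, assuming df is an existing PySpark DataFrame. The helper name write_tsv and the output path are hypothetical; sep and header are options of DataFrameWriter.csv. Note that Spark writes a directory of part files at the given path rather than a single .txt file.

```python
def write_tsv(df, path):
    """Write a multi-column DataFrame as tab-separated text.

    `df` is assumed to be a pyspark.sql.DataFrame; `path` is a hypothetical
    output directory. Spark creates a directory of part files at `path`.
    """
    (df.write
       .mode("overwrite")                   # replace any previous output at `path`
       .csv(path, sep="\t", header=True))   # tab as the column separator


# Example call inside a notebook (the path is a placeholder):
# write_tsv(df, "/tmp/my_table_tsv")
```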

by User16869510359 • Esteemed Contributor

06-25-2021 2:43:34 PM

844 Views
1 replies
1 kudos

Resolved! How to find what files were added in a specific version of Delta Table

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 2:44:27 PM

1 kudos

%scala
display(
  spark.read.json("//path-to-delta-table/_delta_log/0000000000000000000x.json")
    .where("add is not null")
    .select("add.path"))
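
The same idea can be sketched without Spark, assuming the commit file can be read from a local filesystem path. Each line of a _delta_log commit JSON file is one action, and file additions appear under the add key. The function name added_files is hypothetical.

```python
import json


def added_files(commit_json_path):
    """Return the data-file paths added by one Delta commit.

    `commit_json_path` points at a single commit file inside _delta_log,
    which holds one JSON action per line.
    """
    paths = []
    with open(commit_json_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            action = json.loads(line)
            add = action.get("add")   # None for other actions such as commitInfo or remove
            if add is not None:
                paths.append(add["path"])
    return paths
```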

by User16869510359 • Esteemed Contributor

06-25-2021 2:41:22 PM

1331 Views
1 replies
1 kudos

Resolved! How to find what records were added between the 2 versions of a Delta Table

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 2:41:34 PM

1 kudos

SELECT * FROM delta_table_name@v2 EXCEPT ALL SELECT * FROM delta_table_name@v0
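
EXCEPT ALL is a multiset difference: duplicate rows are subtracted by count rather than collapsed. A small plain-Python illustration of that semantics, using made-up rows in place of the two table versions (the function name except_all is hypothetical):

```python
from collections import Counter


def except_all(left_rows, right_rows):
    """Multiset difference: each row in `right_rows` cancels one matching row in `left_rows`."""
    remaining = Counter(left_rows) - Counter(right_rows)   # non-positive counts are dropped
    return sorted(remaining.elements())


# Made-up rows standing in for version 2 and version 0 of a table.
v2 = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
v0 = [(1, "a"), (2, "b")]
print(except_all(v2, v0))  # [(2, 'b'), (3, 'c')]
```

One copy of (2, "b") survives because version 2 holds it twice and version 0 only once.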

by jason_mcdonald • New Contributor

06-24-2021 3:46:43 PM

757 Views
2 replies
0 kudos

Is there a way to set DBU or cost limits so I don't get an unexpected bill?

I'm wondering if there's a way to set a monthly budget and have my workloads stop running if I hit it.

Latest Reply

aladda
Honored Contributor II

06-25-2021 2:35:30 PM

0 kudos

Cluster Policies would help with this, not only from a cost management perspective but also for standardization of resources across the organization, as well as simplification for a better user experience. You can find Best Practices on leveraging cluster pol...

by User16826992666 • Valued Contributor

06-25-2021 9:05:30 AM

957 Views
1 replies
0 kudos

What is the default location where dataframes are written if I don't specify a location?

If I save a dataframe without specifying a location, where will it end up?

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 2:30:38 PM

0 kudos

You can't save a dataframe without specifying a location. If you are using the saveAsTable API, the table will be created in the Hive warehouse location. The default location is /user/hive/warehouse.

by User16826992666 • Valued Contributor

06-25-2021 9:11:58 AM

676 Views
1 replies
0 kudos

Why would I make a deep clone of a Delta table vs reading the table and writing a copy to a new location?

It seems like with both techniques I would end up with a copy of my table. Trying to understand when I should be using a deep clone.

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 2:29:24 PM

0 kudos

A deep clone is the recommended way, as it holds the history of the table. Also, a deep clone is faster than the read-write approach.

by User16826992666 • Valued Contributor

06-25-2021 9:15:20 AM

782 Views
1 replies
0 kudos

How can I run OPTIMIZE on a table if I am streaming to it 24/7?

I have a table that I need to be continuously streaming into. I know it's best practice to run Optimize on my tables periodically. But if I never stop writing to the table, how and when can I run OPTIMIZE against it?

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 2:28:20 PM

0 kudos

If the streaming job is making blind appends to the Delta table, then it's perfectly fine to run an OPTIMIZE query in parallel. However, if the streaming job is performing MERGE or UPDATE, then it can conflict with the OPTIMIZE operations. In such cases w...

by Anonymous • Not applicable

06-25-2021 1:56:40 PM

950 Views
1 replies
0 kudos

DBFS Permissions

Is there permission control at the folder/file level in DBFS? E.g., if a team member uploads a file to /Filestore/Tables/TestData/testfile, could we mask permissions on TestData and/or testfile?

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 2:23:59 PM

0 kudos

DBFS does not have ACLs at this point.

by User16826987838 • Contributor

06-25-2021 1:57:13 PM

757 Views
1 replies
0 kudos

I am looking for default values for the following two properties: delta.logRetentionDuration and delta.deletedFileRetentionDuration. Any insight?

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 2:23:38 PM

0 kudos

delta.logRetentionDuration - 30 days
delta.deletedFileRetentionDuration - 7 days
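
If these defaults need to be changed for a table, both properties can be set as table properties. A minimal sketch that only builds the SQL text; the function name retention_statement and the table name my_table are hypothetical, and the default arguments mirror the values quoted above.

```python
def retention_statement(table,
                        log_retention="interval 30 days",
                        deleted_file_retention="interval 7 days"):
    """Build an ALTER TABLE statement that sets both Delta retention properties."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = '{log_retention}', "
        f"'delta.deletedFileRetentionDuration' = '{deleted_file_retention}')"
    )


print(retention_statement("my_table"))
# In a notebook, the statement could be run with: spark.sql(retention_statement("my_table"))
```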

by User16826987838 • Contributor

06-25-2021 2:04:28 PM

561 Views
1 replies
0 kudos

How do I programmatically retrieve, in Python, the name of the currently attached cluster in a Databricks notebook?

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 2:22:21 PM

0 kudos

Using cluster tags, we can get the cluster name: spark.conf.get("spark.databricks.clusterUsageTags.clusterName")
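
A small wrapper around that call, assuming spark is the notebook's SparkSession. The helper name current_cluster_name and the fallback value are hypothetical; the second argument to spark.conf.get is returned when the key is not set, for example when the code runs outside Databricks.

```python
CLUSTER_NAME_KEY = "spark.databricks.clusterUsageTags.clusterName"


def current_cluster_name(spark, default="unknown"):
    """Return the attached cluster's name, or `default` if the tag is not set."""
    return spark.conf.get(CLUSTER_NAME_KEY, default)


# Example call inside a notebook:
# print(current_cluster_name(spark))
```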

by User16869510359 • Esteemed Contributor

06-25-2021 1:56:46 PM

462 Views
1 replies
0 kudos

Resolved! Best practices for DStream application in Databricks

I do not see any best practice guide for DStream applications in the Databricks docs. Is there any reference?

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 1:57:56 PM

0 kudos

DStream is unsupported by Databricks. Databricks strongly recommends migrating DStream applications to Structured Streaming: https://kb.databricks.com/streaming/dstream-not-supported.html

by User16869510359 • Esteemed Contributor

06-25-2021 1:43:39 PM

531 Views
1 replies
1 kudos

Optimize Command not performing the bin packing

I have a daily OPTIMIZE job running; however, the number of files in storage is not going down. It looks like OPTIMIZE is not helping to reduce the files.

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 1:45:16 PM

1 kudos

The files are not physically removed from storage by the OPTIMIZE command. A VACUUM command has to be executed to remove them.
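
A minimal sketch of that two-step sequence, assuming spark is the notebook's SparkSession. The helper name compact_and_clean and the table name are hypothetical. The 168-hour retention window used here equals 7 days, and VACUUM only deletes unreferenced files older than that window, so the file count may not drop immediately after OPTIMIZE.

```python
def compact_and_clean(spark, table, retain_hours=168):
    """Run OPTIMIZE, then VACUUM, on a Delta table.

    OPTIMIZE rewrites small files into larger ones but leaves the old files
    in storage; VACUUM deletes files that are no longer referenced by the
    table and are older than the retention window.
    """
    spark.sql(f"OPTIMIZE {table}")
    spark.sql(f"VACUUM {table} RETAIN {retain_hours} HOURS")


# Example call inside a notebook (the table name is a placeholder):
# compact_and_clean(spark, "my_db.my_table")
```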

by Anonymous • Not applicable

06-24-2021 10:07:31 PM

515 Views
1 replies
0 kudos

How can I get access to the free self paced trainings at Databricks? What kind of trainings does it include?

Latest Reply

sajith_appukutt
Honored Contributor II

06-25-2021 1:40:57 PM

0 kudos

https://academy.databricks.com/category/self-paced ?

by User16790091296 • Contributor II

06-24-2021 8:13:46 AM

14586 Views
1 replies
0 kudos

How to run multiple Spark streaming applications on a Databricks cluster?

I started working on Databricks. I need to migrate a few streaming jobs from Ambari to Databricks. I deployed one job using a jar and it is working fine. But when I deployed the second job I got the error "multiple spark streaming context not allowed". ...

Latest Reply

sajith_appukutt
Honored Contributor II

06-25-2021 1:38:57 PM

0 kudos

You can run multiple streaming applications in databricks clusters. By default, this would run in the same fair scheduling pool. To enable multiple streaming queries to execute jobs concurrently and to share the cluster efficiently, you can set the q...
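
The reply is cut off above, so the following is only a sketch of the scheduler-pool approach it appears to describe. Spark lets a thread assign the jobs it starts to a named fair-scheduler pool through the spark.scheduler.pool local property, so each streaming query can be started under its own pool. The helper name start_in_pool, the pool names, and the query-starting callables are hypothetical.

```python
def start_in_pool(spark, pool_name, start_query):
    """Start one streaming query with its jobs assigned to `pool_name`.

    `start_query` is a zero-argument callable that builds the stream, calls
    `.start()` on its writeStream, and returns the resulting query handle.
    """
    # Thread-local setting: applies to work started from this thread afterwards.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    return start_query()


# Example (all names are placeholders):
# q1 = start_in_pool(spark, "pool1", lambda: df1.writeStream.format("delta").start(path1))
# q2 = start_in_pool(spark, "pool2", lambda: df2.writeStream.format("delta").start(path2))
```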

by MoJaMa • Valued Contributor II

06-25-2021 1:32:16 PM

478 Views
1 replies
0 kudos

Can my job have more than 1 owner? What if the original owner leaves the company?

Latest Reply

MoJaMa
Valued Contributor II

06-25-2021 1:33:32 PM

0 kudos

We still require a single user to be an owner. But you can set a group to have CAN_MANAGE which unblocks most of the necessary updates. It is released in all Premium workspaces that have Jobs ACLs. The official OWNER is whose identity is used to crea...
