Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

brickster_2018
by Esteemed Contributor
  • 795 Views
  • 1 replies
  • 1 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 1 kudos

val oldestVersionAvailable =
val newestVersionAvailable =
val pathToDeltaTable = ""
val pathToFileName = ""
(oldestVersionAvailable to newestVersionAvailable).map { version =>
  var df1 = spark.read.json(f"$pathToDeltaTable/_delta_log/$version%0...

  • 1 kudos
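A minimal PySpark sketch of the same idea as the snippet above: read each Delta transaction log entry and list the files it added. The table path and version range are placeholders, and spark/display are assumed to be available in a Databricks notebook.

# Placeholder values; substitute your own Delta table path and version range.
pathToDeltaTable = "/mnt/my-delta-table"
oldestVersion = 0
newestVersion = 5

for version in range(oldestVersion, newestVersion + 1):
    # Delta log files are zero-padded to 20 digits, e.g. 00000000000000000000.json
    log_file = f"{pathToDeltaTable}/_delta_log/{version:020d}.json"
    entries = spark.read.json(log_file)
    display(entries.where("add is not null").select("add.path"))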
User16826992666
by Valued Contributor
  • 1988 Views
  • 1 replies
  • 1 kudos

Trying to write my dataframe out as a tab-separated .txt file but getting an error

When I try to save my file I get org.apache.spark.sql.AnalysisException: Text data source supports only a single column, and you have 2 columns. Is there any way to save a dataframe with more than one column to a .txt file?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

Would pyspark.sql.DataFrameWriter.csv work? You could specify the separator (sep) as tab: df.write.csv(os.path.join(tempfile.mkdtemp(), 'data'), sep='\t')

  • 1 kudos
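A hedged example of the suggestion above: writing a multi-column DataFrame as a tab-separated file with the CSV writer. The DataFrame name and output path are placeholders.

# "df" and the output path are placeholders.
(df.write
   .option("header", "true")
   .option("sep", "\t")              # tab as the field separator
   .csv("/tmp/tab_separated_output"))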
brickster_2018
by Esteemed Contributor
  • 1024 Views
  • 1 replies
  • 1 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 1 kudos

%scala
display(
  spark.read.json("//path-to-delta-table/_delta_log/0000000000000000000x.json")
    .where("add is not null")
    .select("add.path")
)

  • 1 kudos
jason_mcdonald
by New Contributor
  • 987 Views
  • 2 replies
  • 0 kudos

Is there a way to set DBU or cost limits so I don't get an unexpected bill?

I'm wondering if there's a way to set a monthly budget and have my workloads stop running if I hit it.

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Cluster Policies would help with this, not only from a cost management perspective but also for standardization of resources across the organization, as well as simplification for a better user experience. You can find Best Practices on leveraging cluster pol...

  • 0 kudos
1 More Replies
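One possible shape of a cluster policy definition that caps cost, shown here as a Python dict for illustration only; the attribute names and limits are assumptions drawn from the cluster policy documentation, not from the reply above.

# Illustrative policy definition; it would be supplied as JSON when creating a cluster policy.
policy_definition = {
    # Virtual attribute that caps the hourly DBU cost of clusters created under this policy.
    "dbus_per_hour": {"type": "range", "maxValue": 10},
    # Force idle clusters to terminate so they stop accruing cost.
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": False},
    # Bound the cluster size.
    "num_workers": {"type": "range", "maxValue": 8},
}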
User16826992666
by Valued Contributor
  • 1170 Views
  • 1 replies
  • 0 kudos

What is the default location where dataframes are written if I don't specify a location?

If I save a dataframe without specifying a location, where will it end up?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

You can't save a dataframe without specifying a location. If you are using the saveAsTable API, the table will be created in the Hive warehouse location. The default location is /user/hive/warehouse.

  • 0 kudos
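A minimal sketch illustrating the point above: a managed table saved without a path lands under the warehouse directory. The DataFrame and table name are placeholders, and DESCRIBE DETAIL assumes the default Delta table format on Databricks.

# "df" and the table name are placeholders.
df.write.saveAsTable("my_demo_table")
print(spark.conf.get("spark.sql.warehouse.dir"))                         # warehouse root
display(spark.sql("DESCRIBE DETAIL my_demo_table").select("location"))   # resolved table location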
User16826992666
by Valued Contributor
  • 872 Views
  • 1 replies
  • 0 kudos

Why would I make a deep clone of a Delta table vs reading the table and writing a copy to a new location?

It seems like with both techniques I would end up with a copy of my table. Trying to understand when I should be using a deep clone.

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

A deep clone is the recommended way, as it retains the history of the table. Also, a deep clone is faster than the read-write approach.

  • 0 kudos
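A short example of the deep clone approach; the table names are placeholders.

# Copies data files and table metadata in one operation; names are placeholders.
spark.sql("CREATE TABLE IF NOT EXISTS my_table_clone DEEP CLONE my_source_table")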
User16826992666
by Valued Contributor
  • 945 Views
  • 1 replies
  • 0 kudos

How can I run OPTIMIZE on a table if I am streaming to it 24/7?

I have a table that I need to be continuously streaming into. I know it's best practice to run Optimize on my tables periodically. But if I never stop writing to the table, how and when can I run OPTIMIZE against it?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

If the streaming job is making blind appends to the Delta table, then it's perfectly fine to run the OPTIMIZE query in parallel. However, if the streaming job is performing MERGE or UPDATE, then it can conflict with the OPTIMIZE operations. In such cases w...

  • 0 kudos
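A sketch of the blind-append pattern described above, using a toy rate source; the paths and checkpoint location are placeholders. Because the stream only appends, a separately scheduled OPTIMIZE should not conflict with it.

# Blind-append stream (no MERGE/UPDATE); placeholders throughout.
(spark.readStream
     .format("rate")                                   # toy source for illustration
     .load()
     .writeStream
     .format("delta")
     .option("checkpointLocation", "/tmp/chk/events")
     .outputMode("append")
     .start("/tmp/delta/events"))

# Run periodically from a separate job:
spark.sql("OPTIMIZE delta.`/tmp/delta/events`")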
Anonymous
by Not applicable
  • 1079 Views
  • 1 replies
  • 0 kudos

DBFS Permissions

Is there permission control at the folder/file level in DBFS? e.g., if a team member uploads a file to /Filestore/Tables/TestData/testfile, could we mask permissions on TestData and/or testfile?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

DBFS does not have ACLs at this point.

  • 0 kudos
User16826987838
by Contributor
  • 889 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

delta.logRetentionDuration - 30 days
delta.deletedFileRetentionDuration - 7 days

  • 0 kudos
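A hedged example of overriding those two retention properties on a specific table; the table name and intervals are illustrative, not from the reply above.

# Override retention on one table and confirm the properties took effect.
spark.sql("""
    ALTER TABLE my_delta_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")
display(spark.sql("SHOW TBLPROPERTIES my_delta_table"))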
brickster_2018
by Esteemed Contributor
  • 558 Views
  • 1 replies
  • 0 kudos

Resolved! Best practices for DStream application in Databricks

I do not see any best practice guide for DStream applications in the Databricks docs. Any references?

Latest Reply
brickster_2018
Esteemed Contributor
  • 0 kudos

DStream is unsupported by Databricks. Databricks strongly recommends migrating DStream applications to Structured Streaming: https://kb.databricks.com/streaming/dstream-not-supported.html

  • 0 kudos
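A minimal Structured Streaming sketch of the kind of job a DStream application would migrate to; the source format, schema, and paths are all placeholders.

# File-based streaming read; an explicit schema is required for file sources.
events = (spark.readStream
              .format("json")
              .schema("id STRING, ts TIMESTAMP")
              .load("/mnt/landing/events"))

# Append the stream to a Delta table with a checkpoint for progress tracking.
(events.writeStream
       .format("delta")
       .option("checkpointLocation", "/mnt/checkpoints/events")
       .outputMode("append")
       .start("/mnt/delta/events"))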
brickster_2018
by Esteemed Contributor
  • 607 Views
  • 1 replies
  • 1 kudos

Optimize Command not performing the bin packing

I have a daily OPTIMIZE job running; however, the number of files in storage is not going down. It looks like OPTIMIZE is not helping to reduce the file count.

Latest Reply
brickster_2018
Esteemed Contributor
  • 1 kudos

The files are not physically removed from storage by the OPTIMIZE command. A VACUUM command has to be executed to remove them.

  • 1 kudos
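A short follow-up example of the OPTIMIZE-then-VACUUM sequence described above; the table name is a placeholder.

spark.sql("OPTIMIZE my_delta_table")   # compacts small files into larger ones
spark.sql("VACUUM my_delta_table")     # physically removes unreferenced files older than the retention threshold (default 7 days)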
User16790091296
by Contributor II
  • 14826 Views
  • 1 replies
  • 0 kudos

How to run multiple spark streaming application on databricks cluster?

I started working on Databricks. I need to migrate a few streaming jobs from Ambari to Databricks. I deployed one job using a jar and it is working fine. But when I deploy the second job I get the error "multiple spark streaming context not allowed". ...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You can run multiple streaming applications on Databricks clusters. By default, these would run in the same fair scheduling pool. To enable multiple streaming queries to execute jobs concurrently and to share the cluster efficiently, you can set the q...

  • 0 kudos
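A sketch of the fair-scheduler-pool approach mentioned above: assign each streaming query its own pool so they share the cluster concurrently. Pool names, sources, and checkpoint paths are placeholders.

# Each query inherits the scheduler pool set on the thread that starts it.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool_stream_1")
q1 = (spark.readStream.format("rate").load()
          .writeStream.format("delta")
          .option("checkpointLocation", "/tmp/chk/stream1")
          .start("/tmp/delta/stream1"))

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool_stream_2")
q2 = (spark.readStream.format("rate").load()
          .writeStream.format("delta")
          .option("checkpointLocation", "/tmp/chk/stream2")
          .start("/tmp/delta/stream2"))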