Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

brickster_2018
by Databricks Employee
  • 1636 Views
  • 1 replies
  • 0 kudos

Resolved! Performance improvement after running VACUUM commands

How often should I run VACUUM commands? Will running the VACUUM command on a Delta table improve my read/write performance, or are the benefits limited to storage?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

VACUUM removes uncommitted/stale files from storage. The primary benefit is lower storage cost. Ideally, running VACUUM should not show any performance improvement, as Delta does not list the storage directories but rather accesses the files di...

  • 0 kudos
cgrant
by Databricks Employee
  • 2166 Views
  • 1 replies
  • 1 kudos

Does running OPTIMIZE on a Delta table destroy the transaction history of the table?

If I run OPTIMIZE on a Delta Lake table, will it prevent me from time travelling to a version before OPTIMIZE was run?

Latest Reply
cgrant
Databricks Employee
  • 1 kudos

No, you will still be able to time travel to versions previous to the OPTIMIZE command. OPTIMIZE is just another transaction like MERGE, UPDATE, etc. Check out these docs to learn more about retention periods and the VACUUM command.
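As a rough illustration of the reply above, the sketch below builds the SQL for inspecting a table's history and for reading a version committed before OPTIMIZE. The helper names and the `spark` argument (an active SparkSession) are assumptions made for this example, and time travel only works while the older data files are still within the retention period.

```python
def history_sql(table: str) -> str:
    """SQL that lists the table's commits, including any OPTIMIZE operations."""
    return f"DESCRIBE HISTORY {table}"


def time_travel_sql(table: str, version: int) -> str:
    """SQL that reads the table as it was at an earlier committed version."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"


def read_before_optimize(spark, table: str, optimize_version: int):
    """Return a DataFrame of the table one version before the OPTIMIZE commit.

    `spark` is assumed to be an active SparkSession; `optimize_version` would
    be taken from the `version` column of the DESCRIBE HISTORY output.
    """
    return spark.sql(time_travel_sql(table, optimize_version - 1))
```

For example, if DESCRIBE HISTORY shows OPTIMIZE as version 4 of a hypothetical table `events`, `read_before_optimize(spark, "events", 4)` would query version 3.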

  • 1 kudos
User16826987838
by Databricks Employee
  • 2590 Views
  • 1 replies
  • 0 kudos

How do I find the users in workspaces?

Looking to pull a list of all the users in their workspaces (including the ones who have never done anything). Is there a way to do that? This is for AWS.

Latest Reply
Ryan_Chynoweth
Databricks Employee
  • 0 kudos

You can use the SCIM APIs. Endpoint: https://docs.databricks.com/dev-tools/api/latest/scim/scim-users.html#get-users Or you can use the Workspace API. The Workspace API does not have a direct list users command, but you can use the Workspace API to l...
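A minimal sketch of the SCIM route, using only the Python standard library. The endpoint path `/api/2.0/preview/scim/v2/Users` is taken from my reading of the linked docs and should be verified there; the workspace URL, the token, and the function name are placeholders invented for this example.

```python
import urllib.request


def build_list_users_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the SCIM Users endpoint.

    `workspace_url` and `token` are placeholders: the workspace base URL and
    a personal access token with admin rights.
    """
    # Assumed path of the SCIM "get users" endpoint; check the linked docs.
    url = workspace_url.rstrip("/") + "/api/2.0/preview/scim/v2/Users"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/scim+json",
        },
        method="GET",
    )


# Sending the request (not executed here):
#   import json
#   with urllib.request.urlopen(req) as resp:
#       users = json.load(resp).get("Resources", [])
```

Building the request separately from sending it keeps the example checkable without network access.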

  • 0 kudos
User16783853906
by Databricks Employee
  • 15013 Views
  • 2 replies
  • 4 kudos

Resolved! Max Columns for Delta table

Is there an upper limit or recommended maximum for the number of columns in a Delta table?

Latest Reply
User16783853906
Databricks Employee
  • 4 kudos

Original answer posted by @Gray Gwizdz: This was a fun question to try and find the answer to! Thank you for that. I reviewed some of the most recent issues/bugs reported with Delta Lake and was able to find a similar issue where a user was running i...

  • 4 kudos
1 More Replies
User16783853501
by Databricks Employee
  • 2224 Views
  • 0 replies
  • 1 kudos

Databricks Autoloader Best practice

Databricks Autoloader is a popular mechanism for ingesting data/files from cloud storage into Delta. For a very high-throughput source, what are the best practices to follow while scaling up an Autoloader-based pipeline to the tune of millions ...

User16783853906
by Databricks Employee
  • 2537 Views
  • 3 replies
  • 0 kudos

Resolved! How to reuse Pandas code in PySpark?

I have single-threaded Pandas code that is neither supported by Koalas yet nor easy to reimplement in PySpark. I would like to distribute this workload using Spark without rewriting all my Pandas code. Is this possible?

Latest Reply
User16783853906
Databricks Employee
  • 0 kudos

This is for a specific scenario where the code is not yet supported by Koalas. One approach to consider is using a Pandas UDF, and splitting up the work in a way that allows your processing to move forward. This notebook is a great example of taking ...
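One way to picture the approach described above is the grouped-map form of the pandas function APIs, `applyInPandas`. This is my choice of variant for illustration; the truncated reply and its notebook may use a different one. The column names (`key`, `value`), the output schema, and the function names are hypothetical.

```python
# Schema of the DataFrame returned by the pandas function, as a DDL string.
# The column names here are made up for the example.
OUTPUT_SCHEMA = "key string, value double, running_total double"


def existing_pandas_logic(pdf):
    """Stand-in for the existing single-threaded pandas code.

    Spark calls this once per group, passing that group's rows as a pandas
    DataFrame; it must return a pandas DataFrame matching OUTPUT_SCHEMA.
    """
    pdf = pdf.sort_values("value")
    return pdf.assign(running_total=pdf["value"].cumsum())


def run_per_group(df):
    """Apply the pandas function to each `key` group of a Spark DataFrame.

    Groups are processed independently, so Spark can spread them across
    executors without the pandas code being rewritten.
    """
    return df.groupBy("key").applyInPandas(existing_pandas_logic, schema=OUTPUT_SCHEMA)
```

This only helps when the work can be split into groups that each fit in the memory of a single executor.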

  • 0 kudos
2 More Replies
User16783853906
by Databricks Employee
  • 8591 Views
  • 2 replies
  • 0 kudos

Trigger.Once mode recommendation

When is it recommended to use Trigger.Once mode compared to fixed processing intervals with micro-batches?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

Also note, the configurations like maxFilesPerTrigger, maxBytesPerTrigger are ignored with Trigger.Once. Streaming queries with significantly less throughput can switch to Trigger.Once to avoid the continuous execution of the job checking the availab...
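To make the two modes concrete, here is a hedged PySpark sketch that switches between a run-once trigger and a fixed processing interval. The helper names, the `run_as_scheduled_batch` flag, the default interval, and the paths are assumptions for illustration; `stream_df` is assumed to be a streaming DataFrame.

```python
def trigger_options(run_as_scheduled_batch: bool, interval: str = "1 minute") -> dict:
    """Keyword arguments for DataStreamWriter.trigger().

    once=True processes the available data and then stops, which suits
    low-throughput sources run from a scheduled job; processingTime keeps
    the query running and starts a micro-batch at each interval.
    """
    if run_as_scheduled_batch:
        return {"once": True}
    return {"processingTime": interval}


def start_query(stream_df, checkpoint_path: str, target_path: str,
                run_as_scheduled_batch: bool):
    """Start a streaming write to a Delta path with the chosen trigger."""
    return (
        stream_df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(**trigger_options(run_as_scheduled_batch))
        .start(target_path)
    )
```

As the reply notes, rate limits such as maxFilesPerTrigger are ignored in the run-once case, so the single batch can be large.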

  • 0 kudos
1 More Replies
User16783853906
by Databricks Employee
  • 3679 Views
  • 2 replies
  • 0 kudos

VACUUM during read/write

Is it safe to run VACUUM on a Delta Lake table while data is being added to it at the same time?  Will it impact the job result/performance?

Latest Reply
User16783853906
Databricks Employee
  • 0 kudos

In the vast majority of cases, yes, it is safe to run VACUUM while data is concurrently being appended or updated to the same table. This is because VACUUM deletes data files no longer referenced by a Delta table's transaction log and does not effect...

  • 0 kudos
1 More Replies
User16783853906
by Databricks Employee
  • 3967 Views
  • 2 replies
  • 0 kudos

How does running VACUUM on Delta Lake tables affect read/write performance?

If I don't run VACUUM on a Delta Lake table, will that make my read performance slower?

Latest Reply
User16783853906
Databricks Employee
  • 0 kudos

VACUUM has no effect on read/write performance to that table. Never running VACUUM on a table will not make read/write performance to a Delta Lake table any slower. If you run VACUUM very infrequently, your VACUUM runtimes themselves may be pretty hig...
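For running VACUUM on a regular schedule, a small helper like the one below could build the statement. It is a sketch: the function name and the `spark` session mentioned in the comment are assumptions, while the 7-day figure is Delta's default retention threshold.

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # Delta's default retention threshold (7 days)


def vacuum_sql(table: str, retain_hours: int = DEFAULT_RETENTION_HOURS,
               dry_run: bool = False) -> str:
    """Build a VACUUM statement for a Delta table.

    With dry_run=True the statement only reports the files that would be
    deleted, which is a cheap check before the first real run.
    """
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    statement = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    if dry_run:
        statement += " DRY RUN"
    return statement


# In a scheduled job (not executed here), with an active SparkSession `spark`:
#   spark.sql(vacuum_sql("events"))
```

Scheduling this daily or weekly keeps each run small, since fewer stale files accumulate between runs.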

  • 0 kudos
1 More Replies
User16783855534
by Databricks Employee
  • 1874 Views
  • 1 replies
  • 1 kudos

Can I have a Databricks Cluster that is only 1 node?

Yes you can create a "Single Node" Cluster, https://docs.databricks.com/clusters/single-node.html . It is currently not recommended to use "Single Node" cluster for streaming workloads

Latest Reply
brickster_2018
Databricks Employee
  • 1 kudos

Single Node clusters should not be used for production workloads involving streaming queries, or complex computations. The intention here is to bring up the Spark cluster for all kinds of workloads

  • 1 kudos
User16826987838
by Databricks Employee
  • 1691 Views
  • 1 replies
  • 1 kudos
Latest Reply
User16783855534
Databricks Employee
  • 1 kudos

https://docs.databricks.com/dev-tools/api/latest/scim/scim-users.html#create-user

  • 1 kudos
brickster_2018
by Databricks Employee
  • 2861 Views
  • 1 replies
  • 0 kudos

Resolved! What is the difference between spark.sessionState.catalog.listTables vs spark.catalog.listTables

I see a significant performance difference when calling spark.sessionState.catalog.listTables compared to spark.catalog.listTables. Is that expected?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

spark.sessionState.catalog.listTables is a lazier implementation: it does not pull the column details when listing the tables, hence it's faster. spark.catalog.listTables, by contrast, pulls the column details as well. If the database has many Delta tabl...

  • 0 kudos
brickster_2018
by Databricks Employee
  • 6564 Views
  • 1 replies
  • 0 kudos

Resolved! How to list all Delta tables in a Database?

I want to get a list of all the Delta tables in a database. What is the easiest way to get it?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The code snippet below can be used to list the tables in a database:
val db = "database_name"
spark.sessionState.catalog.listTables(db).map(table=>spark.sessionState.catalog.externalCatalog.getTable(table.database.get,table.table)).filter(x=>x....

  • 0 kudos
