Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

brickster_2018
by Databricks Employee
  • 2932 Views
  • 2 replies
  • 0 kudos

Resolved! Autoloader: How to identify the backlog in RocksDB

With S3-SQS it was easier to identify the backlog (the messages that have been fetched from SQS but not yet consumed by the streaming job). How do I find the same with Auto Loader?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

For DBR 8.2 and later, the backlog details are captured in the streaming metrics, e.g.:
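For illustration, a minimal sketch of reading those metrics, assuming `query` is a running StreamingQuery over a cloudFiles source; `numFilesOutstanding` and `numBytesOutstanding` are the backlog metrics Auto Loader reports in the progress event:

```python
# Inspect the most recent micro-batch progress of a running stream.
# Assumes `query` is a pyspark.sql.streaming.StreamingQuery backed by
# an Auto Loader (cloudFiles) source on DBR 8.2+.
progress = query.lastProgress  # dict for the latest progress, or None

if progress:
    for source in progress["sources"]:
        metrics = source.get("metrics", {})
        # Auto Loader surfaces its unprocessed backlog here:
        print("files outstanding:", metrics.get("numFilesOutstanding"))
        print("bytes outstanding:", metrics.get("numBytesOutstanding"))
```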

1 More Replies
cgrant
by Databricks Employee
  • 3401 Views
  • 1 reply
  • 0 kudos

Resolved! How to ensure that a Databricks Run Submit run invoked from Airflow only runs one time?

I am running jobs on Databricks using the Run Submit API with Airflow. I have noticed that, on rare occasions, a particular run is launched more than once at the same time. Why?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

Idempotency can be ensured by providing an idempotency token. It's easy to pass the token through the REST API, as described in this doc: https://kb.databricks.com/jobs/jobs-idempotency.html. The primary reason for multiple runs is that the client submits t...
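A hedged sketch of passing that token with the Runs Submit API; the workspace URL, access token, and task payload below are placeholders, and the key point is reusing the same `idempotency_token` when Airflow retries:

```python
import uuid

import requests

# Placeholders: substitute your workspace URL, PAT, and task definition.
payload = {
    "run_name": "airflow-triggered-run",
    # Reuse the SAME token on every retry of this logical run; Databricks
    # then returns the existing run instead of starting a duplicate.
    "idempotency_token": str(uuid.uuid4()),
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Shared/etl"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/runs/submit",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["run_id"])
```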

brickster_2018
by Databricks Employee
  • 1636 Views
  • 1 reply
  • 0 kudos

Resolved! Performance improvement after running VACUUM commands

How often should I run VACUUM commands? Will running the VACUUM command on a Delta table improve my read/write performance, or is the benefit just storage savings?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

VACUUM removes uncommitted/stale files from storage. The primary benefit is saving storage cost. Ideally, running VACUUM should not show any performance improvement, as Delta does not list the storage directories but rather accesses the files di...
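For reference, a minimal sketch of the command itself (the table name is a placeholder; DRY RUN previews which files would be deleted):

```python
# Preview the files VACUUM would remove, then reclaim storage.
# Default retention is 7 days (168 hours); table name is a placeholder.
spark.sql("VACUUM my_db.my_table DRY RUN").show(truncate=False)
spark.sql("VACUUM my_db.my_table RETAIN 168 HOURS")
```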

cgrant
by Databricks Employee
  • 2172 Views
  • 1 reply
  • 1 kudos

Does running OPTIMIZE on a Delta table destroy the transaction history of the table?

If I run OPTIMIZE on a Delta Lake table, will it prevent me from time travelling to a version before OPTIMIZE was run?

Latest Reply
cgrant
Databricks Employee
  • 1 kudos

No, you will still be able to time travel to versions previous to the OPTIMIZE command. OPTIMIZE is just another transaction like MERGE, UPDATE, etc. Check out these docs to learn more about retention periods and the VACUUM command.
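A quick sketch of checking this yourself (the table name and version number are placeholders):

```python
# OPTIMIZE commits a new table version; earlier versions stay queryable
# until VACUUM removes their underlying files.
spark.sql("DESCRIBE HISTORY my_db.my_table").show()  # note a pre-OPTIMIZE version
spark.sql("OPTIMIZE my_db.my_table")

# Time travel to a version committed before the OPTIMIZE ran:
df = spark.sql("SELECT * FROM my_db.my_table VERSION AS OF 5")
```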

User16826987838
by Databricks Employee
  • 2590 Views
  • 1 reply
  • 0 kudos

How do I find the users in a workspace?

Looking to pull a list of all the users in their workspaces (including the ones who have never done anything). Is there a way to do that? This is for AWS.

Latest Reply
Ryan_Chynoweth
Databricks Employee
  • 0 kudos

You can use the SCIM API. Endpoint: https://docs.databricks.com/dev-tools/api/latest/scim/scim-users.html#get-users. Or you can use the Workspace API. The Workspace API does not have a direct list-users command, but you can use it to l...
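A hedged sketch of the SCIM call (workspace URL and token are placeholders; `Resources` is the standard SCIM response field):

```python
import requests

resp = requests.get(
    "https://<workspace-url>/api/2.0/preview/scim/v2/Users",
    headers={"Authorization": "Bearer <personal-access-token>"},
)
resp.raise_for_status()

# Every user provisioned in the workspace, active or not.
for user in resp.json().get("Resources", []):
    print(user.get("userName"), user.get("displayName"))
```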

User16783853906
by Databricks Employee
  • 15014 Views
  • 2 replies
  • 4 kudos

Resolved! Max Columns for Delta table

Is there an upper limit or recommended maximum for the number of columns in a Delta table?

Latest Reply
User16783853906
Databricks Employee
  • 4 kudos

Original answer posted by @Gray Gwizdz. This was a fun question to try and find the answer to, thank you for that! I reviewed some of the most recent issues/bugs reported with Delta Lake and was able to find a similar issue where a user was running i...

1 More Replies
User16783853501
by Databricks Employee
  • 2225 Views
  • 0 replies
  • 1 kudos

Databricks Autoloader Best practice

Databricks Auto Loader is a popular mechanism for ingesting data/files from cloud storage into Delta. For a very high-throughput source, what are the best practices to follow while scaling up an Auto Loader based pipeline to the tune of millions ...
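As a starting point, a hedged sketch of such a pipeline; for very large file volumes the docs steer you toward file-notification mode (`cloudFiles.useNotifications`) instead of directory listing. Paths and the source format are placeholders:

```python
# Auto Loader stream using file notifications, which scale better than
# repeatedly listing a directory that contains millions of files.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .load("s3://my-bucket/landing/")
)

(
    df.writeStream.format("delta")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/landing")
    .start("s3://my-bucket/bronze/landing")
)
```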

User16783853906
by Databricks Employee
  • 2537 Views
  • 3 replies
  • 0 kudos

Resolved! How to reuse Pandas code in PySpark?

I have single-threaded Pandas code that is neither supported by Koalas yet nor easy to reimplement in PySpark. I would like to distribute this workload using Spark without rewriting all my Pandas code. Is this possible?

Latest Reply
User16783853906
Databricks Employee
  • 0 kudos

This is for a specific scenario where the code is not yet supported by Koalas. One approach to consider is using a Pandas UDF, and splitting up the work in a way that allows your processing to move forward. This notebook is a great example of taking ...
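A hedged sketch of that pattern using `applyInPandas` (the table, grouping key, and column names are placeholders): wrap the existing single-node logic in a function and let Spark apply it to each group in parallel.

```python
import pandas as pd

def my_pandas_logic(pdf: pd.DataFrame) -> pd.DataFrame:
    # Existing single-threaded Pandas code, applied to one group at a time.
    pdf["value"] = pdf["value"].rolling(3, min_periods=1).mean()
    return pdf

result = (
    spark.table("my_db.events")
    .select("device_id", "value")
    .groupBy("device_id")
    .applyInPandas(my_pandas_logic, schema="device_id string, value double")
)
```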

2 More Replies
User16783853906
by Databricks Employee
  • 8594 Views
  • 2 replies
  • 0 kudos

Trigger.once mode recommendation

When is it recommended to use Trigger.once mode compared to fixed processing intervals with micro batches?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

Also note that configurations like maxFilesPerTrigger and maxBytesPerTrigger are ignored with Trigger.Once. Streaming queries with significantly lower throughput can switch to Trigger.Once to avoid continuously running a job that checks the availab...
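A minimal sketch of a stream run this way (paths are placeholders):

```python
# Low-throughput stream run as a scheduled batch: Trigger.Once processes
# everything available and then stops, so no cluster sits idle polling.
# Note: rate limits like maxFilesPerTrigger are ignored in this mode.
(
    spark.readStream.format("delta")
    .load("/delta/source")
    .writeStream.format("delta")
    .trigger(once=True)
    .option("checkpointLocation", "/delta/_checkpoints/sink")
    .start("/delta/sink")
)
```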

1 More Replies
User16783853906
by Databricks Employee
  • 3679 Views
  • 2 replies
  • 0 kudos

VACUUM during read/write

Is it safe to run VACUUM on a Delta Lake table while data is being added to it at the same time?  Will it impact the job result/performance?

Latest Reply
User16783853906
Databricks Employee
  • 0 kudos

In the vast majority of cases, yes, it is safe to run VACUUM while data is concurrently being appended or updated to the same table. This is because VACUUM deletes data files no longer referenced by a Delta table's transaction log and does not affect...

1 More Replies
User16783853906
by Databricks Employee
  • 3968 Views
  • 2 replies
  • 0 kudos

How does running VACUUM on Delta Lake tables affect read/write performance?

If I don't run VACUUM on a Delta Lake table, will that make my read performance slower?

Latest Reply
User16783853906
Databricks Employee
  • 0 kudos

VACUUM has no effect on read/write performance to that table. Never running VACUUM on a table will not make read/write performance to a Delta Lake table any slower. If you run VACUUM very infrequently, your VACUUM runtimes themselves may be pretty hig...

1 More Replies
User16783855534
by Databricks Employee
  • 1875 Views
  • 1 reply
  • 1 kudos

Can I have a Databricks Cluster that is only 1 node?

Yes, you can create a "Single Node" cluster: https://docs.databricks.com/clusters/single-node.html. It is currently not recommended to use "Single Node" clusters for streaming workloads.

Latest Reply
brickster_2018
Databricks Employee
  • 1 kudos

Single Node clusters should not be used for production workloads involving streaming queries or complex computations. The intention here is to bring up the Spark cluster for all kinds of workloads.
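For reference, a hedged sketch of the cluster spec the single-node docs describe for the Clusters API (the node type, name, and runtime version are placeholders):

```python
# Single-node cluster: zero workers plus the singleNode profile, so the
# driver runs Spark locally.
cluster_spec = {
    "cluster_name": "single-node-dev",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```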

User16826987838
by Databricks Employee
  • 1692 Views
  • 1 reply
  • 1 kudos
Latest Reply
User16783855534
Databricks Employee
  • 1 kudos

https://docs.databricks.com/dev-tools/api/latest/scim/scim-users.html#create-user
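A hedged sketch of calling that endpoint to create a user (workspace URL, token, and user name are placeholders):

```python
import requests

resp = requests.post(
    "https://<workspace-url>/api/2.0/preview/scim/v2/Users",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": "new.user@example.com",
    },
)
resp.raise_for_status()
```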

