Data Engineering

Forum Posts

by User16783853906 • Contributor III

06-23-2021 2:14:56 PM

3647 Views
2 replies
0 kudos

Trigger.Once mode recommendation

When is it recommended to use Trigger.Once mode rather than fixed processing intervals with micro-batches?

Latest Reply

User16869510359
Esteemed Contributor

06-23-2021 2:26:12 PM

0 kudos

Also note that rate-limit configurations such as maxFilesPerTrigger and maxBytesPerTrigger are ignored with Trigger.Once. Streaming queries with significantly lower throughput can switch to Trigger.Once to avoid the continuous execution of the job checking the availab...
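
To make the two modes in this thread concrete, here is a minimal PySpark sketch that starts the same streaming write either as a one-shot Trigger.Once run or on a fixed processing interval. The function name, the Delta sink, and the default interval are assumptions made for illustration; `stream_df` is expected to be a streaming DataFrame created elsewhere (for example with `spark.readStream`).

```python
def start_stream(stream_df, output_path, checkpoint_path,
                 run_once=True, interval="5 minutes"):
    """Start a streaming write either once or on a fixed interval.

    stream_df is assumed to be a streaming DataFrame; the Delta sink
    and the paths are placeholders for this example.
    """
    writer = (
        stream_df.writeStream
        .format("delta")  # assumption: the sink is a Delta table path
        .option("checkpointLocation", checkpoint_path)
    )
    if run_once:
        # Process whatever is available since the last checkpoint, then stop.
        writer = writer.trigger(once=True)
    else:
        # Keep the query running and start a micro-batch every `interval`.
        writer = writer.trigger(processingTime=interval)
    return writer.start(output_path)
```

With `run_once=True` the job can be scheduled (say, hourly) and the cluster can shut down between runs; the checkpoint location is what lets each run pick up where the previous one stopped.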

1 More Replies

by User16783853906 • Contributor III

06-10-2021 2:49:06 PM

1533 Views
2 replies
0 kudos

VACUUM during read/write

Is it safe to run VACUUM on a Delta Lake table while data is being added to it at the same time? Will it impact the job result/performance?

Latest Reply

User16783853906
Contributor III

06-23-2021 2:26:03 PM

0 kudos

In the vast majority of cases, yes, it is safe to run VACUUM while data is concurrently being appended to or updated in the same table. This is because VACUUM deletes data files no longer referenced by a Delta table's transaction log and does not affect...
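
For reference, a small sketch of issuing VACUUM from PySpark with an explicit retention window. The helper name and its parameters are hypothetical; the SQL it builds uses the `VACUUM ... RETAIN n HOURS [DRY RUN]` form, and 168 hours matches Delta's default 7-day retention. Keeping the retention window longer than your longest-running concurrent job is what keeps this safe.

```python
def vacuum_table(spark, table_name, retain_hours=168, dry_run=False):
    """Run VACUUM on a Delta table with an explicit retention window.

    `spark` is an existing SparkSession; `table_name` is assumed to be
    a trusted identifier (it is interpolated into the SQL text).
    """
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    statement = f"VACUUM {table_name} RETAIN {int(retain_hours)} HOURS"
    if dry_run:
        # DRY RUN only lists the files that would be deleted.
        statement += " DRY RUN"
    return spark.sql(statement)
```

Running it first with `dry_run=True` is a cheap way to see what would be removed before deleting anything.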

1 More Replies

by User16783853906 • Contributor III

06-10-2021 2:47:11 PM

1886 Views
2 replies
0 kudos

How does running VACUUM on Delta Lake tables affect read/write performance?

If I don't run VACUUM on a Delta Lake table, will that make my read performance slower?

Latest Reply

User16783853906
Contributor III

06-23-2021 2:24:26 PM

0 kudos

VACUUM has no effect on read/write performance for that table. Never running VACUUM on a table will not make reads from or writes to a Delta Lake table any slower. If you run VACUUM very infrequently, your VACUUM runtimes themselves may be pretty hig...

1 More Replies

by User16783855534 • New Contributor III

06-23-2021 12:46:04 PM

684 Views
1 reply
1 kudos

Can I have a Databricks Cluster that is only 1 node?

Yes, you can create a "Single Node" cluster: https://docs.databricks.com/clusters/single-node.html. It is currently not recommended to use a "Single Node" cluster for streaming workloads.

Latest Reply

User16869510359
Esteemed Contributor

06-23-2021 2:17:13 PM

1 kudos

Single Node clusters should not be used for production workloads involving streaming queries or complex computations. The intention here is to bring up the Spark cluster for all kinds of workloads.

by User16826987838 • Contributor

06-23-2021 2:00:46 PM

446 Views
1 reply
0 kudos

What is the recommended way to log what queries individual users are running in the normal workspace?

Latest Reply

Anonymous
Not applicable

06-23-2021 2:02:38 PM

0 kudos

Not yet, but it is on the roadmap. Currently, this is only available in Databricks SQL.

by User16826987838 • Contributor

06-23-2021 1:58:06 PM

1492 Views
1 reply
0 kudos

Is it possible to change the Cluster Creator/Owner of a cluster after it has been created?

Latest Reply

Anonymous
Not applicable

06-23-2021 2:01:27 PM

0 kudos

You can't change the owner. You can clone the cluster, or you can grant "Can Manage" to another user, but the cluster creator stays fixed.

by User16826987838 • Contributor

06-23-2021 1:01:23 PM

463 Views
0 replies
0 kudos

How can I tell which runtime the model serving endpoints use?

by User16826987838 • Contributor

06-23-2021 12:30:38 PM

649 Views
1 reply
1 kudos

Is there a way to add users to a workspace programmatically (through an API?) instead of adding them manually through the Admin console?

Latest Reply

User16783855534
New Contributor III

06-23-2021 12:43:52 PM

1 kudos

https://docs.databricks.com/dev-tools/api/latest/scim/scim-users.html#create-user
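
As a rough sketch of what a call to that SCIM endpoint might look like, the snippet below builds (but does not send) a create-user request using only the Python standard library. The function name is made up for this example, and the workspace URL and token are placeholders; the path and SCIM schema URN follow the SCIM Users API the link points to, so double-check them against the docs for your workspace.

```python
import json
import urllib.request


def build_create_user_request(workspace_url, token, user_name):
    """Build a SCIM 'create user' request without sending it.

    workspace_url and token are placeholders; an admin personal access
    token is assumed to be required to actually create users.
    """
    payload = {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": user_name,
    }
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/preview/scim/v2/Users",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/scim+json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen(...)` would send it; looping over a list of email addresses gives a simple bulk-onboarding script.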

by User16869510359 • Esteemed Contributor

06-23-2021 12:25:51 PM

1346 Views
1 reply
0 kudos

Resolved! What is the difference between spark.sessionState.catalog.listTables vs spark.catalog.listTables

I see a significant performance difference when calling spark.sessionState.catalog.list compared to spark.catalog.list. Is that expected?

Latest Reply

User16869510359
Esteemed Contributor

06-23-2021 12:29:28 PM

0 kudos

spark.sessionState.catalog.listTables is a lazier implementation: it does not pull the column details when listing the tables, so it is faster, whereas spark.catalog.listTables pulls the column details as well. If the database has many Delta tabl...

by User16869510359 • Esteemed Contributor

06-23-2021 12:19:32 PM

3004 Views
1 reply
0 kudos

Resolved! How to list all Delta tables in a Database?

I want to get a list of all the Delta tables in a database. What is the easiest way of getting it?

Latest Reply

User16869510359
Esteemed Contributor

06-23-2021 12:22:17 PM

0 kudos

The code snippet below can be used to list the tables in a database:

val db = "database_name"
spark.sessionState.catalog.listTables(db).map(table=>spark.sessionState.catalog.externalCatalog.getTable(table.database.get,table.table)).filter(x=>x....
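
For PySpark users, here is an alternative sketch that does not use the Scala `sessionState` API from the reply. It is a different technique: it lists tables with `SHOW TABLES` and then reads the "Provider" row from `DESCRIBE TABLE EXTENDED` for each one. The function name is hypothetical, and the approach assumes the provider is reported as "delta" for Delta tables; it runs one DESCRIBE per table, so it may be slow on large databases.

```python
def list_delta_tables(spark, database):
    """Return names of tables in `database` whose provider is Delta.

    `spark` is an existing SparkSession. Assumes DESCRIBE TABLE EXTENDED
    exposes a 'Provider' row, which is where the format is read from.
    """
    delta_tables = []
    for table in spark.sql(f"SHOW TABLES IN {database}").collect():
        if table["isTemporary"]:
            continue  # temporary views are not stored tables
        name = table["tableName"]
        rows = spark.sql(
            f"DESCRIBE TABLE EXTENDED {database}.{name}"
        ).collect()
        # Pick the value of the 'Provider' row, if there is one.
        provider = next(
            (r["data_type"] for r in rows if r["col_name"] == "Provider"),
            None,
        )
        if provider is not None and provider.lower() == "delta":
            delta_tables.append(name)
    return delta_tables
```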

by User16826992666 • Valued Contributor

06-22-2021 7:15:48 PM

10770 Views
3 replies
0 kudos

Resolved! When I save a Spark dataframe using df.write.format("csv"), I end up with multiple csv files. Why is this happening?

Latest Reply

User16869510359
Esteemed Contributor

06-23-2021 12:12:11 PM

0 kudos

This is by design and working as expected. Spark writes the data in a distributed fashion, producing one file per partition. Using coalesce(1) can help generate a single file; however, this solution does not scale to large datasets because it brings all the data into a single task.
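
A minimal PySpark sketch of the coalesce(1) approach mentioned above. The helper name, the header option, and the overwrite mode are choices made for this example; `df` is any existing DataFrame.

```python
def write_single_csv(df, output_dir):
    """Write `df` as CSV using a single partition.

    Only suitable for small results: all rows pass through one task.
    """
    (
        df.coalesce(1)             # one partition -> one part-*.csv file
        .write
        .format("csv")
        .option("header", "true")  # assumption: a header row is wanted
        .mode("overwrite")         # assumption: replacing old output is fine
        .save(output_dir)
    )
```

Note that `output_dir` is still a directory: it will contain a single part file plus Spark's marker files, so a rename step is needed if a specific file name is required.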

2 More Replies

by Srikanth_Gupta_ • Valued Contributor

06-23-2021 8:23:24 AM

474 Views
1 reply
1 kudos

Can we use Photon for batch and streaming processing instead of Spark, and when will it be publicly available?

Latest Reply

aladda
Honored Contributor II

06-23-2021 12:03:30 PM

1 kudos

Photon is supported for batch workloads today; it is the standard on Databricks SQL clusters and is available as an option for Automated and Interactive clusters. Photon is in public preview today, so it is available as an option for everyone. See this lin...

by User16869510359 • Esteemed Contributor

06-23-2021 8:31:09 AM

538 Views
2 replies
0 kudos

I don't have Upsert/Merge use cases. Should I use Delta or can I use Parquet?

Latest Reply

aladda
Honored Contributor II

06-23-2021 11:58:34 AM

0 kudos

Delta has significant value beyond the DML/ACID capabilities. Delta's data organization strategies that @Ryan Chynoweth mentions also offer an advantage even for read-only use cases for querying and joining the data. Delta also supports in-place con...

1 More Replies

by Srikanth_Gupta_ • Valued Contributor

06-23-2021 9:40:35 AM

1554 Views
1 reply
0 kudos

Resolved! How do we ingest data from Salesforce into Delta Lake for a CRM analytics use case?

Latest Reply

aladda
Honored Contributor II

06-23-2021 11:55:06 AM

0 kudos

This spark-salesforce connector looks like an option to query this data via SOQL/SAQL and bring it into Databricks/Spark.

by christys • Community Manager

05-28-2021 12:01:37 PM

406 Views
1 reply
0 kudos

What's the easiest way to deploy a Databricks workspace on AWS?

Latest Reply

Taha
New Contributor III

06-23-2021 10:36:30 AM

0 kudos

There are actually several options here!

AWS

If you'd like a very quick setup but a full-featured environment for your org, use the AWS quickstart: https://aws.amazon.com/quickstart/architecture/databricks/

If you're solo exploring, you can use Databricks c...
