Data Engineering

Forum Posts

by User16826994223 • Databricks Employee

06-18-2021 3:47:20 AM

2896 Views
1 replies
0 kudos

Resolved! How to find best model using python in mlflow

I have a use case in MLflow where I need Python code to find the model version with the best metric (for instance, "accuracy") among many versions. I don't want to use the web UI; I want to do this in Python. Any ideas?

Latest Reply

User16826994223
Databricks Employee

06-18-2021 3:48:01 AM

0 kudos

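import mlflow

client = mlflow.tracking.MlflowClient()
# rmse is an error metric, so sort ascending to get the lowest value first;
# use "metrics.accuracy DESC" instead when a higher value is better.
runs = client.search_runs("my_experiment_id", "", order_by=["metrics.rmse ASC"], max_results=1)
best_run = runs[0]

https://mlflow.org/docs/latest/python_api/mlflow.tracking.html#mlflow.tracking.M...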
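The question asks about model versions rather than runs. Here is a minimal sketch of one way to rank versions, assuming they live in the Model Registry and the metric was logged on each version's source run. The helper name `best_model_version` and the model name `my_model` are hypothetical; `search_model_versions` and `get_run` are `MlflowClient` methods.

```python
def best_model_version(client, model_name, metric, higher_is_better=True):
    """Return (version, metric_value) for the registered version with the best metric.

    `client` is expected to be an mlflow.tracking.MlflowClient, or any object
    exposing the same search_model_versions / get_run methods.
    """
    scored = []
    for mv in client.search_model_versions(f"name='{model_name}'"):
        if not mv.run_id:
            continue  # version was not registered from a run, so no metrics to read
        # Each registered version points back to the run that produced it.
        metrics = client.get_run(mv.run_id).data.metrics
        if metric in metrics:  # skip versions whose run never logged the metric
            scored.append((mv.version, metrics[metric]))
    if not scored:
        raise ValueError(f"no version of {model_name!r} has metric {metric!r}")
    pick = max if higher_is_better else min
    return pick(scored, key=lambda pair: pair[1])


# Usage (assumes mlflow is installed and "my_model" is a registered model):
#   import mlflow
#   client = mlflow.tracking.MlflowClient()
#   version, accuracy = best_model_version(client, "my_model", "accuracy")
```

For error metrics such as rmse, pass `higher_is_better=False` so the lowest value wins.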

by alexott • Databricks Employee

06-18-2021 2:54:12 AM

4299 Views
1 replies
0 kudos

What libraries could be used for unit testing of the Spark code?

We need to add unit tests for the Spark code that we're writing in Scala and Python. But we can't use calls like `assertEqual` to compare the content of DataFrames. Are there any special libraries for that?

Latest Reply

alexott
Databricks Employee

06-18-2021 3:01:50 AM

0 kudos

There are several libraries for Scala and Python that help with writing unit tests for Spark code.

For Scala you can use the following:

Built-in Spark test suite - it's designed to test all parts of Spark. It supports RDD, Dataframe/Dataset, Streaming API...
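As a rough illustration of why plain `assertEqual` is awkward for DataFrames: Spark does not guarantee row order, so a test usually has to compare the collected rows as a multiset. The helper below is a hypothetical sketch, not taken from any of the libraries mentioned in the reply. It works on lists of tuples such as those returned by `df.collect()` (PySpark `Row` objects are tuples) and assumes every value in a row is hashable.

```python
from collections import Counter


def assert_rows_equal(actual_rows, expected_rows):
    """Raise AssertionError unless both collections hold the same rows,
    ignoring order but respecting duplicates."""
    actual = Counter(tuple(row) for row in actual_rows)
    expected = Counter(tuple(row) for row in expected_rows)
    missing = expected - actual      # rows expected but not produced
    unexpected = actual - expected   # rows produced but not expected
    if missing or unexpected:
        raise AssertionError(
            f"missing rows: {sorted(missing.elements(), key=repr)}; "
            f"unexpected rows: {sorted(unexpected.elements(), key=repr)}"
        )


# In a PySpark test this would typically be called as:
#   assert_rows_equal(result_df.collect(), expected_df.collect())
```

Collecting to the driver only suits small test fixtures, and a schema comparison would still need to be added separately.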

by User16826994223 • Databricks Employee

06-18-2021 1:11:01 AM

1957 Views
0 replies
0 kudos

How does Delta Sharing work?

Delta Sharing is a simple REST protocol that securely shares access to part of a cloud dataset. It leverages modern cloud storage systems, such as S3, ADLS or GCS, to reliably transfer large datasets. There are two parties...

by User16790091296 • Databricks Employee

05-28-2021 12:11:00 PM

3173 Views
2 replies
0 kudos

What’s the largest cluster/maximum number of cores you can spin up in the Databricks environment?

Latest Reply

User16826994223
Databricks Employee

06-18-2021 1:06:21 AM

0 kudos

Generally it is limited by the cloud provider. Initially you get around 350 cores, which can be increased by request to the cloud vendor. So far I have seen 1000 cores, and it can go much higher.

In addition to subscription limits, the total capacity of cluster...

by User16790091296 • Databricks Employee

05-21-2021 11:51:09 AM

2359 Views
2 replies
0 kudos

Can I access Delta tables outside of Databricks Runtime?

Latest Reply

User16826994223
Databricks Employee

06-18-2021 12:50:48 AM

0 kudos

Yes, you can. Writes can be an issue, but reads are fine.

by User16826992666 • Databricks Employee

06-16-2021 8:06:45 PM

2372 Views
1 replies
0 kudos

Resolved! If I create a shallow clone of a Delta table, then add data to the clone, where is that data stored?

Since a shallow clone only copies the metadata of the original table, I'm wondering where new data would end up. Is it even possible to add data to a shallow clone? Is the data written back to the original source file location?

Latest Reply

sajith_appukutt
Databricks Employee

06-18-2021 12:12:29 AM

0 kudos

Shallow clones are really useful for short-lived use cases such as testing and experimentation. A shallow clone duplicates the metadata from the source table, and any new data added goes to the location specified when creating the shallow clone.

>Is the da...
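A minimal sketch of creating a shallow clone with an explicit location, so that data added to the clone lands there rather than in the source table's directory. The helper name, table names and path are hypothetical, and the `SHALLOW CLONE ... LOCATION` statement is assumed to be run on a Databricks cluster through an existing `spark` session.

```python
def create_shallow_clone(spark, source_table, clone_table, clone_location):
    """Create a shallow clone whose own data files live under clone_location.

    Only metadata is copied at creation time; files written to the clone
    afterwards go to clone_location, not to the source table's directory.
    """
    statement = (
        f"CREATE TABLE IF NOT EXISTS {clone_table} "
        f"SHALLOW CLONE {source_table} "
        f"LOCATION '{clone_location}'"
    )
    spark.sql(statement)
    return statement


# In a notebook (names and path are placeholders):
#   create_shallow_clone(spark, "prod.events", "dev.events_clone", "/tmp/events_clone")
```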

by User16826992666 • Databricks Employee

06-16-2021 8:24:00 PM

11248 Views
1 replies
0 kudos

Resolved! Can I upload an Excel file to create a table in a workspace?

On the Data tab in the workspace I have the "Create Table" button which gives me the option to upload a local file as a data source. Can I upload an Excel file here? Not sure what kind of files work for this.

Latest Reply

sajith_appukutt
Databricks Employee

06-18-2021 12:10:07 AM

0 kudos

Currently the file types supported there are CSV, JSON and Avro. You could, however, upload the Excel file to the DBFS path under FileStore and write code in a notebook to parse it and persist it to a table.

by User16826992666 • Databricks Employee

06-16-2021 8:08:36 PM

2869 Views
1 replies
0 kudos

Resolved! If I create a clone of a Delta table, does it stay in sync with the original table?

Basically wondering what happens to the clone when updates are made to the original Delta table. Will the changes apply to the cloned table as well?

Latest Reply

sajith_appukutt
Databricks Employee

06-18-2021 12:03:14 AM

0 kudos

The clone is not a replica, so updates made to the original Delta table wouldn't be applied to the clone. However, shallow clones reference data files in the source directory. If you run vacuum on the source table, clients will no longer be able t...

by User16826992666 • Databricks Employee

06-16-2021 8:41:10 PM

2178 Views
1 replies
0 kudos

Resolved! I know my partitions are skewed, is there anything I can do to help my performance?

I know the skew in my dataset has the potential to cause issues with my job performance, so just wondering if there is anything I can do to help my performance other than repartitioning the whole dataset.

Latest Reply

sajith_appukutt
Databricks Employee

06-17-2021 11:16:06 PM

0 kudos

For scenarios like this, it is recommended to use a cluster with Databricks Runtime 7.3 LTS or above, where AQE is enabled. AQE dynamically handles skew in sort merge join and shuffle hash join by splitting (and replicating if needed) skewed tasks into ...
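On a runtime where AQE is not already on, the relevant switches can be set explicitly. The sketch below uses the standard Spark 3.x configuration keys; the helper name `enable_aqe_skew_handling` is hypothetical, and `spark_conf` is assumed to be the session's runtime config (`spark.conf` in a notebook).

```python
# Spark 3.x configuration keys for Adaptive Query Execution (AQE).
AQE_SKEW_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",           # turn AQE on
    "spark.sql.adaptive.skewJoin.enabled": "true",  # let AQE split skewed join partitions
}


def enable_aqe_skew_handling(spark_conf, settings=AQE_SKEW_SETTINGS):
    """Apply the AQE settings through any object with a set(key, value) method."""
    for key, value in settings.items():
        spark_conf.set(key, value)
    return dict(settings)


# In a notebook:
#   enable_aqe_skew_handling(spark.conf)
```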

by User16826992666 • Databricks Employee

06-16-2021 8:59:37 PM

1896 Views
1 replies
0 kudos

Resolved! Do I still need to use skew join hints if I have Adaptive Query Execution enabled?

From what I have read about AQE it seems to do a lot of what skew join hints did automatically. So should I still be using skew hints in my queries? Is there harm in using them?

Latest Reply

sajith_appukutt
Databricks Employee

06-17-2021 11:13:31 PM

0 kudos

With AQE, Databricks has the most up-to-date, accurate statistics at the end of a query stage and can opt for a better physical strategy and/or do optimizations that used to require hints. In the case of skew join hints, it is recommended to rely on AQE...

by User15787040559 • Databricks Employee

06-07-2021 9:17:42 AM

3060 Views
1 replies
0 kudos

What is the maximum number of clusters per workspace in Azure Databricks?

It's governed by Azure subscription limits.

Latest Reply

sajith_appukutt
Databricks Employee

06-17-2021 10:58:46 PM

0 kudos

In addition to subscription limits, the total capacity of clusters in each workspace is a function of the masks used for the workspace's enclosing Vnet and the pair of subnets associated with each cluster in the workspace. The masks can be changed if...
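To make the dependence on subnet masks concrete, the arithmetic below estimates how many cluster nodes a pair of workspace subnets can hold. It is a sketch under two assumptions that are not stated in the reply: every node consumes one IP address in each of the two subnets, and Azure reserves five addresses per subnet. The CIDR ranges are made up.

```python
import ipaddress

# Assumption: Azure holds back 5 addresses in every subnet.
AZURE_RESERVED_IPS_PER_SUBNET = 5


def max_nodes(subnet_a_cidr, subnet_b_cidr):
    """Estimate the node capacity of a workspace from its two subnet CIDRs.

    Assumes each node needs one address in each subnet, so the smaller
    subnet is the bottleneck.
    """
    def usable(cidr):
        total = ipaddress.ip_network(cidr).num_addresses
        return max(total - AZURE_RESERVED_IPS_PER_SUBNET, 0)

    return min(usable(subnet_a_cidr), usable(subnet_b_cidr))


print(max_nodes("10.0.0.0/26", "10.0.0.64/26"))  # 64 - 5 = 59
```

Under the same assumptions, widening both subnets to /24 would raise the estimate to 251 nodes, which is why the masks matter as much as the subscription quota.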

by User16826992666 • Databricks Employee

06-16-2021 8:27:12 PM

7023 Views
2 replies
0 kudos

Resolved! Can multiple streams write to a Delta table at the same time?

Wondering if there are any dangers to doing this, and if it's a best practice. I'm concerned there could be conflicts but I'm not sure how Delta would handle it.

Latest Reply

sajith_appukutt
Databricks Employee

06-17-2021 10:51:32 PM

0 kudos

>Can multiple streams write to a Delta table at the same time?

Yes, Delta uses optimistic concurrency control and configurable isolation levels.

>I'm concerned there could be conflicts but I'm not sure how Delta would handle it.

Write operations can resul...

by User16790091296 • Databricks Employee

05-21-2021 11:40:49 AM

1876 Views
1 replies
0 kudos

What’s the best instance type to run OPTIMIZE (bin-packing and Z-Ordering) on?

I've been doing some research on optimizing data storage while implementing Delta. However, I'm not sure which instance type would be best for this.

Latest Reply

sajith_appukutt
Databricks Employee

06-17-2021 10:26:45 PM

0 kudos

OPTIMIZE, as you alluded, has two operations: bin-packing and multi-dimensional clustering (Z-Ordering).

Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect.

Z-Ordering is not idempotent b...

by MoJaMa • Databricks Employee

06-17-2021 5:51:20 PM

4423 Views
2 replies
0 kudos

Do you have a demo built out for monitoring model drift in Databricks?

Latest Reply

User16783853898
Databricks Employee

06-17-2021 8:28:10 PM

0 kudos

Code from DAIS 2021: https://github.com/chengyin38/dais_2021_drifting_away

by User16826994223 • Databricks Employee

06-15-2021 7:49:07 AM

1863 Views
3 replies
0 kudos

What is Autoloader in Databricks?

I want to know what Autoloader is and what its advantages are.

Latest Reply

MoJaMa
Databricks Employee

06-17-2021 6:47:19 PM

0 kudos

The biggest advantage is the ease with which you can start ingesting data from your Cloud Storage directly into a Delta Table. You can choose Directory Listing mode or File Notification mode, depending on what fits your use case best.
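A minimal sketch of what that looks like in a notebook, assuming a Databricks cluster where the `cloudFiles` source is available. The helper names and all paths are hypothetical, and the schema is passed explicitly rather than inferred. The option `cloudFiles.useNotifications` switches between the two modes mentioned above.

```python
def autoloader_options(file_format, use_notifications=False):
    """Build Auto Loader reader options.

    use_notifications=False -> directory listing mode
    use_notifications=True  -> file notification mode
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def start_ingest(spark, schema, source_path, target_path, checkpoint_path, options):
    """Stream newly arriving files from source_path into a Delta table at target_path."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .schema(schema)  # StructType or DDL string describing the incoming files
        .load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .start(target_path)
    )


# In a notebook (all paths are placeholders):
#   opts = autoloader_options("json")
#   start_ingest(spark, "id INT, name STRING", "/mnt/raw/events",
#                "/mnt/delta/events", "/mnt/checkpoints/events", opts)
```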
