Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

MoJaMa
by Databricks Employee
  • 1000 Views
  • 1 reply
  • 0 kudos
Latest Reply
MoJaMa
Databricks Employee
  • 0 kudos

Hosting your own internal PyPI mirror. That will allow you to manage and approve packages rather than downloading directly from public PyPI, and it also removes the dependency on an external service. Upload all wheel files to DBFS, maybe through a CI/CD proce...

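As a sketch of the first option: once an internal mirror is running, cluster-side pip can be pointed at it with a pip config file (the hostname below is a hypothetical placeholder):

```ini
# /etc/pip.conf — resolve packages from an internal mirror instead of public PyPI
# (internal-pypi.example.com is a placeholder for your mirror's hostname)
[global]
index-url = https://internal-pypi.example.com/simple
```

This keeps package approval inside your network; pip never contacts public PyPI.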
User16860826802
by New Contributor III
  • 6921 Views
  • 1 reply
  • 1 kudos

Resolved! Why does my cluster keep disappearing?

My team and I were using a cluster for some days and it disappeared without any apparent reason. I recreated the cluster, but after some days it disappeared again. Do you know why my cluster disappeared, and how to avoid it?

Latest Reply
User16860826802
New Contributor III
  • 1 kudos

A cluster is deleted 30 days after it is terminated. To keep an all-purpose cluster configuration even after a cluster has been terminated for more than 30 days, an administrator can pin the cluster. Up to 70 clusters can be pinned. To av...

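Pinning can also be done programmatically: the Clusters REST API exposes a pin endpoint. A minimal sketch of building that request (the workspace URL, cluster ID, and token below are placeholders):

```python
import json

def pin_cluster_request(workspace_url, cluster_id):
    """Build the request for the Clusters API pin endpoint.

    Pinning keeps a terminated all-purpose cluster's configuration
    from being deleted after 30 days.
    """
    url = f"{workspace_url}/api/2.0/clusters/pin"
    body = json.dumps({"cluster_id": cluster_id})
    headers = {"Authorization": "Bearer <personal-access-token>"}  # placeholder
    return url, body, headers

# Example with placeholder values; send with any HTTP client as a POST:
url, body, headers = pin_cluster_request("https://adb-123.azuredatabricks.net", "0308-abc123")
```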
Anonymous
by Not applicable
  • 1123 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16752244127
Contributor
  • 0 kudos

Delta Lake is a data storage and management layer that fixes the issues with existing data lakes, e.g. on S3, GCS or ADLS. Delta supports streaming and batch operations. It's an open source project, donated to the Linux Foundation. You can check it o...

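A small sketch of what Delta adds on top of plain files in S3/GCS/ADLS, in SQL (the table name is illustrative):

```sql
-- ACID table on cloud storage, usable from both batch and streaming jobs
CREATE TABLE events (id BIGINT, ts TIMESTAMP) USING DELTA;
INSERT INTO events VALUES (1, current_timestamp());

-- Time travel: read the table as it was at an earlier version
SELECT * FROM events VERSION AS OF 0;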
User16826994223
by Honored Contributor III
  • 6219 Views
  • 1 reply
  • 1 kudos

What is Overwatch?

I heard Databricks recommends Overwatch for monitoring clusters. Can anybody explain what metrics it provides, and how it is helpful for monitoring or better than Ganglia?

Latest Reply
alexott
Databricks Employee
  • 1 kudos

Overwatch is a different kind of tool - right now it can't be used for real-time monitoring like Ganglia. Overwatch collects data from multiple data sources (audit logs, APIs, cluster logs, etc.), then processes, enriches, and aggregates them following...

User16826994223
by Honored Contributor III
  • 2066 Views
  • 1 reply
  • 0 kudos

Resolved! How to find the best model using Python in MLflow

I have a use case in MLflow where I need Python code to find the model version that has the best metric (for instance, "accuracy") among many versions. I don't want to use the web UI; I want to achieve this with Python code. Any idea?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

import mlflow
client = mlflow.tracking.MlflowClient()
runs = client.search_runs("my_experiment_id", "", order_by=["metrics.rmse DESC"], max_results=1)
best_run = runs[0]
https://mlflow.org/docs/latest/python_api/mlflow.tracking.html#mlflow.tracking.M...

alexott
by Databricks Employee
  • 2525 Views
  • 1 reply
  • 0 kudos

What libraries could be used for unit testing of the Spark code?

We need to add unit test cases for the code that we're writing in Scala and Python. But we can't use calls like `assertEqual` for comparing the content of DataFrames. Are there any special libraries for that?

Latest Reply
alexott
Databricks Employee
  • 0 kudos

There are several libraries for Scala and Python that help with writing unit tests for Spark code. For Scala you can use the following: the built-in Spark test suite - it's designed to test all parts of Spark. It supports RDD, Dataframe/Dataset, Streaming API...

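The core trick these libraries implement is comparing collected rows while ignoring row order. A minimal sketch in plain Python (lists of dicts stand in for collected Rows; real libraries also compare schemas):

```python
def assert_rows_equal(actual, expected):
    """Compare two row sets ignoring row order (identical schemas assumed)."""
    key = lambda row: sorted(row.items())  # canonical ordering for each row
    assert sorted(actual, key=key) == sorted(expected, key=key), (
        f"rows differ: {actual} != {expected}"
    )

# Rows collected from two DataFrames in different orders still compare equal:
assert_rows_equal(
    [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}],
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
)
```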
User16826994223
by Honored Contributor III
  • 1049 Views
  • 0 replies
  • 0 kudos

How does Delta Sharing work?

How does Delta Sharing work? Delta Sharing is a simple REST protocol that securely shares access to part of a cloud dataset. It leverages modern cloud storage systems, such as S3, ADLS or GCS, to reliably transfer large datasets. There are two parties...

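On the recipient side, the protocol is driven by a small profile file the data provider hands out. A sketch of its shape (the endpoint and token below are placeholders):

```json
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token-from-provider>"
}
```

Delta Sharing client libraries read this file to authenticate against the provider's REST endpoint and fetch table data directly from cloud storage.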
User16790091296
by Contributor II
  • 2205 Views
  • 2 replies
  • 0 kudos
Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

Generally it is limited by the cloud provider; initially you get around 350 cores, which can be increased by request to the cloud vendor. Till now I have seen 1000 cores, and it can go much more. In addition to subscription limits, the total capacity of cluster...

1 More Replies
User16826992666
by Valued Contributor
  • 1638 Views
  • 1 reply
  • 0 kudos

Resolved! If I create a shallow clone of a Delta table, then add data to the clone, where is that data stored?

Since a shallow clone only copies the metadata of the original table, I'm wondering where new data would end up. Is it even possible to add data to a shallow clone? Is the data written back to the original source file location?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Shallow clones are really useful for short-lived use cases such as testing and experimentation. A shallow clone duplicates the metadata from the source table - and any new data added would go to the location specified while creating the clone. >Is the da...

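In SQL terms, the behavior described above looks roughly like this (table names and path are illustrative):

```sql
-- The clone reuses the source table's data files via metadata only,
-- but new files written to the clone land under the clone's own location,
-- not back in the source table's directory
CREATE TABLE sales_test SHALLOW CLONE sales LOCATION '/mnt/tmp/sales_test';
INSERT INTO sales_test SELECT * FROM staging_sales;  -- stored under /mnt/tmp/sales_test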
User16826992666
by Valued Contributor
  • 9626 Views
  • 1 reply
  • 0 kudos

Resolved! Can I upload an Excel file to create a table in a workspace?

On the Data tab in the workspace I have the "Create Table" button which gives me the option to upload a local file as a data source. Can I upload an Excel file here? Not sure what kind of files work for this.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Currently the file types supported there are CSV, JSON, and Avro. You could, however, upload the Excel file to the DBFS path under FileStore and write code in a notebook to parse it and persist it to a table.

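A sketch of that notebook approach (the file name is a placeholder; DBFS is exposed locally under the /dbfs FUSE mount, and reading .xlsx with pandas assumes openpyxl is installed on the cluster):

```python
def dbfs_to_local(dbfs_path):
    """Translate a dbfs:/ URI to the local FUSE mount path a notebook can open."""
    return "/dbfs/" + dbfs_path.removeprefix("dbfs:/")

local_path = dbfs_to_local("dbfs:/FileStore/tables/report.xlsx")

# In a notebook (sketch, not executed here):
# import pandas as pd
# pdf = pd.read_excel(local_path)              # needs openpyxl for .xlsx
# spark.createDataFrame(pdf).write.saveAsTable("report")
```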
User16826992666
by Valued Contributor
  • 2022 Views
  • 1 reply
  • 0 kudos

Resolved! If I create a clone of a Delta table, does it stay in sync with the original table?

Basically wondering what happens to the clone when updates are made to the original Delta table. Will the changes apply to the cloned table as well?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

The clone is not a replica, so updates made to the original Delta table wouldn't be applied to the clone. However, shallow clones reference data files in the source directory. If you run vacuum on the source table, clients will no longer be able t...

User16826992666
by Valued Contributor
  • 1454 Views
  • 1 reply
  • 0 kudos

Resolved! I know my partitions are skewed, is there anything I can do to help my performance?

I know the skew in my dataset has the potential to cause issues with my job performance, so just wondering if there is anything I can do to help my performance other than repartitioning the whole dataset.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

For scenarios like this, it is recommended to use a cluster with Databricks Runtime 7.3 LTS or above, where AQE is enabled. AQE dynamically handles skew in sort merge join and shuffle hash join by splitting (and replicating if needed) skewed tasks into ...

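On open-source Spark 3.x, the equivalent knobs can be toggled per session; these are standard Spark SQL configs (the threshold values shown are the defaults):

```sql
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;
-- thresholds that decide when a partition counts as skewed
SET spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5;
SET spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256MB;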
User16826992666
by Valued Contributor
  • 1240 Views
  • 1 reply
  • 0 kudos

Resolved! Do I still need to use skew join hints if I have Adaptive Query Execution enabled?

From what I have read about AQE it seems to do a lot of what skew join hints did automatically. So should I still be using skew hints in my queries? Is there harm in using them?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

With AQE, Databricks has the most up-to-date, accurate statistics at the end of a query stage and can opt for a better physical strategy and/or do optimizations that used to require hints. In the case of skew join hints, it is recommended to rely on AQE...

User15787040559
by Databricks Employee
  • 2302 Views
  • 1 reply
  • 0 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

In addition to subscription limits, the total capacity of clusters in each workspace is a function of the masks used for the workspace's enclosing VNet and the pair of subnets associated with each cluster in the workspace. The masks can be changed if...

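A back-of-the-envelope sketch of how the subnet mask bounds capacity (this assumes Azure's usual five reserved addresses per subnet and one IP per node, both stated as assumptions, not workspace-specific facts):

```python
def max_nodes(subnet_prefix_len, reserved=5):
    """Rough upper bound on cluster nodes one subnet can host.

    Assumes the cloud provider reserves `reserved` addresses per subnet
    (Azure reserves 5) and that each node consumes one IP in the subnet.
    """
    return 2 ** (32 - subnet_prefix_len) - reserved

# A /26 subnet (64 addresses) leaves room for roughly 59 nodes:
print(max_nodes(26))  # → 59
```

Widening the mask by one bit roughly doubles the ceiling, which is why resizing the subnets (where supported) raises total workspace capacity.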
