Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User15787040559
by Databricks Employee
  • 4241 Views
  • 2 replies
  • 0 kudos

How to do a unionAll() when the number and names of the columns are different?

Looking at the API for DataFrame.unionAll(), when you have 2 different DataFrames with different numbers of columns and names, unionAll() doesn't work. How can you do it? One possible solution is using the following function which performs the union of tw...

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

I'm not sure union is the right tool, if the DataFrames have fundamentally different information in them. If the difference is merely column name, yes, rename. If they don't, then the 'union' contemplated here is really a union of columns as well as ...

  • 0 kudos
1 More Replies
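The truncated answer above points at a helper function we can't see in full, so here is a minimal sketch of the column-alignment idea in plain Python (rows as dicts). Note that in Spark 3.1+ the built-in equivalent is `df1.unionByName(df2, allowMissingColumns=True)`; this sketch only illustrates the logic, it is not the poster's original helper.

```python
# Sketch: union two row lists with different column sets, padding
# columns missing on either side with None. This mirrors what
# unionByName(allowMissingColumns=True) does in Spark 3.1+.

def union_by_name(rows_a, rows_b):
    """Union two row lists, filling columns missing on either side with None."""
    cols = []
    for row in rows_a + rows_b:   # collect every column name, preserving order
        for c in row:
            if c not in cols:
                cols.append(c)
    # re-emit every row with the full column set, padding gaps with None
    return [{c: row.get(c) for c in cols} for row in rows_a + rows_b]

a = [{"id": 1, "name": "x"}]
b = [{"id": 2, "price": 9.5}]
print(union_by_name(a, b))
# [{'id': 1, 'name': 'x', 'price': None}, {'id': 2, 'name': None, 'price': 9.5}]
```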
User16826994223
by Honored Contributor III
  • 910 Views
  • 1 replies
  • 0 kudos

Start photon cluster

How do I start a Photon cluster, and where can I find the pricing for a Photon cluster?

Latest Reply
craig_ng
New Contributor III
  • 0 kudos

As of the time of this message, Photon is available in the Data Science & Engineering workspace in Public Preview on AWS. You can reference our docs for instructions on how to provision a cluster using a Photon-enabled runtime. As for pricing, we tre...

  • 0 kudos
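For the mechanics of provisioning: in the cluster-creation UI you pick a Photon runtime from the Databricks Runtime dropdown, and via the Clusters API the runtime is selected with the `spark_version` field. The sketch below is illustrative only; the version string and node type are placeholder values, not exact current ones.

```json
{
  "cluster_name": "photon-demo",
  "spark_version": "9.1.x-photon-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2
}
```

Photon runtime version strings have historically contained a `-photon-` segment; check the runtime release notes for the versions currently offered.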
Anonymous
by Not applicable
  • 987 Views
  • 1 replies
  • 0 kudos
Latest Reply
craig_ng
New Contributor III
  • 0 kudos

We list the OS version in the "Environment" section of each runtime version's release notes. See link to all the runtime release notes here: https://docs.databricks.com/release-notes/runtime/releases.html

  • 0 kudos
MoJaMa
by Databricks Employee
  • 999 Views
  • 1 replies
  • 0 kudos
Latest Reply
MoJaMa
Databricks Employee
  • 0 kudos

Hosting your own internal PyPI mirror will allow you to manage and approve packages vs. directly downloading from public PyPI, and would also remove the dependency on an external service. Upload all wheel files to DBFS, maybe through a CI/CD proce...

  • 0 kudos
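As an illustration of the mirror approach, clients can be pointed at an internal index via pip configuration. The hostname below is hypothetical; substitute your own mirror's URL.

```ini
# pip.conf (~/.pip/pip.conf) - point pip at a hypothetical internal mirror
[global]
index-url = https://pypi.mycompany.internal/simple
```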
User16860826802
by New Contributor III
  • 6897 Views
  • 1 replies
  • 1 kudos

Resolved! Why does my cluster keep disappearing?

My team and I were using a cluster for some days and it disappeared without any apparent reason. I recreated the cluster, but after some days it disappeared again. Do you know why my cluster disappeared? How can I avoid that?

Latest Reply
User16860826802
New Contributor III
  • 1 kudos

A cluster is permanently deleted 30 days after it is terminated. To keep an all-purpose cluster configuration even after a cluster has been terminated for more than 30 days, an administrator can pin the cluster. Up to 70 clusters can be pinned. To av...

  • 1 kudos
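Pinning can also be done programmatically. A minimal sketch, assuming the Clusters API exposes a `POST /api/2.0/clusters/pin` endpoint; the host and cluster ID are placeholders, and the helper only builds the request so nothing is actually sent here:

```python
# Sketch: build (but don't send) a request to pin a cluster via the
# Databricks Clusters API. Host and cluster ID are placeholder values.

def build_pin_request(host, cluster_id):
    """Return the URL and JSON body for a cluster-pin call."""
    url = f"{host}/api/2.0/clusters/pin"
    body = {"cluster_id": cluster_id}
    return url, body

url, body = build_pin_request("https://example.cloud.databricks.com", "0101-120000-abcd123")
print(url)   # https://example.cloud.databricks.com/api/2.0/clusters/pin
print(body)  # {'cluster_id': '0101-120000-abcd123'}
```

In a real script you would POST this with your HTTP client of choice, authenticating with a personal access token.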
Anonymous
by Not applicable
  • 1121 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16752244127
Contributor
  • 0 kudos

Delta Lake is a data storage and management layer that fixes the issues with existing data lakes, e.g. on S3, GCS or ADLS. Delta supports streaming and batch operations. It's an open source project, donated to the Linux Foundation. You can check it o...

  • 0 kudos
User16826994223
by Honored Contributor III
  • 6217 Views
  • 1 replies
  • 1 kudos

What is overwatch ?

I heard Databricks recommends Overwatch for monitoring clusters. Can anybody explain what metrics it provides, and how it is helpful for monitoring or better than Ganglia?

Latest Reply
alexott
Databricks Employee
  • 1 kudos

Overwatch is a different kind of tool - right now it can't be used for real-time monitoring like Ganglia. Overwatch collects data from multiple data sources (audit logs, APIs, cluster logs, etc.), then processes, enriches, and aggregates them following...

  • 1 kudos
User16826994223
by Honored Contributor III
  • 2059 Views
  • 1 replies
  • 0 kudos

Resolved! How to find best model using python in mlflow

I have a use case in MLflow with Python code to find the model version that has the best metric (for instance, "accuracy") among many versions. I don't want to use the web UI but want to achieve this with Python code. Any idea?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

import mlflow
client = mlflow.tracking.MlflowClient()
runs = client.search_runs("my_experiment_id", "", order_by=["metrics.rmse DESC"], max_results=1)
best_run = runs[0]
https://mlflow.org/docs/latest/python_api/mlflow.tracking.html#mlflow.tracking.M...

  • 0 kudos
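For readers without an MLflow tracking server handy, the selection that `search_runs(order_by=[...])` performs server-side can be sketched over plain dicts; the run records below are made-up examples:

```python
# Sketch of "find the best run by a metric", the same selection that
# MlflowClient.search_runs(order_by=["metrics.accuracy DESC"], max_results=1)
# performs on the tracking server.

def best_run(runs, metric):
    """Return the run with the highest value for `metric`."""
    return max(runs, key=lambda r: r["metrics"][metric])

runs = [
    {"run_id": "a", "metrics": {"accuracy": 0.91}},
    {"run_id": "b", "metrics": {"accuracy": 0.95}},
    {"run_id": "c", "metrics": {"accuracy": 0.88}},
]
print(best_run(runs, "accuracy")["run_id"])  # b
```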
alexott
by Databricks Employee
  • 2509 Views
  • 1 replies
  • 0 kudos

What libraries could be used for unit testing of the Spark code?

We need to add unit test cases for the code that we're writing in Scala and Python. But we can't use calls like `assertEqual` for comparing the content of DataFrames. Are there any special libraries for that?

Latest Reply
alexott
Databricks Employee
  • 0 kudos

There are several libraries for Scala and Python that help with writing unit tests for Spark code. For Scala you can use the following: Built-in Spark test suite - it's designed to test all parts of Spark. It supports RDD, Dataframe/Dataset, Streaming API...

  • 0 kudos
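At their core, DataFrame-equality helpers in such libraries (e.g. chispa's `assert_df_equality` in Python, or spark-testing-base in Scala) compare collected rows, typically ignoring row order. A minimal sketch of that idea in plain Python, with rows represented as tuples as `df.collect()` would roughly yield:

```python
# Core idea behind DataFrame-equality test helpers: compare the
# collected rows of two DataFrames, ignoring row order.

def assert_rows_equal(actual, expected):
    """Fail with a descriptive message if the two row sets differ."""
    if sorted(actual) != sorted(expected):
        raise AssertionError(f"rows differ: {sorted(actual)} != {sorted(expected)}")

# In a real test, `actual` would come from df.collect().
assert_rows_equal([(2, "b"), (1, "a")], [(1, "a"), (2, "b")])  # passes: order ignored
```

Real libraries add more on top of this (schema comparison, approximate float equality, readable diffs), which is why they are worth using over a hand-rolled assert.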
User16826994223
by Honored Contributor III
  • 1042 Views
  • 0 replies
  • 0 kudos

How does Delta Sharing work?

How does Delta Sharing work? Delta Sharing is a simple REST protocol that securely shares access to part of a cloud dataset. It leverages modern cloud storage systems, such as S3, ADLS or GCS, to reliably transfer large datasets. There are two parties...

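On the recipient side, access to a share is typically configured with a small profile file that names the provider's endpoint and a bearer token. The field names below follow the open Delta Sharing protocol; the endpoint URL and token are placeholders.

```json
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token>"
}
```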
User16790091296
by Contributor II
  • 2195 Views
  • 2 replies
  • 0 kudos
Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

Generally it is limited by the cloud provider. Initially you get around 350 cores, which can be increased by request to the cloud vendor. Till now I have seen 1,000 cores, and it can go much more. In addition to subscription limits, the total capacity of cluster...

  • 0 kudos
1 More Replies
User16826992666
by Valued Contributor
  • 1633 Views
  • 1 replies
  • 0 kudos

Resolved! If I create a shallow clone of a Delta table, then add data to the clone, where is that data stored?

Since a shallow clone only copies the metadata of the original table, I'm wondering where new data would end up. Is it even possible to add data to a shallow clone? Is the data written back to the original source file location?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

Shallow clones are really useful for short-lived use cases such as testing and experimentation. A shallow clone duplicates the metadata from the source table - and any new data added would go to the location specified while creating the shallow clone. > Is the da...

  • 0 kudos
