Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

User16826994223
by Honored Contributor III
  • 768 Views
  • 1 replies
  • 0 kudos

Delta concurrency write Issue

What is the concurrency issue in Delta? If we try to write to the same Delta table at the same time, it sometimes fails. How can we mitigate that?

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

Delta Lake uses optimistic concurrency control to provide transactional guarantees between writes. Read: Reads (if needed) the latest available version of the table to identify which files need to be modified (that is, rewritten). Write: Stages all th...
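
The general idea of optimistic concurrency control can be illustrated with a toy model: a writer reads a version, stages its changes, and the commit succeeds only if no conflicting commit landed in the meantime; otherwise the writer re-reads and retries. This is a minimal sketch of the technique in general, not Delta Lake's implementation. `ToyTable`, `ConflictError`, and `write_with_retry` are hypothetical names invented for the illustration.

```python
class ConflictError(Exception):
    """Raised when another writer committed a conflicting change first."""


class ToyTable:
    """In-memory stand-in for a versioned table (hypothetical, for illustration)."""

    def __init__(self):
        self.version = 0
        self.files = {}    # file name -> contents
        self.history = []  # history[i] = set of files touched by commit i + 1

    def commit(self, read_version, changes):
        # Validate: fail if any commit made after `read_version` touched the same files.
        for touched in self.history[read_version:]:
            if touched & set(changes):
                raise ConflictError("files changed since version %d" % read_version)
        # Commit: apply the staged changes as a new version.
        self.files.update(changes)
        self.history.append(set(changes))
        self.version += 1
        return self.version


def write_with_retry(table, make_changes, max_attempts=3):
    """Read the current version, stage changes, and retry on conflict."""
    for _ in range(max_attempts):
        read_version = table.version                    # read
        changes = make_changes(dict(table.files))       # stage the write
        try:
            return table.commit(read_version, changes)  # validate and commit
        except ConflictError:
            continue  # re-read the latest version and try again
    raise ConflictError("gave up after %d attempts" % max_attempts)
```

In this toy model, writers that touch disjoint files never conflict, which is why arranging concurrent writers to work on separate partitions is a common mitigation.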

  • 0 kudos
sajith_appukutt
by Honored Contributor II
  • 781 Views
  • 1 replies
  • 1 kudos
Latest Reply
sajith_appukutt
Honored Contributor II
  • 1 kudos

You'd need to open connections to: Databricks web application, Databricks secure cluster connectivity (SCC) relay, AWS S3 global URL, AWS S3 regional URL, AWS STS global URL, AWS STS regional URL, AWS Kinesis regional URL, Table metastore RDS regional URL (by data ...

  • 1 kudos
Anonymous
by Not applicable
  • 884 Views
  • 2 replies
  • 0 kudos

Resolved! Collaborative features

What do you mean by collaborative data science? What collaboration features do you support?

Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

This primarily refers to the fact that notebooks can be shared to the whole org, to groups, to users, and can be limited to read/write/execute. You could argue that MLflow is also a form of collaboration, where multiple users can share an experiment ...

  • 0 kudos
1 More Replies
Srikanth_Gupta_
by Valued Contributor
  • 1569 Views
  • 2 replies
  • 0 kudos

What are the best instance types for using Delta Lake on AWS, Azure and GCP?

Are there any recommended instance types for getting the best performance from Delta? Example: i3.xlarge vs m5.2xlarge vs D3v2

Latest Reply
Mooune_DBU
Valued Contributor
  • 0 kudos

Depending on your queries, if you're looking for Delta Cache Optimized instances, here's the list per provider: AWS: i3.* (e.g. i3.xlarge); Azure: Ls-types (e.g. L4sv2); GCP: n2-highmem-*

  • 0 kudos
1 More Replies
User16790091296
by Contributor II
  • 1684 Views
  • 1 replies
  • 0 kudos
Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

Broadly, it's because high-concurrency clusters have to have much more control of user workloads in order to enforce resource sharing constraints. Scala is the lowest-level language you can access in Databricks, as you execute directly in the JVM, and...

  • 0 kudos
User16826994223
by Honored Contributor III
  • 820 Views
  • 1 replies
  • 0 kudos

multitask in Databricks

Hi Team, is there any way to use the same cluster to run multiple dependent jobs in a multi-task job? Starting a cluster for every job takes time.

Latest Reply
User16830818524
New Contributor II
  • 0 kudos

At this time it is not possible.

  • 0 kudos
sajith_appukutt
by Honored Contributor II
  • 890 Views
  • 1 replies
  • 0 kudos
Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

Does this help? "No Public IPs": https://docs.microsoft.com/en-us/azure/databricks/security/secure-cluster-connectivity

  • 0 kudos
User16826994223
by Honored Contributor III
  • 3041 Views
  • 1 replies
  • 0 kudos

How to log pickle files as part of an MLflow experiment run

I want to log certain artifacts as Python pickle files as part of an MLflow experiment. Is there a way to achieve this?

Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

Sure, pickle the object to a local file. Log it to your current run with mlflow.log_artifact. That's it. MLflow lets you log just about anything you want. However if you're experimenting with different variations on a sklearn Pipeline model, you coul...
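
The two steps in this reply (pickle to a local file, then log it with `mlflow.log_artifact`) might look like the sketch below. It assumes MLflow is installed and a tracking location is configured; the helper names `dump_pickle` and `log_pickle` and the file name `settings.pkl` are hypothetical.

```python
import os
import pickle
import tempfile


def dump_pickle(obj, directory, filename="artifact.pkl"):
    """Pickle `obj` to a local file and return the file's path."""
    path = os.path.join(directory, filename)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


def log_pickle(obj, filename="artifact.pkl"):
    """Pickle `obj` to a temporary file and log it to the active MLflow run."""
    import mlflow  # imported here so dump_pickle works without MLflow installed

    with tempfile.TemporaryDirectory() as tmp:
        # log_artifact copies the local file into the run's artifact store.
        mlflow.log_artifact(dump_pickle(obj, tmp, filename))


# Usage, assuming MLflow is installed and configured:
#
#     import mlflow
#     with mlflow.start_run():
#         log_pickle({"threshold": 0.5}, "settings.pkl")
```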

  • 0 kudos
User16826992666
by Valued Contributor
  • 1226 Views
  • 1 replies
  • 0 kudos
Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

Standard tiers are allowed to have 1000 saved jobs. Premium tiers have a higher limit at 1500. Some clouds have an enterprise tier which has a saved job limit of 2000. A workspace is limited to 1000 concurrent job runs. A 429 Too Many Requests respon...
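
A common way for a client to cope with a 429 Too Many Requests response is to retry with exponential backoff. The sketch below is a generic illustration, not a Databricks API: `send` is a hypothetical zero-argument callable that performs the request and returns an HTTP status code, and the attempt count and delays are arbitrary assumptions.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns a status other than 429.

    `send` is a hypothetical callable that performs the request and returns
    an HTTP status code. Returns the last status seen.
    """
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Double the wait after each rejected attempt.
            sleep(base_delay * 2 ** attempt)
    return status
```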

  • 0 kudos
User16826992185
by New Contributor II
  • 3700 Views
  • 1 replies
  • 0 kudos

Delta vs. Parquet

I'm curious about the benefits of using the Delta file format vs. Parquet. Is there any downside to using Delta?

Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

Not really. You get upsides like transactions, time travel, upsert/merge/deletes. There is some cost to that, as Delta manages that by writing and managing many smaller Parquet files and has to re-read them to recreate the current or past state of th...

  • 0 kudos
sajith_appukutt
by Honored Contributor II
  • 970 Views
  • 1 replies
  • 0 kudos

Resolved! I have a streaming aggregation query with highly variable micro-batch processing times. Seeing a lot of GC pauses in the logs. Any pointers on how to debug?

Though the data volume is relatively even, the streaming aggregation query is showing highly variable micro-batch processing times.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

By default, the state data (streaming aggregation query) is maintained in the JVM memory of the executors, and a large number of state objects can put memory pressure on the JVM, causing long GC pauses. If you have stateful operations in your streamin...

  • 0 kudos
Anonymous
by Not applicable
  • 1188 Views
  • 1 replies
  • 0 kudos
Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

DBFS is the "Databricks File System", but really it's just a shim / wrapper on top of distributed storage, that makes files in S3 or ADLS look like local files under the path /dbfs/... This can be really useful when working with libraries that do not...
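
Because files appear under `/dbfs/...`, ordinary local-file APIs can be used on them. A minimal sketch, assuming it runs on a Databricks cluster where `/dbfs` is available; the helpers `write_text` and `read_text` and the path `tmp/example/hello.txt` are hypothetical, and the `root` parameter exists only so the code can be tried elsewhere.

```python
import os


def write_text(relative_path, text, root="/dbfs"):
    """Write `text` under the DBFS mount using ordinary local-file APIs."""
    path = os.path.join(root, relative_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)
    return path


def read_text(relative_path, root="/dbfs"):
    """Read the file back through the same local path."""
    with open(os.path.join(root, relative_path)) as f:
        return f.read()


# On a Databricks cluster (hypothetical path):
#
#     write_text("tmp/example/hello.txt", "hi")
#     read_text("tmp/example/hello.txt")
```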

  • 0 kudos
User16826992666
by Valued Contributor
  • 5361 Views
  • 1 replies
  • 0 kudos

Resolved! When should I choose a different driver type on my cluster vs the worker type?

When creating a cluster, the driver type defaults to the same type as the workers, and this is what I usually choose. But in what kind of situation would I want to choose a different driver type?

Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

Using the same instance type is a fine default. If you know that you need very large workers, but little happens on the driver, maybe you can save money with a smaller driver. Conversely, you may know that some parts of your notebook involve a lot of...

  • 0 kudos
User16826992666
by Valued Contributor
  • 2010 Views
  • 1 replies
  • 0 kudos

Resolved! Is there a limit to the number of data points displayed in notebook visualizations?

I know that when you display the results of queries in notebooks there is a limit to the number of rows that are shown. Is there a similar limit to the results that are displayed in visuals within notebooks?

Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

Yes, still limited to 1000 rows / data points. However, when your visualization involves things like sums or averages of a Spark DataFrame's result, those will be performed on the cluster, so would involve maybe many more than 1000 data points, even ...

  • 0 kudos