Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

User16790091296
by Contributor II
  • 2658 Views
  • 1 replies
  • 0 kudos

Notebook path can't be in DBFS?

Some of us are working with IDEs and trying to deploy notebooks (.py files) to DBFS. The problem I have noticed is that when configuring jobs, those paths are not recognized. notebook_path: If I use this: dbfs:/artifacts/client-state-vector/0.0.0/bootstrap...

Latest Reply
User16752239289
Valued Contributor
  • 0 kudos

The issue is that the Python file is saved under DBFS, not as a workspace notebook. When you give /artifacts/client-state vector/0.0.0/bootstrap.py, the workspace will search for the notebook (a Python file in this case) under the folder that is under Workspace t...
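The distinction matters when defining the job itself: the notebook_task path should be a workspace path rather than a dbfs:/ URI. Below is a minimal sketch of creating such a job through the Jobs 2.1 API; the workspace URL, token, notebook path, and cluster settings are placeholders for illustration, not values from this thread.

```python
import requests

# Hypothetical workspace URL and token for illustration.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "client-state-vector-bootstrap",
    "tasks": [
        {
            "task_key": "bootstrap",
            "notebook_task": {
                # Workspace path (e.g. after importing the .py file as a notebook),
                # not a dbfs:/ location.
                "notebook_path": "/Users/someone@example.com/bootstrap"
            },
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 1,
            },
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # {'job_id': ...}
```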

User16826994223
by Honored Contributor III
  • 921 Views
  • 1 replies
  • 0 kudos

Is it possible that only a particular cluster has access to an S3 bucket or folder in S3?

Hi, I want to set up a cluster and give access to that cluster to some users only; those users, on that particular cluster, should have access to read from and write to the bucket. That particular bucket is not mounted on the workspace. Is th...

  • 921 Views
  • 1 replies
  • 0 kudos
Latest Reply
User16752239289
Valued Contributor
  • 0 kudos

Yes, you can set up an instance profile that can access the S3 bucket and then only give certain users the privilege to use that instance profile. For more details, you can check here.
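For context, here is a minimal sketch of attaching such an instance profile to a single cluster through the Clusters API, so only that cluster assumes the IAM role that can reach the bucket; the host, token, ARN, and node settings are hypothetical.

```python
import requests

# Hypothetical host, token, and instance profile ARN for illustration.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "s3-restricted-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    # Only this cluster assumes the IAM role that can read/write the bucket;
    # access to the instance profile itself can then be limited to certain users.
    "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/s3-bucket-access"
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # {'cluster_id': ...}
```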

StephanieAlba
by Valued Contributor III
  • 962 Views
  • 1 replies
  • 0 kudos

Is the delta schema enforcement flexible?

In the sense that: is it possible to check only column names or only column data types, or will it always be both?

Latest Reply
StephanieAlba
Valued Contributor III
  • 0 kudos

No, I do not believe that is possible. However, I would be interested in understanding a use case where that is ideal behavior. How does schema enforcement work? Delta Lake uses schema validation on write, which means that all new writes to a table ar...
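To illustrate that write-time validation, here is a small sketch (the table path is hypothetical) showing that an append with an extra column is rejected as a whole; schema evolution has to be requested explicitly for that write rather than relaxing only names or only types.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta table location for illustration.
path = "/tmp/delta/events"

spark.range(5).selectExpr("id", "CAST(id AS STRING) AS name") \
    .write.format("delta").mode("overwrite").save(path)

# A write with an extra column (or a mismatched type) is rejected on write...
new_rows = spark.range(5).selectExpr(
    "id", "CAST(id AS STRING) AS name", "current_timestamp() AS ingested_at"
)
try:
    new_rows.write.format("delta").mode("append").save(path)
except Exception as e:
    print("Schema enforcement blocked the write:", type(e).__name__)

# ...unless schema evolution is explicitly enabled for that write.
new_rows.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)
```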

brickster_2018
by Esteemed Contributor
  • 6244 Views
  • 3 replies
  • 1 kudos
Latest Reply
StephanieAlba
Valued Contributor III
  • 1 kudos

You can get a static IP at the workspace level: https://docs.microsoft.com/en-us/azure/databricks/kb/cloud/azure-vnet-single-ip

2 More Replies
tthorpe
by New Contributor
  • 54974 Views
  • 3 replies
  • 3 kudos

How do I delete files from the DBFS?

I can't see where in the Databricks UI I can delete files that have been either uploaded or saved to the DBFS - how do I do this?

Latest Reply
SophieGou
New Contributor II
  • 3 kudos

Open a notebook and run the command dbutils.fs.rm("/FileStore/tables/your_table_name.csv"), referencing this link: https://docs.databricks.com/data/databricks-file-system.html
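For reference, a couple of variants of the same command, assuming it runs in a Databricks notebook where dbutils and display are available; the paths here are just examples, and the second argument to dbutils.fs.rm turns on recursive deletion of a directory.

```python
# List what is under the directory before deleting anything.
display(dbutils.fs.ls("/FileStore/tables/"))

# Remove a single file.
dbutils.fs.rm("/FileStore/tables/your_table_name.csv")

# Remove a whole directory tree; the second argument enables recursion.
dbutils.fs.rm("/FileStore/tables/old_exports/", True)
```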

2 More Replies
User16752239289
by Valued Contributor
  • 2976 Views
  • 1 replies
  • 1 kudos

Resolved! SparkR session failed to initialize

When running sparkR.session() I faced the error below: Spark package found in SPARK_HOME: /databricks/spark   Launching java with spark-submit command /databricks/spark/bin/spark-submit sparkr-shell /tmp/Rtmp5hnW8G/backend_porte9141208532d   Error: Could not f...

Latest Reply
User16752239289
Valued Contributor
  • 1 kudos

This is due to the fact that when users run their R scripts in RStudio, the R session is not shut down gracefully. Databricks is working on handling the R session better and removing the limit. As a workaround, you can create and run the init script below to increase...

MikeBrewer
by New Contributor II
  • 17900 Views
  • 3 replies
  • 0 kudos

I am trying to use SQL, but createOrReplaceTempView("myDataView") fails

I am trying to use SQL, but createOrReplaceTempView("myDataView") fails. I can create and display a DataFrame fine... import pandas as pd df = pd.DataFrame(['$3,000,000.00','$3,000.00', '$200.5', '$5.5'], columns = ['Amount']) df I add another cell, ...

Latest Reply
sachinthana
New Contributor II
  • 0 kudos

This worked for me. Thank you @acorson
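The accepted answer is not quoted above, but a common cause of this error is calling createOrReplaceTempView on a pandas DataFrame, which does not have that method. A minimal sketch of one way to fix it, assuming that is the issue here, is to convert to a Spark DataFrame first:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame(['$3,000,000.00', '$3,000.00', '$200.5', '$5.5'],
                   columns=['Amount'])

# createOrReplaceTempView exists on Spark DataFrames, not pandas DataFrames,
# so convert first and then register the view for SQL.
sdf = spark.createDataFrame(pdf)
sdf.createOrReplaceTempView("myDataView")

spark.sql("SELECT * FROM myDataView").show()
```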

2 More Replies
Kaniz_Fatma
by Community Manager
  • 821 Views
  • 1 replies
  • 0 kudos
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

A metaclass in Python is a class of a class that defines how a class behaves. A class is itself an instance of a metaclass. A class in Python defines how the instance of the class will behave.
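A tiny illustrative example (all names are made up): the metaclass customizes class creation, here by registering every class built with it, and the created classes are instances of that metaclass.

```python
# A minimal metaclass: classes created with it are registered automatically.
class RegisteringMeta(type):
    registry = {}

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.registry[name] = cls  # record every class built by this metaclass
        return cls

class Base(metaclass=RegisteringMeta):
    pass

class JsonHandler(Base):
    pass

print(type(JsonHandler))         # <class '__main__.RegisteringMeta'>
print(RegisteringMeta.registry)  # {'Base': ..., 'JsonHandler': ...}
```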

brickster_2018
by Esteemed Contributor
  • 1077 Views
  • 1 replies
  • 0 kudos

What are the best practices for Adaptive Query Execution?

What are the common configurations used, and which workloads will benefit?

Latest Reply
amr
Valued Contributor
  • 0 kudos

Leave it turned on. The bet is that with each Spark version released, AQE will get better and better, and it will eventually lead to a much better-optimised plan than manually trying to tune it.
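For reference, a small sketch of the main AQE-related settings one might check or set from a notebook; AQE is enabled by default on recent Spark and Databricks Runtime versions, so this is mostly about confirming the knobs rather than turning them on.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Core AQE switch plus two commonly relevant features.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions

print(spark.conf.get("spark.sql.adaptive.enabled"))
```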

brickster_2018
by Esteemed Contributor
  • 1064 Views
  • 1 replies
  • 0 kudos
Latest Reply
amr
Valued Contributor
  • 0 kudos

A Cartesian product is the worst type of join and you should always avoid it; it tends to produce N x M rows, where N and M are the left and right table cardinalities. Unless you specifically want to use it, you should avoid it and opt for faster joins (inner, l...
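A quick sketch contrasting the two (the row counts are arbitrary): a crossJoin materializes N x M rows, while a keyed inner join only keeps matching rows.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.range(1_000).withColumnRenamed("id", "key")     # N = 1,000 rows
right = spark.range(10_000).withColumnRenamed("id", "key")   # M = 10,000 rows

# Cartesian product: N x M = 10,000,000 rows before any filtering.
cartesian = left.crossJoin(right)
print(cartesian.count())

# Keyed inner join: only matching keys survive, far cheaper to compute.
keyed = left.join(right, on="key", how="inner")
print(keyed.count())
```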

User16790091296
by Contributor II
  • 1614 Views
  • 1 replies
  • 0 kudos
Latest Reply
amr
Valued Contributor
  • 0 kudos

You need to get the REST service API access tokens and make sure the Databricks VPC (or VNet on Azure) has connectivity to the VPC where this REST API service resides.
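Once connectivity is in place, calling the service from a notebook is ordinary HTTP. A minimal sketch is below; the endpoint, secret scope, and key names are hypothetical, and the token is read from a Databricks secret (dbutils is available in notebooks) rather than hard-coded.

```python
import requests

# Hypothetical endpoint; it must be reachable from the Databricks VPC/VNet
# (peering, private link, or a public endpoint).
API_URL = "https://internal-api.example.com/v1/orders"

# Hypothetical secret scope/key holding the REST service access token.
API_TOKEN = dbutils.secrets.get(scope="rest-service", key="api-token")

resp = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```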

brickster_2018
by Esteemed Contributor
  • 794 Views
  • 1 replies
  • 0 kudos
Latest Reply
amr
Valued Contributor
  • 0 kudos

If the data in your table is huge, try to combine OPTIMIZE with WHERE so you only perform OPTIMIZE on a subset of the data rather than all of it. See the documentation here.
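A small sketch of what that could look like from a notebook; the table and partition column are hypothetical, and note that the WHERE clause of OPTIMIZE may only filter on partition columns.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Only compact files in recent partitions instead of the whole table.
spark.sql("""
    OPTIMIZE events
    WHERE event_date >= '2024-01-01'
""")
```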

brickster_2018
by Esteemed Contributor
  • 2908 Views
  • 1 replies
  • 1 kudos

Z-order or Hilbert curve: which is better?

For OPTIMIZE on a Delta table, there is support for two space-filling curve algorithms. Which is better? Which one should I choose for my workload?

Latest Reply
amr
Valued Contributor
  • 1 kudos

The OPTIMIZE ZORDER operation now uses Hilbert space-filling curves by default. This approach provides better clustering characteristics than Z-order in higher dimensions. For Delta tables using OPTIMIZE ZORDER with many columns, Hilbert curves can s...
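For reference, the command you run does not change; a hedged example is below (the table and columns are hypothetical), with the choice of space-filling curve handled by the runtime under the hood.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cluster the data files on the columns most often used in filters;
# the runtime applies the space-filling curve internally.
spark.sql("OPTIMIZE events ZORDER BY (user_id, event_date)")
```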

MoJaMa
by Valued Contributor II
  • 2863 Views
  • 1 replies
  • 1 kudos
Latest Reply
amr
Valued Contributor
  • 1 kudos

Yes, Databricks supports instance pools that will come from your reserved instances from Microsoft (provided you have an agreement); make sure your instances are on-demand to benefit from that. The other way to get cheaper VMs is to use Spot instances, t...
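As an illustration of the spot option, here is a minimal sketch of creating a spot-backed instance pool through the Instance Pools API, assuming an Azure workspace; the host, token, node type, and pricing values are placeholders.

```python
import requests

# Hypothetical workspace URL and token for illustration.
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

pool_spec = {
    "instance_pool_name": "spot-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 0,
    "azure_attributes": {
        "availability": "SPOT_AZURE",  # use Azure spot VMs for cheaper capacity
        "spot_bid_max_price": -1,      # -1 = pay up to the current on-demand price
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=pool_spec,
)
resp.raise_for_status()
print(resp.json())  # {'instance_pool_id': ...}
```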

