Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

aladda
by Databricks Employee
  • 4781 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Spark's execution engine is designed to be lazy. In effect, you first build up your analytics/data processing request through a series of Transformations, which are then executed by an Action. Transformations are the kind of operations which will tran...
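To make the lazy model concrete, here is a toy sketch in plain Python. It is an analogy only, not Spark's API: the `LazyPipeline` class and its `map`, `filter`, and `collect` methods are hypothetical names picked for illustration. The transformations only record work; nothing runs until the action at the end.

```python
class LazyPipeline:
    """Toy model of lazy evaluation (hypothetical, not Spark's API)."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: record the step, run nothing yet.
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step, run nothing yet.
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: replay every recorded step over the data.
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []


def square(x):
    calls.append(x)  # side effect shows when the work really happens
    return x * x


pipeline = LazyPipeline(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # no call to square() has happened yet
result = pipeline.collect()       # the action triggers execution
```

Checking `calls_before_action` before calling `collect()` is the point of the sketch: building the chain is cheap, and the cost is only paid when an action asks for a result.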

aladda
by Databricks Employee
  • 22077 Views
  • 2 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

%run copies code from another notebook and executes it within the one it's called from. All variables defined in the notebook being called are therefore visible to the caller notebook. dbutils.notebook.run() is more around executing different note...

1 More Replies
aladda
by Databricks Employee
  • 76799 Views
  • 2 replies
  • 1 kudos
Latest Reply
aladda
Databricks Employee
  • 1 kudos

Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by the data-skipping algorithms of Delta Lake on Databricks to dramatically reduce the amount of data that needs to be read. Syntax fo...
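A rough way to see why co-locality helps data skipping is the plain-Python toy below. It is not Delta Lake code: `write_files` and `files_to_read` are hypothetical helpers, the "files" are just lists with min/max statistics, and a single-column sort stands in for Z-ordering (which in practice clusters on one or more columns).

```python
def write_files(values, rows_per_file):
    """Split values into 'files' and keep min/max stats for each one."""
    chunks = [values[i:i + rows_per_file]
              for i in range(0, len(values), rows_per_file)]
    return [{"min": min(c), "max": max(c), "rows": c} for c in chunks]


def files_to_read(files, target):
    """Data skipping: open only files whose [min, max] range can hold target."""
    return [f for f in files if f["min"] <= target <= f["max"]]


# Scattered layout: every file spans most of the value range.
scattered = [(i * 7) % 100 for i in range(100)]
# Clustered layout: related values sit together in the same files.
clustered = sorted(scattered)

scattered_hits = len(files_to_read(write_files(scattered, 10), 42))
clustered_hits = len(files_to_read(write_files(clustered, 10), 42))
```

With the scattered layout the min/max statistics cannot rule out any of the ten files, while the clustered layout narrows the lookup to a single file.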

1 More Replies
aladda
by Databricks Employee
  • 5076 Views
  • 2 replies
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

1. Download and install the Databricks ODBC driver.
2. Get the hostname, port, and HTTP path as described here; there are slightly different steps for a cluster (DDE) or a SQL endpoint (DSQL).
3. Get a PAT token.
4. Use the curl command to validate the network settings using the...

1 More Replies
User15787040559
by Databricks Employee
  • 4517 Views
  • 1 replies
  • 0 kudos

How can I create from scratch a brand new Dataframe with Null values using spark.createDataFrame()?

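# Explicit imports instead of a wildcard
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Every field is nullable (third argument True), so None is allowed
schema = StructType([
    StructField("c1", IntegerType(), True),
    StructField("c2", StringType(), True),
    StructField("c3", StringType(), True)])

# c3 is null in both rows
df = spark.createDataFrame([(1, "2", None), (3, "4", None)], schema)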

Latest Reply
Mooune_DBU
Databricks Employee
  • 0 kudos

df = spark.createDataFrame(sc.emptyRDD(), schema)
Can you try this?

User16826994223
by Databricks Employee
  • 4597 Views
  • 1 replies
  • 1 kudos

Resolved! cluster start Issues

Some of the jobs are failing in prod with the error message below. Can you please check and let us know the reason for this? These are running under a pool cluster.
Run result unavailable: job failed with error message
Unexpected failure while waiting for the...

Latest Reply
Mooune_DBU
Databricks Employee
  • 1 kudos

@Kunal Gaurav, this status code only occurs in one of two conditions:
1. We're able to request the instances for the cluster but can't bootstrap them in time.
2. We set up the containers on each instance but can't start the containers in time.
This is an edg...

RonanStokes_DB
by Databricks Employee
  • 2179 Views
  • 1 replies
  • 0 kudos
Latest Reply
Mooune_DBU
Databricks Employee
  • 0 kudos

Can you elaborate more on what you mean by "Encoder" (is it a serialization mechanism?) and what the custom data objects are? PySpark does support complex and binary formats as long as you can write your own serializer/deserializer.

User16826987838
by Databricks Employee
  • 2245 Views
  • 2 replies
  • 1 kudos

Prevent file downloads from /files/ URL

I would like to prevent file downloads via the /files/ URL. For example: https://customer.databricks.com/files/some-file-in-the-filestore.txt
Is there a way to do this?

Latest Reply
Mooune_DBU
Databricks Employee
  • 1 kudos

Unfortunately, this is not possible from the platform. You can, however, use an external Web Application Firewall (e.g. Akamai) to filter all web traffic to your workspaces. This can block web access to download root bucket data.

1 More Replies
jose_gonzalez
by Databricks Employee
  • 3986 Views
  • 1 replies
  • 1 kudos

Resolved! Are there any limitations on my broadcast joins?

I would like to know if there are any broadcast joins limitations.

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Yes, there are a couple of limitations. Please find the details below:
> It will not perform a broadcast join if the table has 512 million or more rows
> It will not perform a broadcast join if the table is larger than 8GB
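As a quick pre-check, the two limits above can be written as a small helper. This is a sketch in plain Python, not a Spark API: `within_broadcast_limits` and the two constants are hypothetical names, and reading "8GB" as 8 GiB is an assumption.

```python
MAX_BROADCAST_ROWS = 512_000_000     # 512 million or more rows is rejected
MAX_BROADCAST_BYTES = 8 * 1024 ** 3  # "8GB", read here as 8 GiB (assumption)


def within_broadcast_limits(row_count, size_bytes):
    """Return True if a table stays under both limits quoted in the reply."""
    return row_count < MAX_BROADCAST_ROWS and size_bytes <= MAX_BROADCAST_BYTES


small_ok = within_broadcast_limits(1_000_000, 200 * 1024 ** 2)
too_many_rows = within_broadcast_limits(512_000_000, 200 * 1024 ** 2)
too_large = within_broadcast_limits(1_000_000, 9 * 1024 ** 3)
```

Row counts and sizes would have to come from your own table statistics; the helper only encodes the thresholds.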

jose_gonzalez
by Databricks Employee
  • 4074 Views
  • 1 replies
  • 1 kudos

Resolved! Getting broadcast join errors

I would like to know how to disable broadcast join in my job to avoid this error message. Is there a Spark configuration?

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

You can disable broadcast join by adding the following Spark configuration to your notebook:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
In addition, you can also add this configuration to your cluster:
spark.sql.autoBroadcastJoinThreshold...

jose_gonzalez
by Databricks Employee
  • 3729 Views
  • 1 replies
  • 0 kudos

Resolved! how to troubleshoot Python version mismatch error in DBconnect?

I'm getting some weird messages when trying to run my Dbconnect. I would like to know if there is a troubleshooting guide to solve Python version mismatch errors.

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

We have a troubleshooting section in our docs that could help you to solve this issue. Please check the docs here https://docs.databricks.com/dev-tools/databricks-connect.html#python-version-mismatch

jose_gonzalez
by Databricks Employee
  • 2932 Views
  • 1 replies
  • 0 kudos

Resolved! can I use Dbconnect for my structured streaming jobs?

I would like to know if I can use Dbconnect to run all my structured streaming jobs.

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Unfortunately, no. You cannot use Dbconnect for your streaming jobs. This is one of Dbconnect's limitations. For more details please check the docs: https://docs.databricks.com/dev-tools/databricks-connect.html#limitations

User16826992666
by Databricks Employee
  • 3963 Views
  • 1 replies
  • 0 kudos

Resolved! How often should I run OPTIMIZE on my Delta Tables?

I know it's important to periodically run Optimize on my Delta tables, but how often should I be doing this? Am I supposed to do this after every time I load data?

Latest Reply
sajith_appukutt
Databricks Employee
  • 0 kudos

It depends on how frequently you update the table and how often you read it. If you have a daily ETL job updating a Delta table, it might make sense to run OPTIMIZE at the end of it so that subsequent reads would benefit from the performance imp...

User16826992666
by Databricks Employee
  • 5180 Views
  • 1 replies
  • 0 kudos

Resolved! How do I know which worker type to choose when creating my cluster?

I am new to using Databricks and want to create a cluster, but there are many different worker types to choose from. How do I know which worker type is the right type for my use case?

Latest Reply
sajith_appukutt
Databricks Employee
  • 0 kudos

For Delta workloads, where you could benefit from caching, it is recommended to use storage-optimized instances that come with NVMe SSDs. For other workloads, it would be a good idea to check Ganglia metrics to see whether your workload is Cpu/Memory ...
