Data Engineering

Forum Posts

aladda
by Honored Contributor II
  • 11164 Views
  • 2 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

%run copies code from another notebook and executes it within the one it's called from, so all variables defined in the called notebook are visible to the caller notebook. dbutils.notebook.run() is more around executing different note...
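To make the contrast concrete, here is a minimal sketch; the child notebook path ./child and the argument names are hypothetical:

# %run inlines the child notebook, so variables it defines become visible here
# (magic commands must sit in their own cell; shown as comments for context):
# %run ./child
# print(x)  # works if ./child defines x

# dbutils.notebook.run() instead executes the child as a separate ephemeral job;
# its variables are not visible here, and only the string the child passes to
# dbutils.notebook.exit() comes back as the return value.
result = dbutils.notebook.run("./child", 60, {"date": "2021-01-01"})  # path, timeout (s), arguments
print(result)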

1 More Replies
aladda
by Honored Contributor II
  • 21813 Views
  • 2 replies
  • 1 kudos
Latest Reply
aladda
Honored Contributor II
  • 1 kudos

Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by the data-skipping algorithms in Delta Lake on Databricks to dramatically reduce the amount of data that needs to be read. Syntax fo...
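For reference, the command is OPTIMIZE with a ZORDER BY clause; a sketch run from Python, with hypothetical table and column names:

# co-locate rows with similar eventType values into the same files
spark.sql("OPTIMIZE events ZORDER BY (eventType)")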

1 More Replies
aladda
by Honored Contributor II
  • 1591 Views
  • 2 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

1. Download and install the Databricks ODBC driver.
2. Get the hostname, port, and HTTP path as described here; the steps differ slightly for a cluster (DDE) versus a SQL endpoint (DSQL).
3. Get a PAT token.
4. Use the curl command to validate the network settings using the...
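If the network checks pass, one way to test the connection end-to-end from Python is via pyodbc; this is only a sketch with placeholder values, and the connection keys follow the Simba Spark ODBC driver's conventions (verify them against the driver docs):

import pyodbc  # assumes pyodbc and the Databricks (Simba Spark) ODBC driver are installed

conn = pyodbc.connect(
    "Driver=Simba Spark ODBC Driver;"  # driver name as registered on your system
    "Host=<workspace-hostname>;Port=443;"
    "HTTPPath=<http-path-from-cluster-or-endpoint>;"
    "SSL=1;ThriftTransport=2;"
    "AuthMech=3;UID=token;PWD=<personal-access-token>",  # AuthMech 3 = PAT auth
    autocommit=True,
)
cursor = conn.cursor()
cursor.execute("SELECT 1")  # trivial query to confirm connectivity
print(cursor.fetchone())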

1 More Replies
aladda
by Honored Contributor II
  • 8061 Views
  • 2 replies
  • 2 kudos
Latest Reply
aladda
Honored Contributor II
  • 2 kudos

The ANALYZE command specifically captures statistics which are relevant for the Cost-Based Optimizer to make better decisions. The 32 columns of statistics that Delta auto-collects are specifically for data skipping. This is separate from the ANALYZE ...
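For contrast, a sketch of the ANALYZE command that feeds the Cost-Based Optimizer (the table name is hypothetical):

spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR ALL COLUMNS")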

1 More Replies
User15787040559
by New Contributor III
  • 1576 Views
  • 1 reply
  • 0 kudos

How can I create from scratch a brand new DataFrame with null values using spark.createDataFrame()?

from pyspark.sql.types import *

schema = StructType([
    StructField("c1", IntegerType(), True),
    StructField("c2", StringType(), True),
    StructField("c3", StringType(), True)
])
df = spark.createDataFrame([(1, "2", None), (3, "4", None)], schema)

Latest Reply
Mooune_DBU
Valued Contributor
  • 0 kudos

Can you try this?

df = spark.createDataFrame(sc.emptyRDD(), schema)
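If you'd rather not touch the RDD API, passing an empty list with the same schema should work too:

df = spark.createDataFrame([], schema)  # empty DataFrame, same three nullable columns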

User16826994223
by Honored Contributor III
  • 2636 Views
  • 1 reply
  • 1 kudos

Resolved! Cluster start issues

Some of the jobs are failing in prod with the below error message. Can you please check and let us know the reason for this? These are running under a pool cluster.
Run result unavailable: job failed with error message
Unexpected failure while waiting for the...

Latest Reply
Mooune_DBU
Valued Contributor
  • 1 kudos

@Kunal Gaurav, this status code only occurs in one of two conditions:
1. We're able to request the instances for the cluster but can't bootstrap them in time.
2. We set up the containers on each instance, but can't start the containers in time.
This is an edg...

RonanStokes_DB
by New Contributor III
  • 678 Views
  • 1 reply
  • 0 kudos
Latest Reply
Mooune_DBU
Valued Contributor
  • 0 kudos

Can you elaborate more on what you mean by "Encoder" (is it a serializing mechanism), and what are the custom data objects? PySpark does support complex and binary formats as long as you can write your own serializer/deserializer.
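If a full custom Encoder isn't required, one workable pattern is to serialize the objects yourself into a BinaryType column; a sketch, where the Point class stands in for the custom data objects:

import pickle
from pyspark.sql.types import StructType, StructField, StringType, BinaryType

class Point:  # hypothetical custom object
    def __init__(self, x, y):
        self.x, self.y = x, y

schema = StructType([
    StructField("id", StringType(), True),
    StructField("payload", BinaryType(), True),
])
# serialize on write...
df = spark.createDataFrame([("a", pickle.dumps(Point(1, 2)))], schema)
# ...deserialize on read
p = pickle.loads(df.first()["payload"])
print(p.x, p.y)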

User16826987838
by Contributor
  • 914 Views
  • 2 replies
  • 1 kudos

Prevent file downloads from /files/ URL

I would like to prevent file downloads via the /files/ URL. For example: https://customer.databricks.com/files/some-file-in-the-filestore.txt
Is there a way to do this?

Latest Reply
Mooune_DBU
Valued Contributor
  • 1 kudos

Unfortunately, this is not possible from the platform. You can, however, use an external Web Application Firewall (e.g. Akamai) to filter all web traffic to your workspaces. This can block web access used to download root bucket data.

1 More Replies
jose_gonzalez
by Moderator
  • 1322 Views
  • 1 reply
  • 1 kudos

Resolved! Are there any limitations on my broadcast joins?

I would like to know if there are any broadcast joins limitations.

Latest Reply
jose_gonzalez
Moderator
  • 1 kudos

Yes, there are a couple of limitations. Please find the details below:
> It will not perform a broadcast join if the table has 512 million or more rows.
> It will not perform a broadcast join if the table is larger than 8 GB.
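Below those limits you can still request a broadcast explicitly; a minimal sketch with hypothetical DataFrames large_df and small_df sharing an id column:

from pyspark.sql.functions import broadcast

# the broadcast() hint ships small_df whole to every executor, avoiding a shuffle
joined = large_df.join(broadcast(small_df), "id")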

jose_gonzalez
by Moderator
  • 1615 Views
  • 1 reply
  • 1 kudos

Resolved! Getting broadcast join errors

I would like to know how to disable broadcast joins in my job to avoid this error message. Is there a Spark configuration?

Latest Reply
jose_gonzalez
Moderator
  • 1 kudos

You can disable broadcast joins by adding the following Spark configuration to your notebook:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
In addition, you can also add this configuration to your cluster:
spark.sql.autoBroadcastJoinThreshold...
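To confirm the session-level setting took effect (a value of -1 disables broadcast joins entirely):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))  # expect -1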

jose_gonzalez
by Moderator
  • 1288 Views
  • 1 reply
  • 0 kudos

Resolved! How to troubleshoot Python version mismatch errors in Dbconnect?

I'm getting some weird messages when trying to run Dbconnect. I would like to know if there is a troubleshooting guide to solve Python version mismatch errors.

Latest Reply
jose_gonzalez
Moderator
  • 0 kudos

We have a troubleshooting section in our docs that could help you to solve this issue. Please check the docs here https://docs.databricks.com/dev-tools/databricks-connect.html#python-version-mismatch
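As a quick first check before working through that guide, compare your local interpreter with the cluster's Python (Databricks Connect requires the client's minor Python version to match the cluster's, e.g. a 3.7 client against a 3.7 cluster):

import sys

print(f"local Python: {sys.version_info.major}.{sys.version_info.minor}")  # must match the cluster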

jose_gonzalez
by Moderator
  • 993 Views
  • 1 reply
  • 0 kudos

Resolved! Can I use Dbconnect for my structured streaming jobs?

I would like to know if I can use Dbconnect to run all my structured streaming jobs.

Latest Reply
jose_gonzalez
Moderator
  • 0 kudos

Unfortunately, no. You cannot use Dbconnect for your streaming jobs. This is one of Dbconnect's limitations. For more details please check the docs: https://docs.databricks.com/dev-tools/databricks-connect.html#limitations

User16826992666
by Valued Contributor
  • 1308 Views
  • 1 reply
  • 0 kudos

Resolved! How often should I run OPTIMIZE on my Delta Tables?

I know it's important to periodically run Optimize on my Delta tables, but how often should I be doing this? Am I supposed to do this after every time I load data?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

It would depend on how frequently you update the table and how often you read it. If you have a daily ETL job updating a Delta table, it might make sense to run OPTIMIZE at the end of it so that subsequent reads would benefit from the performance imp...
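A sketch of that pattern at the tail of a daily ETL job (the table name events is hypothetical):

# the daily load appends to the Delta table...
df.write.format("delta").mode("append").saveAsTable("events")
# ...then compact small files so subsequent reads benefit
spark.sql("OPTIMIZE events")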

User16826992666
by Valued Contributor
  • 1619 Views
  • 1 reply
  • 0 kudos

Resolved! How do I know which worker type to choose when creating my cluster?

I am new to using Databricks and want to create a cluster, but there are many different worker types to choose from. How do I know which worker type is the right type for my use case?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

For Delta workloads where you could benefit from caching, it is recommended to use storage-optimized instances that come with NVMe SSDs. For other workloads, it would be a good idea to check Ganglia metrics to see whether your workload is CPU/memory ...
