- 1576 Views
- 1 replies
- 0 kudos
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Schema with one integer column and two string columns
schema = StructType([
    StructField("c1", IntegerType(), True),
    StructField("c2", StringType(), True),
    StructField("c3", StringType(), True)])
df = spark.createDataFrame([(1, "2", None), (3, "4", None)], schema)
Latest Reply
Can you try this?

df = spark.createDataFrame(sc.emptyRDD(), schema)
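A minimal end-to-end sketch, assuming a standard Databricks notebook where spark and sc are already defined:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Same schema as in the question above
schema = StructType([
    StructField("c1", IntegerType(), True),
    StructField("c2", StringType(), True),
    StructField("c3", StringType(), True)])

# An empty RDD plus an explicit schema yields an empty, typed DataFrame
empty_df = spark.createDataFrame(sc.emptyRDD(), schema)
empty_df.printSchema()  # columns and types are preserved even with zero rows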
- 2636 Views
- 1 replies
- 1 kudos
Some of the jobs are failing in prod with the below error message. Can you please check and let us know the reason for this? These are running under a pool cluster.
Run result unavailable: job failed with error message
Unexpected failure while waiting for the...
Latest Reply
@Kunal Gaurav, this status code only occurs in one of two conditions:
- We're able to request the instances for the cluster but can't bootstrap them in time
- We set up the containers on each instance, but can't start the containers in time
This is an edg...
- 914 Views
- 2 replies
- 1 kudos
I would like to prevent file downloads via the /files/ URL. For example: https://customer.databricks.com/files/some-file-in-the-filestore.txt
Is there a way to do this?
Latest Reply
Unfortunately, this is not possible from the platform. You can, however, use an external Web Application Firewall (e.g. Akamai) to filter all web traffic to your workspaces and block web access used to download root bucket data.
1 More Reply
- 1322 Views
- 1 replies
- 1 kudos
I would like to know if there are any broadcast join limitations.
Latest Reply
Yes, there are a couple of limitations. Please find the details below:
- It will not perform a broadcast join if the table has 512 million or more rows
- It will not perform a broadcast join if the table is larger than 8 GB
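For context, you can still request a broadcast explicitly with the broadcast() hint when the smaller table is safely under those limits; a minimal sketch with hypothetical df_large and df_small DataFrames:

from pyspark.sql.functions import broadcast

# df_small must be well under the row-count and size limits listed above
joined = df_large.join(broadcast(df_small), on="id", how="inner")

# Confirm the physical plan uses BroadcastHashJoin rather than SortMergeJoin
joined.explain()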
- 1615 Views
- 1 replies
- 1 kudos
I would like to know how to disable broadcast joins in my job to avoid this error message. Is there a Spark configuration?
Latest Reply
You can disable broadcast joins by adding the following Spark configuration to your notebook:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

In addition, you can also add this configuration to your cluster:

spark.sql.autoBroadcastJoinThreshold -1
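A quick sketch to confirm the setting took effect in a notebook session:

# Disable automatic broadcast joins for the current session
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Read the value back; it is returned as a string
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))  # "-1"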
- 1288 Views
- 1 replies
- 0 kudos
I'm getting some weird messages when trying to run Dbconnect. I would like to know if there is a troubleshooting guide for solving Python version mismatch errors.
Latest Reply
We have a troubleshooting section in our docs that could help you solve this issue. Please check the docs here: https://docs.databricks.com/dev-tools/databricks-connect.html#python-version-mismatch
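A minimal local check before working through the docs; Databricks Connect requires the client's Python minor version to match the cluster's (a hypothetical 3.8 runtime is assumed in the comments):

import sys

# Print the local interpreter version; its minor version must match the
# Python version on the cluster (e.g. a runtime with Python 3.8 needs a
# local 3.8.x interpreter)
print(".".join(str(v) for v in sys.version_info[:3]))

# If they differ, point PySpark at a matching interpreter before running
# databricks-connect (hypothetical path):
#   export PYSPARK_PYTHON=/usr/local/bin/python3.8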
- 993 Views
- 1 replies
- 0 kudos
I would like to know if I can use Dbconnect to run all my Structured Streaming jobs.
Latest Reply
Unfortunately, no. You cannot use Dbconnect for your streaming jobs. This is one of Dbconnect's limitations. For more details please check the docs: https://docs.databricks.com/dev-tools/databricks-connect.html#limitations
- 1308 Views
- 1 replies
- 0 kudos
I know it's important to periodically run OPTIMIZE on my Delta tables, but how often should I be doing this? Am I supposed to do this every time I load data?
Latest Reply
It would depend on how frequently you update the table and how often you read it. If you have a daily ETL job updating a Delta table, it might make sense to run OPTIMIZE at the end of it so that subsequent reads would benefit from the performance improvements.
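A minimal sketch of that pattern, assuming a hypothetical Delta table named events and a daily batch DataFrame daily_df:

# Append the day's batch to the Delta table
daily_df.write.format("delta").mode("append").saveAsTable("events")

# Compact the small files written by the append so subsequent reads
# scan fewer, larger files
spark.sql("OPTIMIZE events")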
- 1619 Views
- 1 replies
- 0 kudos
I am new to using Databricks and want to create a cluster, but there are many different worker types to choose from. How do I know which worker type is the right type for my use case?
Latest Reply
For Delta workloads, where you could benefit from caching, it is recommended to use storage-optimized instances that come with NVMe SSDs. For other workloads, it would be a good idea to check Ganglia metrics to see whether your workload is CPU or memory bound.