Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

skolukmar
by New Contributor
  • 1142 Views
  • 2 replies
  • 0 kudos

Delta Live Tables: control microbatch size

A Delta Live Tables pipeline reads a Delta table on Databricks. Is it possible to limit the size of a microbatch during data transformation? I am thinking about a solution used by Spark Structured Streaming that enables control of batch size using: .optio...
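For context, a minimal sketch of the Structured Streaming rate-limit options the question alludes to, assuming a Delta source (path and limits are illustrative; whether a given DLT channel honors them is worth verifying against the current docs):

```python
# Hedged sketch: Delta streaming sources accept rate-limit options that cap
# how much data lands in each microbatch.
df = (spark.readStream
      .format("delta")
      .option("maxFilesPerTrigger", 100)    # at most 100 files per microbatch
      .option("maxBytesPerTrigger", "1g")   # soft cap on bytes per microbatch
      .load("/path/to/source_table"))
```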

Latest Reply
lprevost
Contributor II
  • 0 kudos

One other thought -- if you are considering using the pandas_udf API, there is a way to control batch size there: see the pandas_udf guide, and note the comments there about the Arrow batch size params.
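A minimal sketch of that Arrow batch-size control (the config key is standard Spark; the UDF itself is illustrative):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Standard Spark setting: caps how many rows go into each Arrow record
# batch handed to a pandas_udf (the default is 10000).
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    # Each call receives at most maxRecordsPerBatch rows.
    return v + 1
```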

1 More Replies
gpierard
by New Contributor III
  • 22368 Views
  • 3 replies
  • 1 kudos

Resolved! How to list all Spark session config variables

In Databricks I can set a config variable at session level, but it is not found in the context variables:
spark.conf.set(f"dataset.bookstore", '123')  # dataset_bookstore
spark.conf.get(f"dataset.bookstore")  # 123
scf = spark.sparkContext.getConf()
allc =...

Latest Reply
RyanHager
Contributor
  • 1 kudos

A while back I think I found a way to get Python to list all the config values, but I was not able to re-create it. Just make one of your notebook code cells Scala (first line) and use the second line:
%scala
(spark.conf.getAll).foreach(println)
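If you would rather stay in Python, a hedged alternative: the SQL SET command lists session-level configs (including values set via spark.conf.set), which sparkContext.getConf() does not:

```python
# Session-level configs, including keys set with spark.conf.set:
spark.sql("SET").show(truncate=False)

# Cluster-level SparkConf only (will NOT include spark.conf.set values):
for key, value in spark.sparkContext.getConf().getAll():
    print(key, value)
```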

2 More Replies
Twilight
by Contributor
  • 1409 Views
  • 2 replies
  • 3 kudos

web terminal accessing /Workspace/Users under tmux

I found this old post (https://community.databricks.com/t5/data-engineering/databricks-cluster-web-terminal-different-permissions-with-tmux/td-p/26461) that was never really answered. I am having the same problem. If I am in the raw terminal, I can a...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 3 kudos

Hi @Twilight, To resolve this, ensure the `tmux` session runs under the same user context as the raw terminal, verify environment variables are set correctly, initialize `tmux` with the same shell and environment settings, check for any ACLs on the `...

1 More Replies
oripsk
by New Contributor
  • 699 Views
  • 1 reply
  • 0 kudos

Column ordering when querying a clustered table

If I have a table which is clustered by (a, b, c) and I issue a query filtering on (b, c), will the query benefit from the clustering optimization on (a, b, c)?

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @oripsk, When you query a table clustered by columns (a, b, c) and filter on (b, c), the query will not fully benefit from the clustering optimization. Clustering works best when the query filter includes the leading column(s) in the clustering or...
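As an illustration of the leading-column point (table and column names are hypothetical, and CLUSTER BY requires a recent runtime):

```python
# Hypothetical liquid-clustered table on (a, b, c).
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_tbl (a INT, b INT, c INT, payload STRING)
    CLUSTER BY (a, b, c)
""")

# A filter that includes the leading column 'a' benefits most:
spark.sql("SELECT * FROM demo_tbl WHERE a = 1 AND b = 2").show()

# Per the reply above, filtering on (b, c) alone skips less data:
spark.sql("SELECT * FROM demo_tbl WHERE b = 2 AND c = 3").show()
```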

Anonymous
by Not applicable
  • 21395 Views
  • 2 replies
  • 3 kudos
Latest Reply
zerasmus
Contributor
  • 3 kudos

On newer Databricks Runtime versions, %conda commands are not supported. You can use %pip commands instead:
%pip list
I have tested this on Databricks Runtime 15.4 LTS Beta.

1 More Replies
ad_k
by New Contributor
  • 730 Views
  • 1 reply
  • 0 kudos

Create delta files from Unity Catalog Objects

Hello, I have tables created in Unity Catalog that point to the raw area. From these tables I need to create a data model (facts and dimensions) that will aggregate this data and transform certain things. Then I need to store it in the Azure Data Lake in de...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @ad_k, To create a data model from Unity Catalog tables and store it in Azure data lake in Delta format, use Databricks Notebooks with PySpark or SQL. The process involves reading raw data from Unity Catalog, transforming it into fact and dimensio...
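A minimal sketch of that flow, with catalog, schema, column, and storage names as placeholder assumptions:

```python
from pyspark.sql import functions as F

# Read raw tables registered in Unity Catalog (names are placeholders).
orders = spark.read.table("main.raw.orders")
customers = spark.read.table("main.raw.customers")

# Build a simple dimensional model.
dim_customer = (customers
                .select("customer_id", "name", "region")
                .dropDuplicates(["customer_id"]))
fact_orders = (orders
               .groupBy("customer_id", "order_date")
               .agg(F.sum("amount").alias("total_amount")))

# Write to ADLS in Delta format (the abfss path is illustrative).
(fact_orders.write.format("delta").mode("overwrite")
 .save("abfss://gold@mystorageaccount.dfs.core.windows.net/fact_orders"))
```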

TimB
by New Contributor III
  • 15222 Views
  • 9 replies
  • 3 kudos

Passing multiple paths to .load in autoloader

I am trying to use Auto Loader to load data from two different blobs within the same account so that Spark will discover the data asynchronously. However, when I try this, it doesn't work and I get the error outlined below. Can anyone point out w...
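One common workaround, sketched under the assumption that the two containers share a schema (paths are illustrative; .load() takes a single path):

```python
# One Auto Loader stream per container, then union them.
stream_a = (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("cloudFiles.schemaLocation", "/tmp/schemas/container_a")
            .load("wasbs://container-a@myaccount.blob.core.windows.net/data"))

stream_b = (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("cloudFiles.schemaLocation", "/tmp/schemas/container_b")
            .load("wasbs://container-b@myaccount.blob.core.windows.net/data"))

combined = stream_a.unionByName(stream_b)
```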

Latest Reply
TimB
New Contributor III
  • 3 kudos

If we were to upgrade to ADLS Gen2, but retain the same structure, would there be scope for this method above to be improved (besides moving to notification mode)?

8 More Replies
Venky
by New Contributor III
  • 112552 Views
  • 18 replies
  • 20 kudos

Resolved! I am trying to read a CSV file using Databricks and I am getting an error like: FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/tables/world_bank.csv'

I am trying to read a CSV file using Databricks and I am getting an error like: FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/tables/world_bank.csv'

Latest Reply
Alexis
New Contributor III
  • 20 kudos

Hi, you can try:
my_df = (spark.read.format("csv")
    .option("inferSchema", "true")   # to get the types from your data
    .option("sep", ",")              # if your file is using "," as separator
    .option("header", "true")        # if you...
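The path scheme is often the culprit with this error: Spark APIs take dbfs:/ URIs, while local-file libraries (pandas, open()) need the /dbfs FUSE mount. A hedged sketch:

```python
# Spark reads use the dbfs:/ scheme:
df = (spark.read.option("header", "true")
      .csv("dbfs:/FileStore/tables/world_bank.csv"))

# Local-file libraries such as pandas need the /dbfs FUSE path instead:
import pandas as pd
pdf = pd.read_csv("/dbfs/FileStore/tables/world_bank.csv")
```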

17 More Replies
mexcram
by New Contributor II
  • 2437 Views
  • 2 replies
  • 2 kudos

Glue database and saveAsTable

Hello all, I am saving my DataFrame as a Delta table to S3 and AWS Glue using PySpark and `saveAsTable`. So far I can do this, but something curious happens when I try to change the `path` (as an option or as an argument of `saveAsTable`). The location...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 2 kudos

Hi @mexcram, When saving a DataFrame as a Delta Table to S3 and AWS Glue using PySpark's `saveAsTable`, changing the `path` option or argument often results in the Glue table location being set to a placeholder path (e.g., `s3://my-bucket/my_table-__...
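For reference, the two ways of passing the path that the post mentions, sketched with placeholder bucket and table names:

```python
# Path passed as a writer option:
(df.write.format("delta")
   .option("path", "s3://my-bucket/my_table")
   .saveAsTable("my_glue_db.my_table"))

# Path passed as a keyword argument of saveAsTable:
df.write.format("delta").saveAsTable("my_glue_db.my_table",
                                     path="s3://my-bucket/my_table")
```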

1 More Replies
SeyedA
by New Contributor
  • 901 Views
  • 1 reply
  • 0 kudos

Debug UDFs using VSCode extension

I am trying to debug my Python script using the Databricks VS Code extension. I am using udf and pandas_udf in my script. Everything works fine except when the execution gets to the udf and pandas_udf usages. It then complains that "SparkContext or SparkS...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @SeyedA, To resolve this, first, ensure your SparkSession is properly initialized in your script. Be aware of the limitations of Databricks Connect, which might affect UDFs, and consider running UDFs locally in a simple Spark environment for debug...
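A sketch of explicit session initialization with Databricks Connect (which the VS Code extension uses); whether UDFs execute this way depends on your Connect and runtime versions:

```python
from databricks.connect import DatabricksSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Build the remote session explicitly so UDF execution can find it;
# connection details come from your configured Databricks profile.
spark = DatabricksSession.builder.getOrCreate()

@udf(IntegerType())
def add_one(x):
    return x + 1

spark.range(3).select(add_one("id")).show()
```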

hpant1
by New Contributor III
  • 647 Views
  • 1 reply
  • 0 kudos

Not able to write to the schema stored in the external location

I have three tables and I am trying to write them to the bronze schema, which is stored in the external location. I am able to write two of them, but for the third one I am getting an error (screenshot attached). Not sure why this is the case; I am doing exactly the same thing.

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @hpant1, To fix this, ensure that the SAS token is properly configured with the necessary write permissions and regenerate it if needed. Verify that the storage account is accessible and check for network issues. Confirm that IAM policies grant th...

hpant
by New Contributor III
  • 6170 Views
  • 5 replies
  • 1 kudos

Autoloader error "Failed to infer schema for format json from existing files in input"

I have two JSON files in one of the locations in Azure Gen2 storage, e.g. '/mnt/abc/Testing/'. When I try to read the files using Auto Loader I am getting this error: "Failed to infer schema for format json from existing files in input path /mnt/abc...

Latest Reply
holly
Databricks Employee
  • 1 kudos

Hi @hpant would you consider testing the new VARIANT type for your JSON data? I appreciate it will require rewriting the next step in your pipeline, but should be more robust wrt errors.  Disclaimer: I haven't personally tested variant with Autoloade...
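An untested sketch of that suggestion, mirroring the disclaimer above: on recent runtimes the JSON reader can land everything in a single VARIANT column via the singleVariantColumn option, but whether this composes with Auto Loader on your runtime is an assumption to verify:

```python
# Untested: JSON into one VARIANT column through Auto Loader.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("singleVariantColumn", "raw")  # all fields land in one VARIANT column
      .option("cloudFiles.schemaLocation", "/mnt/abc/_schemas/testing")
      .load("/mnt/abc/Testing/"))
```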

4 More Replies
Devsql
by New Contributor III
  • 809 Views
  • 1 reply
  • 1 kudos

For a given Notebook, how to find the calling Job

Hi Team, I came across a situation where I have a notebook but I am not able to find the job/DLT pipeline which calls it. So is there any query or any mechanism with which I can find out (or list) the jobs/scripts which have called a given Notebo...

Data Engineering
Azure Databricks
Latest Reply
Devsql
New Contributor III
  • 1 kudos

Hi @Retired_mod, would you like to help with the above question?

Rajdeepak
by New Contributor
  • 1973 Views
  • 1 reply
  • 0 kudos

How to restart a failed Spark streaming job from the failure point

I am setting up an ETL process using PySpark. My input is a Kafka stream and I am writing output to multiple sinks (one into Kafka and another into cloud storage). I am writing checkpoints to the cloud storage. The issue I am facing is that, whenever m...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @Rajdeepak, To address data redundancy issues caused by reprocessing during application restarts, consider these strategies: Ensure proper checkpointing by configuring and protecting your checkpoint directory; manage Kafka offsets correctly by set...
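A sketch of the checkpointing pattern described: one checkpoint location per sink, so each query resumes from its own committed offsets on restart (brokers, topics, and paths are placeholders):

```python
src = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "input-topic")
       .load())

# Each sink gets its OWN checkpoint; on restart, each query resumes from
# the offsets recorded in its checkpoint instead of reprocessing everything.
kafka_query = (src.writeStream.format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("topic", "output-topic")
               .option("checkpointLocation", "s3://bucket/checkpoints/kafka_sink")
               .start())

storage_query = (src.writeStream.format("delta")
                 .option("checkpointLocation", "s3://bucket/checkpoints/delta_sink")
                 .start("s3://bucket/output/delta"))
```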

reachrishav
by New Contributor II
  • 2033 Views
  • 1 reply
  • 0 kudos

What is the equivalent of "if exists()" in databricks sql?

What is the equivalent of the below SQL Server syntax in Databricks SQL? There are cases where I need to execute a block of SQL code on certain conditions. I know this can be achieved with spark.sql(), but the problem with spark.sql() is it does not p...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @reachrishav, In Databricks SQL, you can replicate SQL Server's conditional logic using `CASE` statements and `MERGE` operations. Since Databricks SQL doesn't support `IF EXISTS` directly, you can create a temporary view to check your condition an...
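A sketch of driving that conditional from Python, since Databricks SQL has no IF EXISTS block (table and predicate are placeholders):

```python
# Emulate "IF EXISTS (...) BEGIN ... END" by testing the condition first.
rows_exist = spark.sql(
    "SELECT 1 FROM my_table WHERE status = 'pending' LIMIT 1"
).count() > 0

if rows_exist:
    spark.sql("UPDATE my_table SET status = 'processed' "
              "WHERE status = 'pending'")
```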

