Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Venky
by New Contributor III
  • 112122 Views
  • 18 replies
  • 20 kudos

Resolved! I am trying to read a CSV file using Databricks and getting an error: FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStore/tables/world_bank.csv'

Latest Reply
Alexis
New Contributor III
  • 20 kudos

Hi, you can try:
my_df = (spark.read.format("csv")
    .option("inferSchema", "true")   # to get the types from your data
    .option("sep", ",")              # if your file is using "," as separator
    .option("header", "true")        # if you...

17 More Replies
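For readers hitting the same error: Spark APIs read DBFS through the `dbfs:/` URI scheme, while plain-Python file I/O uses the `/dbfs` FUSE mount, and mixing the two produces exactly this FileNotFoundError. A minimal sketch of the distinction (the helper name is illustrative; the file path comes from the thread):

```python
def dbfs_to_fuse(path: str) -> str:
    """Translate a dbfs:/ URI to the /dbfs FUSE path used by plain-Python I/O."""
    prefix = "dbfs:/"
    return "/dbfs/" + path[len(prefix):] if path.startswith(prefix) else path

def read_csv(spark, path="dbfs:/FileStore/tables/world_bank.csv"):
    # Spark expects the dbfs:/ URI, not the /dbfs path from the error message.
    return (spark.read.format("csv")
            .option("inferSchema", "true")   # derive column types from the data
            .option("sep", ",")              # field separator
            .option("header", "true")        # first row holds column names
            .load(path))
```

If `open()` or pandas raises Errno 2 on `/dbfs/...`, the file was likely never uploaded; listing the directory with `dbutils.fs.ls("dbfs:/FileStore/tables/")` confirms what is actually there.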
mexcram
by New Contributor II
  • 2388 Views
  • 2 replies
  • 2 kudos

Glue database and saveAsTable

Hello all, I am saving my data frame as a Delta Table to S3 and AWS Glue using pyspark and `saveAsTable`, so far I can do this but something curious happens when I try to change the `path` (as an option or as an argument of `saveAsTable`). The location...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 2 kudos

Hi @mexcram, When saving a DataFrame as a Delta Table to S3 and AWS Glue using PySpark's `saveAsTable`, changing the `path` option or argument often results in the Glue table location being set to a placeholder path (e.g., `s3://my-bucket/my_table-__...

1 More Replies
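A minimal sketch of the external-table pattern discussed here (table, database, and bucket names are placeholders): setting an explicit `path` makes `saveAsTable` register the table in the metastore (Glue in this case) while keeping the data at that location, so building the location deterministically avoids placeholder paths.

```python
def table_location(database: str, table: str, bucket: str) -> str:
    # Build a deterministic S3 location instead of relying on the default one.
    return f"s3://{bucket}/{database}/{table}"

def external_delta_writer(df, s3_path: str):
    """Configure a Delta writer for an *external* table at an explicit S3 path."""
    return (df.write.format("delta")
            .mode("overwrite")
            .option("path", s3_path))

# Usage on a cluster (sketch):
#   external_delta_writer(df, table_location("db", "t", "my-bucket")).saveAsTable("db.t")
```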
SeyedA
by New Contributor
  • 881 Views
  • 1 replies
  • 0 kudos

Debug UDFs using VSCode extension

I am trying to debug my python script using Databricks VSCode extension. I am using udf and pandas_udf in my script. Everything works fine except when the execution gets to the udf and pandas_udf usages. It then complains that "SparkContext or SparkS...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @SeyedA, To resolve this, first, ensure your SparkSession is properly initialized in your script. Be aware of the limitations of Databricks Connect, which might affect UDFs, and consider running UDFs locally in a simple Spark environment for debug...

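One practical pattern for the "run UDFs locally" advice above: keep the UDF body as a plain Python function and defer the Spark wrapping, so the logic is debuggable in VSCode without any SparkContext. A sketch (function names are illustrative):

```python
def clean_name(raw: str) -> str:
    # Core logic kept as a plain function: steppable in a local debugger,
    # no SparkContext or Databricks Connect session required.
    return raw.strip().title()

def as_spark_udf():
    # Only wrap when a real Spark session is available (e.g. on the cluster);
    # imports are deferred so the plain function stays testable without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    return udf(clean_name, StringType())
```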
hpant1
by New Contributor III
  • 645 Views
  • 1 replies
  • 0 kudos

Not able to write in the schema stored in the external location

I have three tables and I am trying to write them to the bronze schema, which is stored in the external location. I am able to write two of them, but for the third one I am getting this error. Not sure why that is the case; I am doing exactly the same thing.

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @hpant1, To fix this, ensure that the SAS token is properly configured with the necessary write permissions and regenerate it if needed. Verify that the storage account is accessible and check for network issues. Confirm that IAM policies grant th...

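For the SAS-token part of the answer, the fixed-SAS configuration for ADLS Gen2 looks roughly like this (account name and token are placeholders; apply each pair with `spark.conf.set(key, value)` before writing, and note the token must carry write and list permissions on the container):

```python
def abfs_sas_conf(account: str, sas_token: str) -> dict:
    """Spark conf entries for fixed-SAS auth against an ADLS Gen2 account."""
    host = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "SAS",
        f"fs.azure.sas.token.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
        f"fs.azure.sas.fixed.token.{host}": sas_token,
    }
```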
hpant
by New Contributor III
  • 6060 Views
  • 5 replies
  • 1 kudos

Autoloader error "Failed to infer schema for format json from existing files in input"

I have two JSON files in a location in Azure Gen2 storage, e.g. '/mnt/abc/Testing/'. When I try to read the files using Auto Loader I get this error: "Failed to infer schema for format json from existing files in input path /mnt/abc...

Latest Reply
holly
Databricks Employee
  • 1 kudos

Hi @hpant would you consider testing the new VARIANT type for your JSON data? I appreciate it will require rewriting the next step in your pipeline, but should be more robust wrt errors.  Disclaimer: I haven't personally tested variant with Autoloade...

4 More Replies
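For context on the error itself: Auto Loader needs a schema location to track inferred schemas, and schema inference fails if the input path holds no readable JSON at stream start. A sketch of the usual option set (paths are placeholders; verify the option values against your runtime):

```python
def autoloader_json_options(schema_location: str) -> dict:
    # Option names follow the cloudFiles (Auto Loader) source.
    return {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
    }

def read_json_stream(spark, input_path: str, schema_location: str):
    # Confirm input_path actually contains valid JSON files before starting;
    # inference over an empty or malformed prefix raises the error above.
    return (spark.readStream.format("cloudFiles")
            .options(**autoloader_json_options(schema_location))
            .load(input_path))
```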
Devsql
by New Contributor III
  • 799 Views
  • 1 replies
  • 1 kudos

For a given Notebook, how to find the calling Job

Hi Team, I came across a situation where I have a Notebook but am not able to find the Job/DLT that calls it. So is there any query or mechanism with which I can find out (or list) the Jobs/scripts that have called a given Notebo...

Data Engineering
Azure Databricks
Latest Reply
Devsql
New Contributor III
  • 1 kudos

Hi @Retired_mod, would you be able to help with the above question?

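One workable answer to the question: list jobs via the Jobs API (`GET /api/2.1/jobs/list` with `expand_tasks=true`) and scan each job's tasks for the notebook path. A sketch of the filtering step over the API's response shape (the HTTP call itself is omitted):

```python
def jobs_calling_notebook(jobs: list, notebook_path: str) -> list:
    """Scan Jobs API list results for tasks that run a given notebook;
    returns the matching job_ids."""
    hits = []
    for job in jobs:
        for task in job.get("settings", {}).get("tasks", []):
            if task.get("notebook_task", {}).get("notebook_path") == notebook_path:
                hits.append(job["job_id"])
    return hits
```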
Rajdeepak
by New Contributor
  • 1947 Views
  • 1 replies
  • 0 kudos

How to restart failed spark stream job from the failure point

I am setting up an ETL process using pyspark. My input is a Kafka stream and I am writing output to multiple sinks (one to Kafka and another to cloud storage). I am writing checkpoints to the cloud storage. The issue I am facing is that, whenever m...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @Rajdeepak, To address data redundancy issues caused by reprocessing during application restarts, consider these strategies: Ensure proper checkpointing by configuring and protecting your checkpoint directory; manage Kafka offsets correctly by set...

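Two concrete pieces of the checkpointing advice, sketched below (table name and field names are placeholders): give each sink its own checkpoint location so Spark resumes each from its own committed offsets, and make the sink idempotent so replayed micro-batches after a restart don't duplicate rows.

```python
def unseen_records(batch: list, seen_ids: set) -> list:
    """Idempotent-sink helper: when a restart replays part of a batch, drop
    events whose event_id was already written and record the rest."""
    fresh = [r for r in batch if r["event_id"] not in seen_ids]
    seen_ids.update(r["event_id"] for r in fresh)
    return fresh

def start_stream(df, checkpoint_dir: str):
    # A separate checkpointLocation per sink; starting two sinks from one
    # checkpoint directory is a common cause of reprocessing on restart.
    return (df.writeStream
            .option("checkpointLocation", checkpoint_dir)
            .toTable("bronze.events"))
```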
reachrishav
by New Contributor II
  • 2002 Views
  • 1 replies
  • 0 kudos

What is the equivalent of "if exists()" in databricks sql?

What is the equivalent of the below SQL Server syntax in Databricks SQL? There are cases where I need to execute a block of SQL code on certain conditions. I know this can be achieved with spark.sql, but the problem with spark.sql() is it does not p...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @reachrishav, In Databricks SQL, you can replicate SQL Server's conditional logic using `CASE` statements and `MERGE` operations. Since Databricks SQL doesn't support `IF EXISTS` directly, you can create a temporary view to check your condition an...

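Where the conditional must live in Python anyway, one way to emulate T-SQL's `IF EXISTS (...) BEGIN ... END` is to probe the predicate first and only then run the action (a sketch; the two SQL strings are placeholders):

```python
def run_if_exists(spark, predicate_sql: str, action_sql: str) -> bool:
    """Run action_sql only when predicate_sql returns at least one row;
    limit(1) keeps the existence probe cheap."""
    if spark.sql(predicate_sql).limit(1).count() > 0:
        spark.sql(action_sql)
        return True
    return False
```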
ADB0513
by New Contributor III
  • 3104 Views
  • 1 replies
  • 0 kudos

Pass variable from one notebook to another

I have a main notebook where I am setting a python variable to the name of the catalog I want to work in.  I then call another notebook, using %run, which runs an insert into using a SQL command where I want to specify the catalog using the catalog v...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @ADB0513, To pass variables between notebooks in Databricks, you can use three main methods: **Widgets**, where you create and retrieve parameters using `dbutils.widgets` in both notebooks; **spark.conf**, where you set and get configuration param...

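A sketch of the widgets option, plus a note on `%run` semantics (widget name, schema, and table names are assumptions for illustration):

```python
def insert_into(catalog: str, schema: str, table: str, source: str) -> str:
    # Build the statement from a variable instead of hard-coding the catalog.
    return f"INSERT INTO {catalog}.{schema}.{table} SELECT * FROM {source}"

# In the called notebook (sketch):
#   dbutils.widgets.text("catalog", "")
#   catalog = dbutils.widgets.get("catalog")
#   spark.sql(insert_into(catalog, "silver", "orders", "staging_orders"))
#
# Note: %run executes the child notebook in the caller's Python namespace, so
# a plain Python variable set in the main notebook is already visible there;
# widgets are the portable option when the child also runs standalone.
```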
semsim
by Contributor
  • 4336 Views
  • 1 replies
  • 1 kudos

List and iterate over files in Databricks workspace

Hi DE Community, I need to be able to list/iterate over a set of files in a specific directory within the Databricks workspace. For example: "/Workspace/SharedFiles/path/to/file_1" ... "/Workspace/SharedFiles/path/to/file_n". Thanks for your direction and ...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @semsim, you can use the file system utility (dbutils.fs):
dbutils.fs.ls("file:/Workspace/Users/<user-folder>/")
See: Databricks Utilities (dbutils) reference | Databricks on AWS, and Work with files on Databricks | Databricks on AWS.

Zeruno
by New Contributor II
  • 2193 Views
  • 1 replies
  • 0 kudos

DLT - Get pipeline_id and update_id

I need to insert pipeline_id and update_id in my Delta Live Table (DLT), the point being to know which pipeline created which row. How can I obtain this information? I know you can get job_id and run_id from widgets but I don't know if these are the s...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @Zeruno, those values are rather static. Maybe you can design a process that, as a first step, extracts this information from the List Pipelines API and saves it to a Delta table (List pipelines | Pipelines API | REST API reference | Databricks on AWS). Then in...

vadi
by New Contributor
  • 818 Views
  • 2 replies
  • 0 kudos

csv file processing

What's the best possible solution to process a CSV file in Databricks? Please consider scalability, optimization, and QA, and give me the best solution...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @vadi, Thanks for reaching out! Please review the response and let us know if it answers your question. Your feedback is valuable to us and the community. If the response resolves your issue, kindly mark it as the accepted solution. This will help...

1 More Replies
Shazaamzaa
by New Contributor III
  • 2151 Views
  • 1 replies
  • 0 kudos

Setup dbt-core with Azure Entra ID

Hey team, I'm trying to standardize the development environment setup in our team. I've written up a shell script that I want our devs to run in WSL2 after setup. The shell script does the following:
1. Set up Azure CLI - install and authenticate
2. Ins...

Latest Reply
Shazaamzaa
New Contributor III
  • 0 kudos

Hey @Retired_mod thanks for the response. I persisted a little more with the logs and the issue appears to be related to WSL2 not having a backend credential manager to handle management of tokens supplied by the OAuth process. To be honest, this is ...

acj1459
by New Contributor
  • 672 Views
  • 0 replies
  • 0 kudos

Azure Databricks Data Load

Hi All, I have 10 tables on an on-prem MS SQL DB and want to load their data incrementally into Bronze Delta tables as append only. From Bronze to Silver, I want to load the latest records into Silver Delta tables using a merge query. Whatever latest...

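No replies yet, so for readers with the same design: the Bronze-to-Silver step is usually a `MERGE INTO` keyed on the business key, after deduplicating the Bronze source to one latest row per key (e.g. with `row_number()` over the update timestamp). A sketch of the statement builder (table, key, and column names are placeholders):

```python
def merge_sql(target: str, source: str, key: str, cols: list) -> str:
    """Build a Bronze->Silver upsert: update matched keys, insert new ones.
    Assumes `source` already holds one latest row per key."""
    sets = ", ".join(f"t.{c} = s.{c}" for c in cols)
    names = ", ".join([key] + cols)
    vals = ", ".join(f"s.{c}" for c in [key] + cols)
    return (f"MERGE INTO {target} t USING {source} s ON t.{key} = s.{key} "
            f"WHEN MATCHED THEN UPDATE SET {sets} "
            f"WHEN NOT MATCHED THEN INSERT ({names}) VALUES ({vals})")
```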
MRTN
by New Contributor III
  • 7023 Views
  • 3 replies
  • 2 kudos

Resolved! Configure multiple source paths for auto loader

I am currently using two streams to monitor data in two different containers on an Azure storage account. Is there any way to configure an autoloader to read from two different locations? The schemas of the files are identical.

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Morten Stakkeland: Yes, it's possible to configure an autoloader to read from multiple locations. You can define multiple CloudFiles sources for the autoloader, each pointing to a different container in the same storage account. In your case, since ...

2 More Replies
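Since the schemas are identical, the accepted approach amounts to one cloudFiles source per container unioned into a single stream. A sketch (container paths are placeholders; note each source gets its own schema-tracking directory):

```python
def read_container(spark, path: str, schema_location: str):
    # One cloudFiles (Auto Loader) source per monitored container.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("cloudFiles.schemaLocation", schema_location)
            .load(path))

def read_both(spark, path_a: str, path_b: str, schema_base: str):
    # Identical schemas, so the two sources union into one DataFrame;
    # give each source its own schema location under schema_base.
    a = read_container(spark, path_a, schema_base + "/a")
    b = read_container(spark, path_b, schema_base + "/b")
    return a.unionByName(b)
```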
