Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Akshay_Petkar
by Valued Contributor
  • 802 Views
  • 1 reply
  • 1 kudos

Resolved! How Auto Loader works – file level or row level?

Does Auto Loader work at the file level or the row level? If it works at the file level and does not process the same file again, then how can we make it pick up only the new rows when data is appended to that file?

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @Akshay_Petkar, Auto Loader works at the file level. By default, Auto Loader is configured with the following option: cloudFiles.allowOverwrites = false. This option causes files to be processed exactly once. But when you switch this option to true, t...
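For reference, the option described in the reply can be set on a stream like this; a minimal sketch with hypothetical paths, assuming a Databricks runtime where `spark` is defined:

```python
# Sketch only: an Auto Loader stream that reprocesses overwritten files.
# The paths below are hypothetical placeholders.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Default is false: each file is picked up exactly once.
    # Setting it to true makes Auto Loader reprocess a file when it is
    # overwritten -- note the whole file is re-read; there is no row-level diff.
    .option("cloudFiles.allowOverwrites", "true")
    .option("cloudFiles.schemaLocation", "/tmp/_schemas/landing")  # hypothetical
    .load("/mnt/landing/")                                         # hypothetical
)
```

In practice, appending rows to an existing file therefore re-ingests the full file; the common workaround is to land each batch of new rows as a new file.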

sensanjoy
by Contributor II
  • 3220 Views
  • 8 replies
  • 1 kudos

Resolved! Accessing parameter defined in python notebook into sql notebook.

Hi All, I have one Python notebook (../../config/param_notebook) where all parameters are defined, like: dbutils.widgets.text("catalog", "catalog_de"); spark.conf.set("catalog.name", dbutils.widgets.get("catalog")); dbutils.widgets.text("schema", "emp"...

Latest Reply
Rupal_P
New Contributor II
  • 1 kudos

Hi all, I have a SQL notebook that contains the following statement: CREATE OR REPLACE MATERIALIZED VIEW ${catalog_name}.${schema_name}.emp_table AS SELECT ... I’ve configured the values for catalog_name and schema_name as pipeline parameters in my DLT p...
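As a related sketch (not from the thread): on recent runtimes, named parameter markers combined with the IDENTIFIER clause can replace ${} text substitution. The names below mirror the post, but verify the syntax against your DBR version:

```sql
-- Sketch: qualify the table name from parameters instead of ${} substitution.
-- :catalog_name and :schema_name would come from widgets or pipeline parameters.
CREATE OR REPLACE MATERIALIZED VIEW
  IDENTIFIER(:catalog_name || '.' || :schema_name || '.emp_table')
AS SELECT ...
```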

7 More Replies
amrim
by New Contributor III
  • 851 Views
  • 1 reply
  • 1 kudos

Resolved! Notebook dashboard export unavailable

Hello, Recent changes in the Databricks notebook dashboards have removed the option to download the dashboard as HTML. Previously it was possible to download it from the notebook dashboard view. Currently it's only possible to download the notebook its...

Latest Reply
Advika
Community Manager
  • 1 kudos

Hello @amrim! You're right to flag this, thank you for bringing it up. I’ll check internally for any upcoming changes regarding this feature or alternative ways to download the notebook dashboard in HTML format. I’ll get back to you once I have an up...

surajtr
by New Contributor
  • 858 Views
  • 1 reply
  • 0 kudos

Reading a large ZIP file containing NDJSON files in Databricks

Hi, We have a 5 GB ZIP file stored in ADLS. When uncompressed, it expands to approximately 115 GB and contains multiple NDJSON files, each around 200 MB in size. We need to read this data and write it to a Delta table in Databricks on a weekly basis. W...

Latest Reply
chetan-mali
Contributor
  • 0 kudos

Unzip the archive file: Apache Spark cannot directly read compressed ZIP archives, so the first step is to decompress the 5 GB file. Since the uncompressed size is substantial (115 GB), the process must be handled carefully to avoid overwhelming the dr...
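The decompression step described above can be sketched with Python's standard zipfile module, streaming each member in fixed-size chunks so memory use stays bounded regardless of member size (paths and names here are illustrative, not from the original reply):

```python
import os
import shutil
import zipfile

def extract_zip_streaming(zip_path: str, dest_dir: str, chunk_size: int = 1 << 20) -> list:
    """Extract every file in a ZIP archive by copying it in fixed-size chunks,
    keeping memory use near chunk_size instead of the member size."""
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            target = os.path.join(dest_dir, os.path.basename(info.filename))
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)  # chunked streaming copy
            extracted.append(target)
    return extracted
```

Each extracted NDJSON file can then be read with spark.read.json and appended to the Delta table.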

sachamourier
by Contributor
  • 1966 Views
  • 5 replies
  • 3 kudos

Resolved! Unable to use library GraphFrames

Hello, I am trying to install and use the GraphFrames library but keep receiving the following error: "AttributeError: 'SparkSession' object has no attribute '_sc'". I have tried to install the library on my all-purpose cluster (Access mode: Standard)....

Latest Reply
sachamourier
Contributor
  • 3 kudos

@szymon_dybczak Thanks for the responses. I indeed changed my all-purpose cluster access mode and it worked. I figured that was a nicer option than changing the runtime.

4 More Replies
jar
by Contributor
  • 2218 Views
  • 2 replies
  • 0 kudos

Resolved! Use of Python variable in SQL cell

If using spark.conf.set(<variable_name>, <variable_value>), or just referencing a widget value directly, in a Python cell and then referring to it in a SQL cell with ${variable_name}, one gets the warning: "SQL query contains a dollar sign parameter, $p...

Latest Reply
jar
Contributor
  • 0 kudos

Frustrating indeed. Thank you, @lingareddy_Alva 

1 More Replies
pavlosskev
by New Contributor III
  • 3448 Views
  • 1 reply
  • 0 kudos

Oracle JDBC Load Fails with Timestamp Partitioning (lowerBound/upperBound)

Hi everyone, I'm trying to read data from an Oracle database into Databricks using JDBC with timestamp-based partitioning. However, it seems that the partitioning doesn't work as expected when I specify lowerBound and upperBound using timestamp string...

Latest Reply
mani_22
Databricks Employee
  • 0 kudos

@pavlosskev Could you try adding the following option as well to your read? .option("sessionInitStatement", "ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS'") df = ( spark.read.format("jdbc") .option("url", jdbcUrl) .opti...
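The truncated snippet might be written out in full roughly as follows; the connection values, table, and partition column are hypothetical placeholders, and this assumes the Oracle JDBC driver is attached to the cluster:

```python
# Sketch only: a partitioned JDBC read with the suggested session-init option.
jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"   # hypothetical

df = (
    spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", "MY_SCHEMA.MY_TABLE")           # hypothetical
    .option("user", "scott")                           # hypothetical
    .option("password", "tiger")                       # hypothetical
    # Make Oracle parse the lowerBound/upperBound strings as timestamps:
    .option("sessionInitStatement",
            "ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS'")
    .option("partitionColumn", "UPDATED_TS")           # hypothetical timestamp column
    .option("lowerBound", "2024-01-01 00:00:00")
    .option("upperBound", "2024-12-31 23:59:59")
    .option("numPartitions", 8)
    .load()
)
```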

Sainath368
by Contributor
  • 1139 Views
  • 1 reply
  • 1 kudos

Resolved! E-series vs F-series VMs

Hi all, I need to run weekly maintenance on approximately 7,000 tables in my Databricks environment, involving OPTIMIZE, VACUUM, and ANALYZE TABLE (for statistics calculation) on all tables. My question is: between the Ev4, Edv4, and Fsv2 VM series, wh...

Latest Reply
mani_22
Databricks Employee
  • 1 kudos

@Sainath368 OPTIMIZE and VACUUM are compute-intensive operations, so you can choose a compute-optimized instance like the F series for both drivers and workers, which has a higher CPU-to-memory ratio. If it's a UC managed table, I recommend enabling Pr...

Eyespoop
by New Contributor II
  • 30787 Views
  • 4 replies
  • 4 kudos

Resolved! PySpark: Writing Parquet Files to the Azure Blob Storage Container

Currently I am having some issues with writing the parquet file to the storage container. I do have the code running, but whenever the DataFrame writer puts the parquet in blob storage, instead of the parquet file type it is created as a f...

Latest Reply
amarv
New Contributor II
  • 4 kudos

This is my approach: from databricks.sdk.runtime import dbutils; from pyspark.sql import DataFrame; output_base_url = "abfss://..."; def write_single_parquet_file(df: DataFrame, filename: str): print(f"Writing '{filename}.parquet' to ABFS") ...
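A helper like this typically writes with coalesce(1) into a temporary folder and then moves the single part-*.parquet file to the final name, since Spark always writes a directory rather than a lone file. The move step can be sketched in plain Python (on ABFSS, dbutils.fs.mv would play this role; all names here are illustrative, not from the reply):

```python
import glob
import os
import shutil

def promote_single_part_file(tmp_dir: str, final_path: str) -> str:
    """After a coalesce(1) write, move the lone part-*.parquet file out of the
    temporary output folder to final_path, then remove the folder."""
    parts = glob.glob(os.path.join(tmp_dir, "part-*.parquet"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.move(parts[0], final_path)
    shutil.rmtree(tmp_dir)  # also drops _SUCCESS / _committed marker files
    return final_path
```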

3 More Replies
yhu126
by New Contributor
  • 943 Views
  • 1 reply
  • 0 kudos

How to create a SparkSession in jobs run-unit-tests

I'm converting my Python unit tests to run with databricks jobs run-unit-tests. Each test needs a SparkSession, but every pattern I try runs into problems. What I tried: 1. Create my own local Spark: spark = (SparkSession.builder.master("local[*]").appName("unit-test").getOr...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @yhu126, Maybe the blog post below gives you some inspiration: Writing Unit Tests for PySpark in Databricks: Appr... - Databricks Community - 122398

nkrom456
by New Contributor III
  • 2437 Views
  • 7 replies
  • 1 kudos

Resolved! Unable to resolve column error while trying to query the view

I have a federated table from Snowflake in Databricks, say employee. When I executed printSchema I am able to see the schema as "employeeid": long, "employeename": string. I tried to create a view as: create view vw_emp with schema binding as select `"employeei...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @nkrom456, Try something like this. If you are using backticks, Spark treats the column name exactly as you type it (in this case it treats the double quotes as part of the column name): create view vw_emp with schema binding as select `employeeid` from employee ...

6 More Replies
RyHubb
by New Contributor III
  • 7155 Views
  • 6 replies
  • 1 kudos

Resolved! Databricks asset bundles job and pipeline

Hello, I'm looking to create a job which is linked to a delta live table. Given the job code like this:
my_job_name:
  name: thejobname
  schedule:
    quartz_cron_expression: 56 30 12 * * ?
    timezone_id: UTC
    pause_stat...

Latest Reply
Laurens1
New Contributor II
  • 1 kudos

This ended a frustrating search! It would be great to add this to the documentation instead of "go to the portal and copy-paste the id"!

5 More Replies
noorbasha534
by Valued Contributor II
  • 661 Views
  • 1 reply
  • 2 kudos

Machine type for different operations in Azure Databricks

Dear all, do we have a general recommendation for the virtual machine type to be used for different operations in Azure Databricks? We are looking for the below: 1. VACUUM 2. OPTIMIZE 3. ANALYZE STATS 4. DESCRIBE TABLE HISTORY. I understood at a high lev...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Hi @noorbasha534, Here's a general recommendation from Databricks: they recommend running OPTIMIZE on compute-optimized VMs and VACUUM on general-purpose VMs. Comprehensive Guide to Optimize Data Workloads | Databricks. But as you said, VACUUM is co...

xhead
by New Contributor II
  • 29521 Views
  • 15 replies
  • 3 kudos

Does "databricks bundle deploy" clean up old files?

I'm looking at this page (Databricks Asset Bundles development work tasks) in the Databricks documentation. When repo assets are deployed to a Databricks workspace, it is not clear whether "databricks bundle deploy" will remove files from the target wo...

Data Engineering
bundle
cli
deploy
Latest Reply
ganapati
New Contributor III
  • 3 kudos

@JamesGraham this issue is related to the "databricks bundle deploy" command itself. When run inside a CI/CD pipeline, I am still seeing old configs in bundle.tf.json. Ideally it should be updated with the changes from the previous run, but I am still seeing er...

14 More Replies
Aidonis
by New Contributor III
  • 26011 Views
  • 4 replies
  • 4 kudos

Resolved! Load Data from Sharepoint Site to Delta table in Databricks

Hi, new to the community, so sorry if my post lacks detail. I am trying to create a connection between Databricks and a SharePoint site to read Excel files into a Delta table. I can see there is a FiveTran partner connection that we can use to get sharepo...

Latest Reply
gaurav_singh_14
New Contributor II
  • 4 kudos

@Ajay-Pandey can we connect using a user ID, without using a client ID and secrets?

3 More Replies