Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

confused_dev
by New Contributor II
  • 42603 Views
  • 7 replies
  • 5 kudos

Python mocking dbutils in unittests

I am trying to write some unit tests using pytest, but I am coming across the problem of how to mock my dbutils method when dbutils isn't being defined in my notebook. Is there a way to do this so that I can unit test individual functions that are uti...

Latest Reply
pavlosskev
New Contributor III
  • 5 kudos

Fermin_vicente's answer is pretty good already. Below is how you can do something similar with conftest.py: # conftest.py import pytest from unittest.mock import MagicMock from pyspark.sql import SparkSession @pytest.fixture(scope="session") def dbuti...
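For reference, a minimal, self-contained sketch of that conftest.py approach (the fixture names and the stubbed secrets call are illustrative, not the poster's exact code):

# conftest.py -- shared pytest fixtures (sketch)
import pytest
from unittest.mock import MagicMock
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local SparkSession so functions under test can run outside Databricks
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

@pytest.fixture(scope="session")
def dbutils():
    # dbutils does not exist outside Databricks, so stand in a MagicMock
    mock = MagicMock()
    mock.secrets.get.return_value = "fake-secret"  # illustrative stub
    return mock

A test can then accept dbutils as a fixture argument, pass it into the function under test, and assert on the mock, e.g. dbutils.secrets.get.assert_called_once().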

6 More Replies
johnb1
by Contributor
  • 34007 Views
  • 16 replies
  • 15 kudos

Problems with pandas.read_parquet() and path

I am doing the "Data Engineering with Databricks V2" learning path. I cannot run "DE 4.2 - Providing Options for External Sources", as the first code cell does not run successfully: %run ../Includes/Classroom-Setup-04.2 Screenshot 1: Inside the setup note...
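One hedged note on the pandas.read_parquet() part of this: pandas reads through the local filesystem, so a dbfs:/ URI typically has to be rewritten as a /dbfs/ FUSE path (or read with Spark instead). A minimal sketch with a hypothetical dataset path:

import pandas as pd

spark_path = "dbfs:/mnt/training/example/users.parquet"   # hypothetical path from the course setup
fuse_path = spark_path.replace("dbfs:/", "/dbfs/")         # FUSE form that pandas can open

pdf = pd.read_parquet(fuse_path)       # pandas via the local /dbfs mount
sdf = spark.read.parquet(spark_path)   # equivalent read with Spark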

Latest Reply
hebied
New Contributor II
  • 15 kudos

Thanks for sharing, bro. It really helped.

15 More Replies
SRK
by Contributor III
  • 5620 Views
  • 5 replies
  • 7 kudos

How to handle schema validation for a JSON file using Databricks Auto Loader?

Following are the details of the requirement:
1. I am using a Databricks notebook to read data from a Kafka topic and write it into an ADLS Gen2 container, i.e., my landing layer.
2. I am using Spark code to read data from Kafka and write into landing...
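For the schema-validation part of this requirement, a minimal Auto Loader sketch (the storage account, containers, checkpoint/schema locations, target table, and column hints are placeholders): records that don't match the inferred or hinted schema carry their original JSON in the _rescued_data column, which can be routed to a quarantine table.

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "abfss://landing@<account>.dfs.core.windows.net/_schemas/events")
      .option("cloudFiles.schemaHints", "event_id BIGINT, event_ts TIMESTAMP")   # hypothetical columns
      .load("abfss://landing@<account>.dfs.core.windows.net/events/"))

# Rows whose fields did not fit the schema are captured in _rescued_data
quarantine = df.filter("_rescued_data IS NOT NULL")

(df.writeStream
   .option("checkpointLocation", "abfss://landing@<account>.dfs.core.windows.net/_checkpoints/events")
   .trigger(availableNow=True)
   .toTable("bronze.kafka_events"))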

Latest Reply
maddy08
New Contributor II
  • 7 kudos

Just to clarify, are you reading from Kafka and writing into ADLS as JSON files? I.e., is each message from Kafka one JSON file in ADLS?

4 More Replies
rubenesanchez
by New Contributor II
  • 7936 Views
  • 6 replies
  • 0 kudos

How to dynamically pass a string parameter to a Delta Live Tables pipeline when calling from Azure Data Factory using the REST API

I want to pass some context information to the Delta Live Tables pipeline when calling from Azure Data Factory. I know the body of the API call supports the Full Refresh parameter, but I wonder if I can add my own custom parameters and how this can be re...

Latest Reply
BLM
New Contributor II
  • 0 kudos

In case this helps anyone, I could only use the refresh_selection parameter, setting it to [] by default. Then, in the notebook, I derived the custom parameter values from the refresh_selection value.
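For anyone wiring this up, a sketch of the call that starts the pipeline update (shown in Python for clarity; in ADF this would be the Web activity body). The host, pipeline ID, and token are placeholders, and refresh_selection is the only field being repurposed to carry context, as described above:

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace URL
pipeline_id = "<pipeline-id>"
token = "<databricks-token>"

resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    # full_refresh is the documented flag; refresh_selection normally lists tables to refresh
    json={"full_refresh": False, "refresh_selection": []},
)
resp.raise_for_status()
print(resp.json())   # includes the update_id of the triggered run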

5 More Replies
NickCBZ
by New Contributor III
  • 1301 Views
  • 1 reply
  • 0 kudos

Resolved! AWS Config cost increased after switching to Job Compute

I was looking for opportunities to decrease the cost of my Databricks ETLs and, following the documentation, I started to use Job Compute for my ETLs. Earlier, I used only all-purpose compute to do the ETLs because I needed them to run every 10 minutes...

Latest Reply
NickCBZ
New Contributor III
  • 0 kudos

If someone has this problem in the future, the solution is simple: just disable AWS Config. That's all.

Confused
by New Contributor III
  • 54305 Views
  • 6 replies
  • 3 kudos

Resolved! Configuring pip index-url and using artifacts-keyring

Hi, I would like to use the Azure Artifacts feed as my default index-url when doing a pip install on a Databricks cluster. I understand I can achieve this by updating the pip.conf file with my artifact feed as the index-url. Does anyone know where I...

Latest Reply
murtazahzaveri
New Contributor II
  • 3 kudos

For authentication, you can provide the below config in the cluster's Spark environment variables: PIP_EXTRA_INDEX_URL=https://username:password@pkgs.sample.com/sample/_packaging/artifactory_name/pypi/simple/. Also, you can store the value in a Databricks secret.
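As a notebook-scoped variation on the same idea, a sketch that pulls the feed credential from a Databricks secret and lets pip pick it up from the environment (the secret scope/key, feed URL, and package name are placeholders):

import os
import subprocess
import sys

token = dbutils.secrets.get(scope="artifact-feed", key="feed-token")   # hypothetical scope/key
os.environ["PIP_EXTRA_INDEX_URL"] = (
    f"https://user:{token}@pkgs.sample.com/sample/_packaging/artifactory_name/pypi/simple/"
)

# pip reads PIP_EXTRA_INDEX_URL from the environment of the process that runs it
subprocess.check_call([sys.executable, "-m", "pip", "install", "my-private-package"])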

5 More Replies
Brad
by Contributor II
  • 2089 Views
  • 2 replies
  • 0 kudos

Why is the delta log checkpoint created in different formats?

Hi, I'm using runtime 15.4 LTS or 14.3 LTS. When loading a Delta Lake table from Kinesis, I found the delta log checkpoints are in mixed formats like: 7616 00000000000003291896.checkpoint.b1c24725-....json 7616 00000000000003291906.checkpoint.873e1b3e-....

Latest Reply
Brad
Contributor II
  • 0 kudos

Thanks. We use a job to load data from Kinesis to a Delta table. I added spark.databricks.delta.checkpoint.writeFormat parquet and spark.databricks.delta.checkpoint.writeStatsAsStruct true in the job cluster, but the checkpoints still show different formats...
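For anyone comparing notes, these settings would be applied either in the job cluster's Spark config or at runtime before the write; a short sketch (the table path is a placeholder, and the config names are taken from the reply above rather than independently verified):

# Session-level version of the job-cluster Spark config from this thread
spark.conf.set("spark.databricks.delta.checkpoint.writeFormat", "parquet")
spark.conf.set("spark.databricks.delta.checkpoint.writeStatsAsStruct", "true")

# Inspect which checkpoint files the writer is actually producing
for f in dbutils.fs.ls("s3://my-bucket/my_table/_delta_log/"):   # hypothetical table location
    if "checkpoint" in f.name:
        print(f.name)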

1 More Replies
billykimber
by New Contributor
  • 694 Views
  • 1 reply
  • 0 kudos

Datamart creation

In a scenario where multiple teams access overlapping but not identical datasets from a shared data lake, is it better to create separate datamarts for each team (despite data redundancy) or to maintain a single datamart and use views for team-specif...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

IMO there is no single best scenario. It depends on the case, I would say. Both have pros and cons. If the difference between teams is really small, views could be a solution. But on the other hand, if you work on massive data, the views first have to b...

pankajshaw
by New Contributor II
  • 1427 Views
  • 2 replies
  • 3 kudos

Duplicates in CSV Export to ADLS

Hello everyone, I'm facing an issue when writing data in CSV format to Azure Data Lake Storage (ADLS). Before writing, there are no duplicates in the DataFrame, and all the records look correct. However, after writing the CSV files to ADLS, I notice d...
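A hedged way to rule out the usual suspects (a non-deterministic upstream transform being recomputed per output file, or stale part files left in the target folder) is to deduplicate explicitly on a business key and overwrite the destination; the key column and ADLS path below are placeholders:

deduped = df.dropDuplicates(["order_id"]).persist()   # hypothetical business key; persist avoids recomputation

(deduped
    .coalesce(1)                                      # optional: a single CSV part file
    .write.mode("overwrite")
    .option("header", True)
    .csv("abfss://exports@<account>.dfs.core.windows.net/orders_csv/"))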

Latest Reply
bhanu_gautam
Valued Contributor III
  • 3 kudos

@Kaniz_Fatma, great explanation.

1 More Replies
L1000
by New Contributor III
  • 1848 Views
  • 4 replies
  • 2 kudos

DLT Serverless incremental refresh of materialized view

I have a materialized view that always does a "COMPLETE_RECOMPUTE", but I can't figure out why. I found how I can get the logs: SELECT * FROM event_log(pipeline_id) WHERE event_type = 'planning_information' ORDER BY timestamp DESC; And for my table...

Latest Reply
L1000
New Contributor III
  • 2 kudos

I split up the materialized view into 3 separate ones. Step 1: @dlt.table(name="step1", table_properties={"delta.enableRowTracking": "true"}) def step1(): isolate_names = dlt.read("soruce").select("Name").groupBy("Name").count() return isolate_names st...
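A cleaned-up sketch of that pattern (the source table name is a placeholder; the decorator is dlt.table from the Delta Live Tables Python API):

import dlt

@dlt.table(
    name="step1",
    table_properties={"delta.enableRowTracking": "true"},   # row tracking helps the planner refresh incrementally
)
def step1():
    return dlt.read("source_table").select("Name").groupBy("Name").count()

Splitting the original materialized view into smaller steps like this, each with row tracking enabled, is the approach the poster landed on.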

3 More Replies
RobDineen
by Contributor
  • 2233 Views
  • 2 replies
  • 2 kudos

Resolved! %SQL delete from temp table driving me mad

Hello there, I have a temp table where I want to remove null / empty values (see below). If there are no rows to delete, then shouldn't it just say zero rows affected?

Latest Reply
daniel_sahal
Esteemed Contributor
  • 2 kudos

@RobDineen This should answer your question: https://community.databricks.com/t5/get-started-discussions/how-to-create-temporary-table-in-databricks/m-p/67774/highlight/true#M2956 Long story short, don't use it.
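If the data genuinely needs to be mutable, a hedged alternative to a temp view is to materialize it as a Delta table first, since DELETE is only supported on Delta tables; the schema, table, and column names below are placeholders:

# Materialize the working data as a real Delta table instead of a temp view
df.write.format("delta").mode("overwrite").saveAsTable("scratch.temp_cleanup")

# DELETE works on Delta tables; on recent runtimes the result reports the rows affected
spark.sql("DELETE FROM scratch.temp_cleanup WHERE some_col IS NULL OR some_col = ''").show()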

1 More Replies
sandy311
by New Contributor III
  • 12893 Views
  • 7 replies
  • 4 kudos

Resolved! Databricks asset bundle does not create a new job if I change the configuration of an existing Databricks yaml

When deploying multiple jobs using the `Databricks.yml` file via the asset bundle, the process either overwrites the same job or renames it, instead of creating separate, distinct jobs.

Latest Reply
Ncolin1999
New Contributor II
  • 4 kudos

@filipniziol my requirement is to just deploy notebooks in the Databricks workspace. I don't want to create any job. Can I still use Databricks Asset Bundles?

6 More Replies
Stephanos
by New Contributor
  • 1925 Views
  • 1 reply
  • 0 kudos

Sequencing Job Deployments with Databricks Asset Bundles

Hello Databricks Community! I'm working on a project where I need to deploy jobs in a specific sequence using Databricks Asset Bundles. Some of my jobs (let's call them coordination jobs) depend on other jobs (base jobs) and need to look up their job ...

Latest Reply
MohcineRouessi
New Contributor II
  • 0 kudos

Hey Steph, have you found anything here, please? I'm currently stuck here, trying to achieve the same thing.
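Not a bundle-native answer, but one hedged workaround for the job-ID lookup part is to resolve the base job at runtime by name with the Databricks SDK (the job name below is hypothetical, and it assumes job names are unique in the workspace):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()   # picks up authentication from the environment / notebook context

base_jobs = list(w.jobs.list(name="base-ingestion-job"))   # hypothetical base job name
if not base_jobs:
    raise ValueError("Base job not found; deploy the base bundle first")
base_job_id = base_jobs[0].job_id
print(f"Coordination job will reference job_id={base_job_id}")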

amelia1
by New Contributor II
  • 2378 Views
  • 2 replies
  • 0 kudos

pyspark read data using jdbc url returns column names only

Hello, I have a remote Azure SQL warehouse serverless instance that I can access using databricks-sql-connector. I can read/write/update tables no problem. But I'm also trying to read/write/update tables using local pyspark + jdbc drivers. But when I ...

Latest Reply
infodeliberatel
New Contributor II
  • 0 kudos

I added `UseNativeQuery=0` to the URL. It works for me.
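For context, a minimal local PySpark sketch of that fix (the workspace host, HTTP path, token, and table are placeholders; the relevant part is UseNativeQuery=0 appended to the JDBC URL):

jdbc_url = (
    "jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443/default;"
    "transportMode=http;ssl=1;AuthMech=3;UID=token;PWD=<personal-access-token>;"
    "httpPath=/sql/1.0/warehouses/<warehouse-id>;UseNativeQuery=0"
)

df = (spark.read.format("jdbc")
      .option("driver", "com.databricks.client.jdbc.Driver")
      .option("url", jdbc_url)
      .option("dbtable", "samples.nyctaxi.trips")   # any table you can query from the warehouse
      .load())

df.show(5)   # should return rows rather than just the column names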

1 More Replies
RobertWalsh
by New Contributor II
  • 11229 Views
  • 7 replies
  • 2 kudos

Resolved! Hive Table Creation - Parquet does not support Timestamp Datatype?

Good afternoon, Attempting to run this statement: %sql CREATE EXTERNAL TABLE IF NOT EXISTS dev_user_login ( event_name STRING, datetime TIMESTAMP, ip_address STRING, acting_user_id STRING ) PARTITIONED BY (date DATE) STORED AS PARQUET ...

Latest Reply
source2sea
Contributor
  • 2 kudos

1. Changing to the Spark-native catalog approach (not the Hive metastore) works. The syntax is essentially: CREATE TABLE IF NOT EXISTS dbName.tableName (column names and types) USING parquet PARTITIONED BY (runAt STRING) LOCA...
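A fuller sketch of that Spark-native syntax, wrapped in spark.sql for a notebook (the schema name and storage location are placeholders, and the columns mirror the original post):

spark.sql("""
    CREATE TABLE IF NOT EXISTS dev.dev_user_login (
        event_name     STRING,
        datetime       TIMESTAMP,   -- TIMESTAMP works with the Spark-native Parquet path
        ip_address     STRING,
        acting_user_id STRING
    )
    USING parquet
    PARTITIONED BY (date DATE)
    LOCATION 'abfss://lake@<account>.dfs.core.windows.net/dev_user_login'
""")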

6 More Replies
