Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

Brad
by Contributor
  • 183 Views
  • 2 replies
  • 0 kudos

Why is the delta log checkpoint created in different formats?

Hi, I'm using runtime 15.4 LTS or 14.3 LTS. When loading a delta lake table from Kinesis, I found the delta log checkpoint is in mixed formats like: 7616 00000000000003291896.checkpoint.b1c24725-....json 7616 00000000000003291906.checkpoint.873e1b3e-....

Latest Reply
Brad
Contributor

Thanks. We use a job to load data from Kinesis to the delta table. I added spark.databricks.delta.checkpoint.writeFormat parquet and spark.databricks.delta.checkpoint.writeStatsAsStruct true in the job cluster, but the checkpoints still show different formats...
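For reference, a minimal sketch of applying those two settings from a notebook on the job cluster (the config keys are the ones quoted above; using spark.conf.set rather than the cluster Spark config is an assumption):

    # Assumed approach: set the Delta checkpoint options quoted in the reply above.
    # They can also go in the job cluster's Spark config; behavior may still
    # differ across runtime versions.
    spark.conf.set("spark.databricks.delta.checkpoint.writeFormat", "parquet")
    spark.conf.set("spark.databricks.delta.checkpoint.writeStatsAsStruct", "true")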

1 More Replies
billykimber
by New Contributor
  • 97 Views
  • 1 replies
  • 0 kudos

Datamart creation

In a scenario where multiple teams access overlapping but not identical datasets from a shared data lake, is it better to create separate datamarts for each team (despite data redundancy) or to maintain a single datamart and use views for team-specif...

Latest Reply
-werners-
Esteemed Contributor III

IMO there is no single best scenario. It depends on the case, I would say. Both have pros and cons. If the difference between teams is really small, views could be a solution. But on the other hand, if you work on massive data, the views first have to b...
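As a rough illustration of the views option (the schema, table, and filter column below are hypothetical):

    # Sketch only: expose a team-specific slice of a shared table as a view
    # instead of copying the data into a separate datamart. Names are made up.
    spark.sql("""
        CREATE OR REPLACE VIEW analytics.sales_team_a AS
        SELECT *
        FROM lake.sales
        WHERE team = 'team_a'   -- team-specific filter
    """)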

pankajshaw
by New Contributor II
  • 166 Views
  • 2 replies
  • 3 kudos

Duplicates in CSV Export to ADLS

Hello everyone, I'm facing an issue when writing data in CSV format to Azure Data Lake Storage (ADLS). Before writing, there are no duplicates in the DataFrame, and all the records look correct. However, after writing the CSV files to ADLS, I notice d...

Latest Reply
bhanu_gautam
New Contributor II

@Kaniz_Fatma, great explanation.

1 More Replies
L1000
by New Contributor III
  • 186 Views
  • 4 replies
  • 2 kudos

DLT Serverless incremental refresh of materialized view

I have a materialized view that always does a "COMPLETE_RECOMPUTE", but I can't figure out why. I found how I can get the logs:

SELECT * FROM event_log(pipeline_id) WHERE event_type = 'planning_information' ORDER BY timestamp desc;

And for my table...

Latest Reply
L1000
New Contributor III

I split up the materialized view into 3 separate ones:

step1:

@dlt.table(name="step1", table_properties={"delta.enableRowTracking": "true"})
def step1():
    isolate_names = dlt.read("source").select("Name").groupBy("Name").count()
    return isolate_names

st...
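A self-contained version of that first step might look roughly like this (a sketch only: the import, source dataset name, and column are assumptions filled in around the snippet above, and the remaining steps are not shown):

    # Sketch of the pattern described above: a DLT table with row tracking
    # enabled, which downstream steps can build on. Names are placeholders.
    import dlt

    @dlt.table(name="step1", table_properties={"delta.enableRowTracking": "true"})
    def step1():
        # distinct names with their counts, read from an upstream DLT dataset
        return dlt.read("source").select("Name").groupBy("Name").count()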

3 More Replies
RobDineen
by New Contributor II
  • 200 Views
  • 2 replies
  • 2 kudos

Resolved! %SQL delete from temp table driving me mad

Hello there, I have a temp table where I want to remove null / empty values (see below). If there are no rows to delete, then shouldn't it just say zero rows affected?

Latest Reply
daniel_sahal
Esteemed Contributor

@RobDineen This should answer your question: https://community.databricks.com/t5/get-started-discussions/how-to-create-temporary-table-in-databricks/m-p/67774/highlight/true#M2956
Long story short, don't use it.
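If the "temp table" here is actually a temporary view, DELETE won't work on it because a view isn't a Delta table. A common workaround (a sketch with hypothetical names, not something stated in the linked thread) is to redefine the view without the unwanted rows:

    # Sketch only: instead of DELETE, recreate the temporary view filtering out
    # null/empty values. View and column names are placeholders.
    spark.sql("""
        CREATE OR REPLACE TEMP VIEW my_temp_clean AS
        SELECT *
        FROM my_temp
        WHERE some_col IS NOT NULL AND some_col <> ''
    """)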

1 More Replies
sandy311
by New Contributor III
  • 2160 Views
  • 7 replies
  • 3 kudos

Resolved! Databricks asset bundle does not create new job if I change configuration of existing Databricks yaml

When deploying multiple jobs using the `Databricks.yml` file via the asset bundle, the process either overwrites the same job or renames it, instead of creating separate, distinct jobs.

Latest Reply
Ncolin1999
New Contributor II

@filipniziol my requirement is just to deploy notebooks to the Databricks workspace. I don't want to create any job. Can I still use Databricks asset bundles?

6 More Replies
Tamizh035
by New Contributor II
  • 201 Views
  • 2 replies
  • 1 kudos

[INSUFFICIENT_PERMISSIONS] Insufficient privileges:

While reading a csv file using Spark and listing the files under a folder using Databricks utils, I am getting the below error: [INSUFFICIENT_PERMISSIONS] Insufficient privileges: User does not have permission SELECT on any file. SQLSTATE: 42501 File <comma...

Latest Reply
Panda
Valued Contributor

@Tamizh035, is your file in DBFS, an external location, or a local folder? Use dbutils.fs.ls to verify that the path exists and that you have access:

files = dbutils.fs.ls("dbfs:/path_to_your_file/")
display(files)
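If the path itself is readable, another thing worth checking (an assumption based on the wording of the error, not something confirmed in this thread) is whether the user has the ANY FILE privilege in the hive_metastore permission model; an admin would grant it roughly like this, with the principal as a placeholder:

    # Assumption: the "SELECT on any file" error can come from a missing ANY FILE
    # grant when reading files directly. Run as an admin; replace the principal.
    spark.sql("GRANT SELECT ON ANY FILE TO `user@example.com`")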

1 More Replies
Adrianj
by New Contributor III
  • 8342 Views
  • 12 replies
  • 8 kudos

Databricks Bundles - How to select which jobs resources to deploy per target?

Hello, My team and I are experimenting with bundles, we follow the pattern of having one main file Databricks.yml and each job definition specified in a separate yaml for modularization. We wonder if it is possible to select from the main Databricks....

Latest Reply
sergiopolimante
New Contributor II

"This include array can appear only as a top-level mapping." - you can't use include inside targets. You can use sync - exclude to exclude the yml files, but if they are in the include the workflows are going to be created anyway, even if the yml fil...

11 More Replies
Stephanos
by New Contributor
  • 448 Views
  • 1 replies
  • 0 kudos

Sequencing Job Deployments with Databricks Asset Bundles

Hello Databricks Community! I'm working on a project where I need to deploy jobs in a specific sequence using Databricks Asset Bundles. Some of my jobs (let's call them coordination jobs) depend on other jobs (base jobs) and need to look up their job ...

Latest Reply
MohcineRouessi
New Contributor II

Hey Steph, have you found anything here, please? I'm currently stuck here, trying to achieve the same thing.

amelia1
by New Contributor II
  • 1020 Views
  • 2 replies
  • 0 kudos

pyspark read data using jdbc url returns column names only

Hello, I have a remote azure sql warehouse serverless instance that I can access using databricks-sql-connector. I can read/write/update tables no problem. But I'm also trying to read/write/update tables using local pyspark + jdbc drivers. But when I ...

Latest Reply
infodeliberatel
New Contributor II

I added `UseNativeQuery=0` to the URL. It works for me.
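For anyone hitting the same thing, a rough sketch of a local PySpark JDBC read with UseNativeQuery=0 appended to the connection URL, as suggested above; the host, HTTP path, token, driver class, and table name are placeholders/assumptions rather than details from this thread:

    # Sketch: local Spark session reading a table from a Databricks SQL warehouse
    # over JDBC. All connection details below are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

    jdbc_url = (
        "jdbc:databricks://<workspace-host>:443/default;"
        "transportMode=http;ssl=1;httpPath=<http-path>;"
        "AuthMech=3;UID=token;PWD=<personal-access-token>;"
        "UseNativeQuery=0"  # the option mentioned in the reply above
    )

    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("driver", "com.databricks.client.jdbc.Driver")
        .option("dbtable", "my_schema.my_table")  # placeholder table
        .load()
    )
    df.show()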

1 More Replies
gilt
by New Contributor
  • 104 Views
  • 1 replies
  • 0 kudos

Auto Loader ignores data with modifiedBefore

Hello, I am trying to ingest CSV data with Auto Loader from an Azure Data Lake. I want to perform batch ingestion by using a scheduled job and the following trigger:  .trigger(availableNow=True) The CSV files are generated by Azure Synapse Link. If m...

Latest Reply
Brahmareddy
Valued Contributor III

Hi @gilt, how are you doing today? As per my understanding, consider adjusting the Auto Loader configuration, since the modifiedBefore option seems to mark the file as processed during the first trigger, even if it's incomplete. This behavior might be e...
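For context, a rough sketch of the setup being discussed (paths, schema and checkpoint locations, the target table, and the timestamp value are all placeholders; modifiedBefore and the availableNow trigger come from the original post):

    # Sketch: Auto Loader reading CSVs in a scheduled batch-style run with
    # availableNow, restricted by modifiedBefore. All locations are placeholders.
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", "abfss://<container>@<account>.dfs.core.windows.net/_schemas/my_feed")
        .option("modifiedBefore", "2025-01-01 00:00:00")  # placeholder cutoff timestamp
        .load("abfss://<container>@<account>.dfs.core.windows.net/raw/my_feed/")
    )

    (
        df.writeStream
        .option("checkpointLocation", "abfss://<container>@<account>.dfs.core.windows.net/_checkpoints/my_feed")
        .trigger(availableNow=True)
        .toTable("bronze.my_feed")  # placeholder target table
    )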

Phuonganh
by New Contributor II
  • 1249 Views
  • 2 replies
  • 4 kudos

Databricks SDK for Python: Errors with parameters for Statement Execution

Hi team, I'm using the Databricks SDK for Python to run SQL queries. I created a variable as below: param = [{'name': 'a', 'value': 'x'}, {'name': 'b', 'value': 'y'}] and passed it to the statement as below: _ = w.statement_execution.execute_statement( warehous...

Latest Reply
jessica3
New Contributor II

Has anyone found the solution to this? I am running into the same error
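In case it helps anyone landing here, a sketch of how named parameters are typically passed with the Databricks SDK for Python, using StatementParameterListItem objects instead of plain dicts (the warehouse ID and query are placeholders, and this is a guess at the cause of the error, not a confirmed fix for this thread):

    # Sketch: execute_statement with typed parameter objects rather than dicts.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.sql import StatementParameterListItem

    w = WorkspaceClient()
    resp = w.statement_execution.execute_statement(
        warehouse_id="<warehouse-id>",          # placeholder
        statement="SELECT :a AS a, :b AS b",    # named markers match parameter names
        parameters=[
            StatementParameterListItem(name="a", value="x"),
            StatementParameterListItem(name="b", value="y"),
        ],
    )
    print(resp.status)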

1 More Replies
Rik
by New Contributor III
  • 5029 Views
  • 11 replies
  • 9 kudos

Resolved! File information is not passed to trigger job on file arrival

We are using the UC mechanism for triggering jobs on file arrival, as described here: https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/file-arrival-triggers.Unfortunately, the trigger doesn't actually pass the file-path that is gener...

Labels: Data Engineering, file arrival, trigger file, Unity Catalog
Latest Reply
artemich
New Contributor II

Same here! Additionally, it would be great to enhance it to support not just the path to a directory, but also a prefix for the file name (or a regex for bonus points). Right now, if you have 10 types of files arriving in the same folder, it would be much c...

10 More Replies
Abel_Martinez
by Contributor
  • 12510 Views
  • 9 replies
  • 10 kudos

Resolved! Why I'm getting connection timeout when connecting to MongoDB using MongoDB Connector for Spark 10.x from Databricks

I'm able to connect to MongoDB using org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 and this code: df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", jdbcUrl). It works well, but if I install the latest MongoDB Spark Connector ve...

Latest Reply
ravisharma1024
New Contributor II

I was facing the same issue; now it is resolved, thanks to @Abel_Martinez. I am using code like the below:

df = spark.read.format("mongodb") \
    .option('spark.mongodb.read.connection.uri', "mongodb+srv://*****:*****@******/?retryWrites=true&w=majori...
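A slightly fuller sketch of the 10.x connector usage shown above (the URI, database, and collection are placeholders):

    # Sketch: MongoDB Spark Connector 10.x uses the short "mongodb" format name
    # and the spark.mongodb.read.* options. All connection details are placeholders.
    df = (
        spark.read.format("mongodb")
        .option("spark.mongodb.read.connection.uri",
                "mongodb+srv://<user>:<password>@<cluster-host>/?retryWrites=true&w=majority")
        .option("database", "my_db")            # placeholder
        .option("collection", "my_collection")  # placeholder
        .load()
    )
    df.printSchema()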

8 More Replies
RobertWalsh
by New Contributor II
  • 8157 Views
  • 7 replies
  • 2 kudos

Resolved! Hive Table Creation - Parquet does not support Timestamp Datatype?

Good afternoon, attempting to run this statement:

%sql
CREATE EXTERNAL TABLE IF NOT EXISTS dev_user_login (
  event_name STRING,
  datetime TIMESTAMP,
  ip_address STRING,
  acting_user_id STRING
)
PARTITIONED BY (date DATE)
STORED AS PARQUET
...

Latest Reply
source2sea
Contributor

1. Changing to the Spark native catalog approach (not the Hive metastore) works. Syntax is essentially:

CREATE TABLE IF NOT EXISTS dbName.tableName (
  <column names and types>
)
USING parquet
PARTITIONED BY (runAt STRING)
LOCA...
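Filling that pattern in with the columns from the original post, a sketch of the USING PARQUET form might look like this (the database name and LOCATION are placeholders, since the reply above is truncated):

    # Sketch: Spark-native CREATE TABLE ... USING PARQUET, as suggested in the
    # reply above, keeping the TIMESTAMP column from the original post.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS dev.dev_user_login (
            event_name STRING,
            datetime TIMESTAMP,
            ip_address STRING,
            acting_user_id STRING,
            date DATE
        )
        USING PARQUET
        PARTITIONED BY (date)
        LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/dev_user_login'
    """)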

6 More Replies
