Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Brad
by Contributor II
  • 737 Views
  • 2 replies
  • 0 kudos

Can I have a sequence guarantee when replicating with CDF?

Hi team, I have a Delta table src, and somehow I want to replicate it to another table tgt with CDF, sort of (spark.readStream.format("delta").option("readChangeFeed", "true").table('src').writeStream.format("delta") ...

Latest Reply
Brad
Contributor II
  • 0 kudos

Thanks. If the replicated table can have the _commit_version in strict sequence, I can take it as a global, ever-incrementing column and consume the delta of it (e.g. in a batch way) with select * from replicated_tgt where _commit_version > ( select la...

1 More Reply
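
For illustration, a minimal sketch of the CDF-based replication discussed in this thread (not the poster's exact code): the table names, checkpoint path, and change-type filter are assumptions.

from pyspark.sql import functions as F

(spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .table("src")
    # keep only inserts and post-update images; adjust if deletes must be replicated too
    .filter(F.col("_change_type").isin("insert", "update_postimage"))
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/tgt_cdf")  # placeholder checkpoint path
    .trigger(availableNow=True)
    .toTable("tgt"))

The change feed adds the _change_type, _commit_version and _commit_timestamp metadata columns, which is what the "_commit_version > last seen" consumption pattern in the reply relies on.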
jorgemarmol
by New Contributor II
  • 6910 Views
  • 10 replies
  • 2 kudos

Delta Live Tables: Too much time to do the "setting up"

Hello community! Recently I have been working with Delta Live Tables for a big project. My team and I have studied a lot, and finally we have built a good pipeline with CDC that loads 608 entities (and, therefore, 608 delta live tables and 608 mat...

(screenshot attached)
Latest Reply
DataEngineer
New Contributor II
  • 2 kudos

Increase the worker and driver to a higher configuration on the pipeline. The setup phase will still take time initially, but once the setup is completed, the ingestion will be faster. This is where you can save the hour that ingestion took.

9 More Replies
camilo_s
by Contributor
  • 7271 Views
  • 10 replies
  • 9 kudos

Git credentials for service principals running Jobs

I know the documentation for setting up Git credentials for Service Principals: you have to use a PAT from your Git provider, which is inevitably tied to a user and has a lifecycle of its own. Doesn't this kind of defeat the purpose of running a job...

Latest Reply
clarkh
New Contributor II
  • 9 kudos

@nicole_lu_PM Running into a similar issue with a job that needs to run in a service principal context and is connected to GitHub to execute a specific file. Would the workaround be to create a PAT for GitHub under the service principal creds?

9 More Replies
ms_221
by New Contributor II
  • 973 Views
  • 1 reply
  • 0 kudos

Need to load data from Databricks to a Snowflake table with an ID column that automatically increments

I want to load the data from a df (say 3 columns c1, c2, c3) into the Snowflake table, say test1, having columns (c1, c2, c3) and an autoincrement ID column. The df and the Snowflake table (test1) have the same column definitions and the same datatypes. In the target tabl...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

To load data from a DataFrame into a Snowflake table with an autoincrement ID column, you can follow these steps: First, ensure that your Snowflake table (test1) is created with an autoincrement ID column: CREATE OR REPLACE TABLE test1 ( ID INT AU...

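
For illustration, a hedged sketch of that approach using the Spark Snowflake connector: write only c1, c2, c3 so Snowflake fills the autoincrement ID. All connection values are placeholders and would normally come from a secret scope; the column_mapping option is assumed to be available in your connector version.

sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

(df.select("c1", "c2", "c3")              # write only the data columns; ID is generated by Snowflake
   .write
   .format("snowflake")
   .options(**sf_options)
   .option("dbtable", "test1")
   .option("column_mapping", "name")      # map by name so the omitted ID column is not an issue
   .mode("append")
   .save())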
Guigui
by New Contributor II
  • 1766 Views
  • 3 replies
  • 0 kudos

Job start time timezone

It is mentioned in the documentation that job.start_time is a value in the UTC timezone, but I wonder if it's always the case, because while start_time is in the UTC timezone for a scheduled job, it is in the local timezone when it is manually trigge...

Latest Reply
Mounika_Tarigop
Databricks Employee
  • 0 kudos

To determine whether a Databricks job was triggered manually or by a schedule, you can use the dynamic value reference {{job.trigger.type}}. T...

2 More Replies
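
For illustration, a small sketch of how the {{job.trigger.type}} reference mentioned above could be surfaced to a notebook task; the parameter name trigger_type is arbitrary.

# In the job task, define a parameter such as: trigger_type = {{job.trigger.type}}
dbutils.widgets.text("trigger_type", "")
trigger_type = dbutils.widgets.get("trigger_type")

if trigger_type.lower() == "manual":
    print("This run was triggered manually")
else:
    print(f"This run was triggered by: {trigger_type}")  # e.g. a schedule or another automated trigger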
RobDineen
by Contributor
  • 1900 Views
  • 4 replies
  • 0 kudos

Resolved! Pyspark to_date not coping with single digit Day or Month

Hi there, I have a simple PySpark to_date function call, but it fails on days or months from 1-9. So is there a nice, easy way to get round this at all? Regards, Rob

(screenshot attached)
Latest Reply
RobDineen
Contributor
  • 0 kudos

Resolved using format_string:
from pyspark.sql.functions import when, format_string

dff = df.withColumn(
    "DayofMonthFormatted",
    when(df.DayofMonth.isin([1, 2, 3, 4, 5, 6, 7, 8, 9]), format_string("0%d", df.DayofMonth))
    .otherwise(df.DayofMonth)
)

3 More Replies
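
As an alternative sketch (not the poster's solution): with Spark 3 datetime patterns, single-letter "d" and "M" accept one- or two-digit values, so single-digit days and months can often be parsed without padding. The column name and pattern below are assumptions about the input layout.

from pyspark.sql.functions import to_date, col

# "date_str" and the "d/M/yyyy" pattern are placeholders; adjust to the real column and format
df2 = df.withColumn("parsed_date", to_date(col("date_str"), "d/M/yyyy"))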
Avinash_Narala
by Valued Contributor II
  • 2837 Views
  • 2 replies
  • 2 kudos

Fully serverless databricks SaaS

I'm exploring Databricks' fully serverless SaaS option, as shown in the attached image, which promises quick setup and $400 in initial credits. I'm curious about the pros and cons of using this fully serverless setup. Specifically, would this option b...

Latest Reply
gchandra
Databricks Employee
  • 2 kudos

There are; if you have Spark configs, custom JARs, or init scripts, they won't work. Please check this page for the long list of limitations: https://docs.databricks.com/en/compute/serverless/limitations.html

1 More Reply
rcostanza
by New Contributor III
  • 2036 Views
  • 4 replies
  • 2 kudos

Resolved! Changing git's author field when committing through Databricks

I have a git folder connected to a Bitbucket repo. Whenever I commit something, the commit uses my Bitbucket username (the unique name) in the author field, making it less readable when I'm reading a list of commits. For example, commits end up like this: commi...

Latest Reply
yermulnik
New Contributor II
  • 2 kudos

We just found ourselves suffering from the same issue since we enforced a GitHub ruleset requiring commit emails to match our org email pattern of `*@ourorgdomain.com`.

3 More Replies
dfish8124
by New Contributor II
  • 1076 Views
  • 1 reply
  • 1 kudos

Streaming IoT Hub Data Using Delta Live Table Pipeline

Hello, I'm attempting to stream IoT Hub data using a Delta Live Tables pipeline. The issue I'm running into is that Event Hubs streaming isn't supported on shared clusters ([UNSUPPORTED_STREAMING_SOURCE_PERMISSION_ENFORCED] Data source eventhubs is not ...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @dfish8124, is it possible to share the code with us?

Subhasis
by New Contributor III
  • 1340 Views
  • 2 replies
  • 0 kudos

Small JSON files issue: taking 2 hours to read 3000 files

Hello, I am trying to read 3000 JSON files, each of which has only one record. It is taking 2 hours to read all the files. How can I perform this operation faster? Please suggest.

Latest Reply
Subhasis
New Contributor III
  • 0 kudos

This is the code: df1 = spark.read.format("json").options(inferSchema="true", multiLine="true").load(file1)

1 More Reply
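
For illustration, a sketch of reading all the files in a single pass with an explicit schema instead of looping over them with inferSchema; the path and field names are placeholders.

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder schema matching the single-record JSON files
schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

df_all = (spark.read
    .schema(schema)                      # avoids inferring the schema across 3000 files
    .option("multiLine", "true")
    .json("/mnt/landing/json_dir/"))     # one read over the whole directory (placeholder path)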
NemesisMF
by New Contributor II
  • 1267 Views
  • 4 replies
  • 2 kudos

Obtain refresh mode from within Delta Live Table pipeline run

Is it possible to somehow determine whether a DLT pipeline run is running in full refresh or incremental mode from within a notebook running in the pipeline? I looked into the pipeline configuration variables but could not find anything. It would be beneficial...

Latest Reply
NemesisMF
New Contributor II
  • 2 kudos

We found a solution where we do not need to determine the refresh mode anymore. But I still do not know how to get the current refresh mode of the current pipeline run from within a notebook that is running in the pipeline. This may still be be...

3 More Replies
SALP_STELLAR
by New Contributor
  • 1556 Views
  • 1 reply
  • 0 kudos

AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Server failed to a

Actually my first part of the code works fine: dbutils.widgets.text("AutoLoanFilePath", "") inputPath = dbutils.widgets.get("AutoLoanFilePath") # inputPath = 'SEPT_2024/FAMILY_SECURITY' autoPath = 'dbfs:/mnt/dbs_adls_mnt/Prod_landing/' + inputPath autoLoa...

Latest Reply
SparkJun
Databricks Employee
  • 0 kudos

This looks like an authentication issue when trying to access Azure Blob Storage from your Databricks environment. Can you please check the storage credentials and the setup?  Consider using an Azure AD service principal with the appropriate RBAC rol...

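
For illustration, a sketch of the service-principal (OAuth) configuration that reply points toward for ADLS Gen2 access; the storage account, secret scope, and key names are placeholders.

storage_account = "<storage-account>"
client_id = dbutils.secrets.get("my-scope", "sp-client-id")          # hypothetical secret scope/keys
client_secret = dbutils.secrets.get("my-scope", "sp-client-secret")
tenant_id = dbutils.secrets.get("my-scope", "sp-tenant-id")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")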
L1000
by New Contributor III
  • 561 Views
  • 1 reply
  • 0 kudos

How to detect gap in filenames (Autoloader)

So my files arrive at the cloud storage and I have configured Auto Loader to read these files. The files have a monotonically increasing id in their name. How can I detect a gap and stop the DLT as soon as there is a gap? E.g. Auto Loader finds file1, in...

Latest Reply
SparkJun
Databricks Employee
  • 0 kudos

It doesn't seem like this can be done through the DLT Auto Loader, particularly since you require an automatic stop without manual intervention. You can write a custom Structured Streaming job with sequence-checking logic and use foreachBatch to process i...

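
For illustration, a sketch of the custom Structured Streaming + foreachBatch approach described in that reply; the landing path, the file<N> name pattern, the sink table, and the in-memory bookkeeping are simplifying assumptions.

from pyspark.sql.functions import col, regexp_extract

last_seen_id = 0  # in practice, persist this across runs (e.g. in a small Delta table)

def check_sequence(batch_df, batch_id):
    global last_seen_id
    ids = (batch_df
           .select(regexp_extract(col("_metadata.file_path"), r"file(\d+)", 1).cast("long").alias("file_id"))
           .where(col("file_id").isNotNull())
           .distinct()
           .collect())
    for row in sorted(ids, key=lambda r: r.file_id):
        if row.file_id != last_seen_id + 1:
            raise Exception(f"Gap detected: expected file{last_seen_id + 1}, got file{row.file_id}")
        last_seen_id = row.file_id
    batch_df.drop("_metadata").write.mode("append").saveAsTable("target_table")  # placeholder sink

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")     # placeholder format
    .load("/mnt/landing/")                   # placeholder path
    .select("*", "_metadata")                # expose the source file path for the check
    .writeStream
    .foreachBatch(check_sequence)
    .option("checkpointLocation", "/mnt/checkpoints/gap_check")
    .start())

Raising inside foreachBatch fails the batch and stops the stream, which is the automatic stop the question asks for.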
rvo19941
by New Contributor II
  • 725 Views
  • 1 reply
  • 0 kudos

Auto Loader with File Notification mode not picking up new files in Delta Live Tables pipeline

Dear all, I am developing a Delta Live Tables pipeline and use Auto Loader with File Notification mode to pick up files inside an Azure storage account (which is not the storage used by the default catalog). When I do a full refresh of the target streaming table, ...

(screenshot attached)
Latest Reply
SparkJun
Databricks Employee
  • 0 kudos

Based on the error "Invalid configuration value detected for fs.azure.account.key", the pipeline was still trying to use the account key authentication method instead of service principal au...

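
For illustration, a sketch of Auto Loader file-notification options configured with a service principal rather than an account key; every value below is a placeholder.

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.clientId", dbutils.secrets.get("my-scope", "sp-client-id"))
    .option("cloudFiles.clientSecret", dbutils.secrets.get("my-scope", "sp-client-secret"))
    .option("cloudFiles.tenantId", dbutils.secrets.get("my-scope", "sp-tenant-id"))
    .option("cloudFiles.subscriptionId", "<azure-subscription-id>")
    .option("cloudFiles.resourceGroup", "<resource-group>")
    .load("abfss://<container>@<storage-account>.dfs.core.windows.net/<path>"))

Note that these options cover the event-notification setup; data access to the storage account itself still needs its own service-principal configuration rather than an account key.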
dipali_globant
by New Contributor II
  • 1181 Views
  • 2 replies
  • 0 kudos

Parsing a JSON string value column into a DataFrame structure

Hi All, I have to read a Kafka payload which has a value column containing a JSON string. But the format of the JSON is as below: { "data": [ { "p_al4": "N/A", "p_a5": "N/A", "p_ad": "OA003", "p_aName": "Abc", "p_aFlag": true, ... (dynamic) } ] } In the data key it can ...

Latest Reply
dipali_globant
New Contributor II
  • 0 kudos

No, I don't know the elements in the JSON, so I can't define the structure.

1 More Reply
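
For illustration, one common pattern when the JSON keys are not known up front: infer the schema at run time from a sample payload with schema_of_json, then parse with from_json and explode the data array. kafka_df, the value column, and the sample string are assumptions based on the post, and keys missing from the sample will not appear in the inferred schema.

from pyspark.sql.functions import col, explode, from_json, lit, schema_of_json

# Representative sample payload (taken from the post); used only to derive a schema
sample_json = '{"data": [{"p_al4": "N/A", "p_a5": "N/A", "p_ad": "OA003", "p_aName": "Abc", "p_aFlag": true}]}'
inferred_schema = spark.range(1).select(schema_of_json(lit(sample_json)).alias("s")).first()["s"]

parsed = (kafka_df
    .withColumn("json", from_json(col("value").cast("string"), inferred_schema))
    .select(explode(col("json.data")).alias("rec"))
    .select("rec.*"))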
