Hi @Milliman, In Databricks, you can automate the re-run of a job if any of its associated tasks fail.
Here are some steps to achieve this:
Conditional Task Execution:
You can specify “Run if dependencies” to run a task based on the run status o...
Hi,Does anyone know how to link Aurora to Databricks directly and load data into Databricks automatically on a schedule without any third-party tools in the middle?
Hi @creditorwatch, To ingest data into Databricks directly from Amazon Aurora and automate the process on a schedule, you have a few options.
Let’s explore them:
Auto Loader (Recommended):
Auto Loader is a powerful feature in Databricks that eff...
In Azure Databricks the DBFS storage account is open to all networks. Changing that to use a private endpoint or minimizing access to selected networks is not allowed.Is there any way to add network security to this storage account? Alternatively, is...
How can we secure the storage account in the managed resource group which holds the DBFS with restricted network access, since access from all networks is blocked by our Azure storage account policy?
I have a medallion architecture: Bronze layer: Raw data in tablesSilver layer: Refined data in views created from the bronze layerGold layer: Data products as views created from the silver layerCurrently I have a data scientist that needs access to d...
Single-user clusters use a different security mode which is the reason for this difference.
On single-user/assigned clusters, you'll need the Fine Grained Access Control service (which is a Serverless service) - that is the solution to this problem (...
I'm trying to addmonotonicallyIncreasingId() column to a streaming table and I see the following errorFailed to start stream [table_name] in either append mode or complete mode.
Append mode error: Expression(s): monotonically_increasing_id() is not s...
Hi team, When I create a DLT job, is there a way to control the cluster runtime version somewhere? E.g. I want to use 14.3 LTS. I tried to add `"spark_version": "14.3.x-scala2.12",` inside cluster default label but not work.Thanks
Thanks. Got it.And the cluster has to be share mode. Can different DLT jobs share clusters or when DLT job is running, can other people use the cluster? Seems each DLT job running will start a new cluster. If it is not be able to shared, why it has t...
Hi Team,Can we pass Delta Live Table name dynamically [from a configuration file, instead of hardcoding the table name]? We would like to build a metadata-driven pipeline.
I am observing same error while I adding dataset.tablename. org.apache.spark.sql.catalyst.ExtendedAnalysisException: Materializing tables in custom schemas is not supported. Please remove the database qualifier from table 'streaming.dlt_read_test_fil...
Can someone explain why this below code is throwing an error? My intuition is telling me it's my spark version (3.2.1) but would like confirmation:d = {'key':['a','a','c','d','e','f','g','h'],
'data':[1,2,3,4,5,6,7,8]}
x = ps.DataFrame(d)
x[x['...
@pjp94 - The error indicates the pandas pyspark implementation does not have the below method implemented.
pd.Series.duplicated()
Next steps is to use dataframe methods such as distinct, groupBy, dropDuplicates to resolve this.
TimeoutException: Stream Execution thread for stream [id = xxx runId = xxxx] failed to stop within 15000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.I have a data...
@User_1611 - could you please try the following ?
Reduce the number of streaming queries running on the same clusterMake sure your code does not try to re-trigger/start an active streaming queryMake sure to collect the thread dumps if this error hap...
I have 50k + parquet files in the in azure datalake and i have mount point as well. I need to read all the files and load into a dataframe. i have around 2 billion records in total and all the files are not having all the columns, column order may di...
@Shan1 - This could be due to the files have cols that differ by data type. Eg. Integer vs long , Boolean vs integer. can be resolved by schemaMerge=False. Please refer to this code. https://github.com/apache/spark/blob/418bba5ad6053449a141f3c9c31e...
Hi everyone,I am using DBR version 13 and Managed tables in a custom catalog location of table is AWS S3.running notebook on single user clusterI am facing MalformedInputException while saving data to Tables or reading it.When I am running my noteboo...
@Kaniz The issue is resolved as soon as I deployed it to mutlinode dev cluster.Issue is only occurring in single user clusters. Looks like limitation of running all updates in one node as distributed system.
There is no resource to create All Purpose Cluster, but I need it, so does it mean I should create it via Terraform or DBX and reference to it, which I dont prefer?
Is there a way to get a child Job Run status and show the result within the parent notebook execution?Here is the case: I have a master notebook and several child notebooks. As a result, I want to see which notebook failed: For example Notebook job s...
Hey there! Thanks a bunch for being part of our awesome community! We love having you around and appreciate all your questions. Take a moment to check out the responses – you'll find some great info. Your input is valuable, so pick the best solution...
I have a notebook where I read multiple tables from delta lake (let say schema is db) and after that I did some sort of transformation (image enclosed) using all these tables lwith transformations like join,filter etc. After transformation and writin...
I have a DLT with a table that I want to contain the running aggregation (for the sake of simplicitly let's assume it's a count) for each value of some key column, using a session window. The input table goes back several years and to clean up aggreg...
Hi @vroste ,
• To configure the update output mode for running aggregation in Delta Live Tables (DLT), use the outputMode option when writing the DLT table.• By default, DLT writes data in complete mode, which outputs the complete result table after...