Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ksenija
by Contributor
  • 4182 Views
  • 5 replies
  • 5 kudos

How to change cluster size using a script

I want to change instance type or number of max workers via a python script. Does anyone know how to do it/is it possible? I have a lot of background jobs when I want to scale down my workers, so autoscaling is not an option. I was getting an error t...
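
One approach, sketched on the assumption that calling the Clusters REST API from the script is acceptable (workspace URL, token, and cluster ID below are placeholders):

import requests

HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"
headers = {"Authorization": f"Bearer {TOKEN}"}

# clusters/edit replaces the whole cluster spec, so fetch the current one first
spec = requests.get(f"{HOST}/api/2.0/clusters/get",
                    headers=headers,
                    params={"cluster_id": "<cluster-id>"}).json()

edit = {
    "cluster_id": spec["cluster_id"],
    "cluster_name": spec["cluster_name"],
    "spark_version": spec["spark_version"],
    "node_type_id": "Standard_DS4_v2",                  # new instance type
    "autoscale": {"min_workers": 2, "max_workers": 4},  # or "num_workers": n
}
requests.post(f"{HOST}/api/2.0/clusters/edit", headers=headers, json=edit).raise_for_status()

Note that the edit call restarts a running cluster, and the API only accepts edits while the cluster is in a RUNNING or TERMINATED state.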

Latest Reply
Wojciech_BUK
Valued Contributor III
  • 5 kudos

Hi ksenija, this is just my guess, but maybe you are using a cluster policy on your cluster that only allows specific cluster sizes? E.g. a cluster policy like the one below that limits the allowed sizes.
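
A hypothetical policy definition along those lines, in the standard cluster-policy JSON format (node types and limits are placeholders):

{
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_DS3_v2", "Standard_DS4_v2"]
  },
  "autoscale.max_workers": {
    "type": "range",
    "maxValue": 8,
    "defaultValue": 4
  }
}

With a policy like this attached, an API request asking for a node type or worker count outside the allowed set is rejected.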

4 More Replies
SamGreene
by Contributor
  • 4539 Views
  • 6 replies
  • 3 kudos

Change DLT table type from streaming to 'normal'

I have a DLT streaming live table, and after watching a QA session, I saw that it is advised to only use streaming tables for your raw landing. I attempted to modify my pipeline to have my silver table be a regular LIVE TABLE, but an error was throw...

Latest Reply
quakenbush
Contributor
  • 3 kudos

Just curious, could you point me to said QA session if it's a video or something? I'm not aware of such a limitation. You can use DLT's live streaming tables anywhere in the Medallion architecture, just make sure not to break stream composability by ...
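
To illustrate stream composability, a minimal DLT sketch (all table, column, and path names are hypothetical) in which the silver table is itself still a streaming table reading incrementally from bronze:

import dlt
from pyspark.sql.functions import col

@dlt.table
def bronze_events():
    # raw landing zone: streaming ingest with Auto Loader
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/landing/events"))

@dlt.table
def silver_events():
    # reading bronze with read_stream keeps the stream composable end to end
    return dlt.read_stream("bronze_events").where(col("event_type").isNotNull())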

5 More Replies
quakenbush
by Contributor
  • 2125 Views
  • 2 replies
  • 0 kudos

Delta Lake, CDF & SCD2

Hi, what's the best way to deal with SCD2-styled tables in the silver and/or gold layer while streaming? From what I've seen in the Professional Data Engineer videos, they usually go for SCD1 tables (simple updates or deletes). In a SCD2 scenario, we need to ...

Latest Reply
quakenbush
Contributor
  • 0 kudos

I did some further reading and came to the same conclusion. APPLY CHANGES might do the trick. However, I don't like the limitations. From bronze to silver I might need .foreachBatch to implement the JSON logic, and the attribute names (__start_at / __end_...
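
For reference, a sketch of the APPLY CHANGES pattern discussed here in its Python form, with hypothetical table and column names (the generated __START_AT/__END_AT columns are fixed by DLT, which appears to be the limitation mentioned above):

import dlt

dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze",   # CDC feed prepared in bronze
    keys=["customer_id"],
    sequence_by="event_ts",      # ordering column for late/out-of-order events
    stored_as_scd_type=2,        # keep full history as SCD2
)

On older DLT runtimes the target is declared with dlt.create_streaming_live_table instead of dlt.create_streaming_table.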

1 More Replies
lena1
by New Contributor
  • 1025 Views
  • 1 reply
  • 0 kudos

Resource exhaustion when using default apply_changes python functionality

Hello! We are currently setting up streaming CDC pipelines for more than 500 tables. Due to the high number of tables, we split them across multiple pipelines and use multiple DLT pipelines per layer: bronze, silver, gold. In silver, we only upsert ...

Latest Reply
Wojciech_BUK
Valued Contributor III
  • 0 kudos

Hi Lena1, there is no magic behind the scenes. If you write readStream from the bronze table and writeStream with foreachBatch(function), and inside the function you run a MERGE statement, this will have similar performance. Maybe there is a lot of shuffling ha...
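
A sketch of that foreachBatch + MERGE pattern, with hypothetical table, key, and path names:

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # merge one micro-batch into the silver target
    target = DeltaTable.forName(spark, "silver.customers")
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.customer_id = s.customer_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream.table("bronze.customers")
      .writeStream
      .foreachBatch(upsert_batch)
      .option("checkpointLocation", "/chk/customers")
      .start())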

Long_Tran
by New Contributor
  • 1669 Views
  • 1 reply
  • 0 kudos

Can job 'run_as' be assigned to users/principals who actually run it?

Can the job 'run_as' be assigned to the user/principal who actually runs it, instead of always a fixed creator/user/principal? When a job is run, I would like to see in the job setting "run_as" the name of the actual user/principal who runs it. Currently, "run...

Latest Reply
Wojciech_BUK
Valued Contributor III
  • 0 kudos

This is not available in Workflows/Jobs. A job should never be run as the person who is executing it, especially in production. The reason is that the output might not be the same, depending on the person who is running the job (e.g. different row-level access). If...
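
For reference, run_as is a static job setting (Jobs API 2.1 and asset bundles), typically pointed at a service principal rather than a person; the application ID below is a placeholder:

{
  "run_as": {
    "service_principal_name": "<application-id>"
  }
}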

esauesp_co
by New Contributor III
  • 5400 Views
  • 5 replies
  • 1 kudos

Resolved! My jobs and cluster were deleted in a suspicious way

I want to know what happened to my cluster and whether I can recover it. I logged in to my Databricks account and couldn't find my jobs or my cluster. I couldn't find any log of the deleted cluster because the log lives inside the cluster interface. I entered t...

Latest Reply
Sid_databricks
New Contributor II
  • 1 kudos

Dear folks, when the table has been deleted, why am I unable to create a table with the same name? It continuously gives me the error "DeltaAnalysisException: Cannot create table ('`spark_catalog`.`default`.`Customer_Data`'). The associated location ('...
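
The usual cause of that DeltaAnalysisException is that dropping the table left files behind at its storage location, so the new CREATE collides with a non-empty directory. A hedged sketch of the common cleanup; the path is a placeholder, should be taken from the full error message, and deleted only if the leftover data is disposable:

# location string comes from the error message; placeholder shown here
dbutils.fs.rm("dbfs:/user/hive/warehouse/customer_data", True)  # recursive delete
spark.sql("CREATE TABLE spark_catalog.default.Customer_Data (id INT, name STRING)")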

4 More Replies
feed
by New Contributor III
  • 8988 Views
  • 6 replies
  • 3 kudos

TesseractNotFoundError

TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information. (in Databricks)

Latest Reply
neha_ayodhya
New Contributor II
  • 3 kudos

%sh apt-get install -y tesseract-ocr: this command is not working in my new Databricks free trial account; earlier it worked fine in my old Databricks instance. I get the error below: E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Per...
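
The "Permission denied" on the dpkg lock suggests the shell is not running as root on that compute (and trial or serverless compute may not allow OS-level installs at all). On clusters you control, the usual route is a cluster-scoped init script, which does run as root; a minimal sketch:

#!/bin/bash
# cluster-scoped init script: installs Tesseract before the cluster starts
apt-get update -y
apt-get install -y tesseract-ocr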

5 More Replies
MattPython
by New Contributor
  • 21693 Views
  • 4 replies
  • 0 kudos

How do you read files from the DBFS with OS and Pandas Python libraries?

I created translations for decoded values and want to save the dictionary object to the DBFS for mapping. However, I am unable to access the DBFS without using dbutils or the PySpark library. Is there a way to access the DBFS with the os and pandas Python libra...

Latest Reply
User16789202230
Databricks Employee
  • 0 kudos

db_path = 'file:///Workspace/Users/l<xxxxx>@databricks.com/TITANIC_DEMO/tested.csv'
df = spark.read.csv(db_path, header="True", inferSchema="True")
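
For the original question (os and pandas rather than Spark): on classic clusters DBFS is also exposed through the /dbfs FUSE mount, so plain local-file APIs work against it; paths below are placeholders:

import os
import pandas as pd

os.listdir("/dbfs/FileStore")                    # browse DBFS with os
df = pd.read_csv("/dbfs/FileStore/tested.csv")   # read the same file with pandas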

3 More Replies
SimonXu
by New Contributor II
  • 9604 Views
  • 6 replies
  • 15 kudos

Resolved! Failed to launch pipeline cluster

Hi there. I encountered an issue when I was trying to create my Delta Live Tables pipeline. The error is "DataPlaneException: Failed to launch pipeline cluster 1202-031220-urn0toj0: Could not launch cluster due to cloud provider failures. azure_error...

Latest Reply
arpit
Databricks Employee
  • 15 kudos

@Simon Xu I suspect that DLT is trying to grab machine types that you simply have zero quota for in your Azure account. By default, the following machine types get requested behind the scenes for DLT: AWS: c5.2xlarge; Azure: Standard_F8s; GCP: e2-standard-8...
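
If the quota cannot be raised, one workaround is to pin the pipeline's compute to a machine type you do have quota for, in the DLT pipeline settings JSON (node type and sizes are placeholders):

{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "Standard_DS3_v2",
      "autoscale": {"min_workers": 1, "max_workers": 2}
    }
  ]
}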

5 More Replies
shagun
by New Contributor III
  • 5426 Views
  • 3 replies
  • 0 kudos

Resolved! Delta live tables target schema

The first time I run my Delta Live Tables pipeline after setup, I get this error on starting it: org.apache.spark.sql.catalyst.parser.ParseException: Possibly unquoted identifier my-schema-name detected. Please con...

Latest Reply
BenTendo
New Contributor II
  • 0 kudos

This still errors on internal Databricks Spark/Python code like deltaTable.history(). @shagun wrote: The first time I run my delta live table pipeline after setup, I get this error on starting it: org.apache.spark.sql...
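
The ParseException itself points at the hyphen: identifiers containing hyphens have to be backtick-quoted wherever they appear in SQL (or the schema renamed to use underscores), e.g.:

spark.sql("SELECT * FROM `my-schema-name`.my_table")  # backticks around the hyphenated identifier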

2 More Replies
ashdam
by New Contributor III
  • 4339 Views
  • 1 reply
  • 2 kudos

Databricks asset bundles use cluster depending on target (environment) is possible?

Here is my bundle definition:

# This is a Databricks asset bundle definition for my_project.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
experimental:
  python_wheel_wrapper: true
bundle:
  name: my_project
inc...
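
Per-target cluster settings are possible by overriding the resource under each entry in targets. A sketch assuming a job resource named my_job with a job cluster keyed main (names, node types, and sizes are placeholders):

targets:
  dev:
    resources:
      jobs:
        my_job:
          job_clusters:
            - job_cluster_key: main
              new_cluster:
                node_type_id: Standard_DS3_v2
                num_workers: 1
  prod:
    resources:
      jobs:
        my_job:
          job_clusters:
            - job_cluster_key: main
              new_cluster:
                node_type_id: Standard_DS5_v2
                num_workers: 8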

Data Engineering
Databricks Asset Bundles
Nastasia
by New Contributor II
  • 4279 Views
  • 2 replies
  • 1 kudos

Why is Spark creating multiple jobs for one action?

I noticed that when launching this bunch of code with only one action, I have three jobs that are launched.

from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import avg

data:...

Latest Reply
RKNutalapati
Valued Contributor
  • 1 kudos

The above code will create two jobs.

JOB-1: dataframe: DataFrame = spark.createDataFrame(data=data, schema=schema)

The createDataFrame function is responsible for inferring the schema from the provided data or using the specified schema. Depending on the...
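
A cut-down reproduction of the effect described above, with hedged comments (exact job counts vary with Spark version and adaptive query execution):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.sql.functions import avg

schema = StructType([StructField("k", StringType()), StructField("v", DoubleType())])
data = [("a", 1.0), ("b", 2.0)]

df = spark.createDataFrame(data=data, schema=schema)  # can launch its own job while preparing the data
df.groupBy("k").agg(avg("v")).show()                  # the single action; AQE may split it into further jobs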

1 More Replies
bathulaj
by New Contributor
  • 581 Views
  • 0 replies
  • 0 kudos

NonDeterministic UDF's making multiple invocations

Hi, even after defining my UDF as nondeterministic, like here:

testUDF = udf(testMthod, StringType()).asNondeterministic()

it is still making multiple invocations. Is there anything I am missing here? TIA, -John B
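
asNondeterministic only stops the optimizer from duplicating or relocating the call (for example into filters); Spark remains free to re-evaluate the plan per action or on stage retries. A common workaround, sketched with placeholder names, is to materialize the column once with cache:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

testUDF = udf(testMthod, StringType()).asNondeterministic()  # testMthod as defined in the post

result = df.withColumn("out", testUDF("some_col")).cache()  # df and some_col are placeholders
result.count()  # force one evaluation; later actions reuse the cached rows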

famous_jt33
by New Contributor
  • 1486 Views
  • 2 replies
  • 2 kudos

SQL UDFs for DLT pipelines

I am trying to implement a UDF for a DLT pipeline. I have seen the documentation stating that it is possible but I am getting an error after adding an SQL UDF to a cell in the notebook attached to the pipeline. The aim is to have the UDF in a separat...

Latest Reply
6502
New Contributor III
  • 2 kudos

You can't. The SQL support on a DLT pipeline cluster is limited compared to a normal notebook. You can still define a UDF in Python using, of course, a Python notebook. In this case, you can use the spark.sql() function to execute your original SQL cod...
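
A sketch of that workaround, with hypothetical function and table names, in a Python notebook attached to the pipeline:

import dlt
from pyspark.sql.types import StringType

def get_gender(code):
    return "F" if code == 1 else "M"

spark.udf.register("get_gender", get_gender, StringType())  # now callable from SQL

@dlt.table
def silver_people():
    # run the original SQL through spark.sql() once the UDF is registered
    return spark.sql("SELECT *, get_gender(gender_code) AS gender FROM LIVE.bronze_people")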

1 More Replies
6502
by New Contributor III
  • 997 Views
  • 0 replies
  • 0 kudos

UDF already defined error when using it into a DLT pipeline

I'm using Unity Catalog and defined some UDFs in my catalog.database, as reported by SHOW FUNCTIONS IN main.default:

main.default.getgender
main.default.tointlist
main.default.tostrlist

I can use them from a SQL warehouse (Pro): SELECT main.default.get_...


Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group