Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ck7007
by New Contributor II
  • 51 Views
  • 3 replies
  • 3 kudos

Advanced Technique

Reduced Monthly Databricks Bill from $47K to $12.7K. The Problem: We were scanning 2.3TB for queries needing only 8GB of data. Three Quick Wins: 1. Multi-dimensional Partitioning (30% savings). # Before: df.write.partitionBy("date").parquet(path) # After: parti...
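The preview cuts off mid-snippet; below is a minimal sketch of the before/after partitioning idea it describes. The paths and the added partition columns are assumptions, since the original snippet is truncated:

```python
# Minimal sketch of multi-dimensional partitioning (hypothetical paths/columns).
# Partitioning on the dimensions queries actually filter on lets Spark prune
# whole directories instead of scanning the full dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/raw/events")  # hypothetical source

# Before: single-dimension partitioning; any non-date filter scans everything
df.write.partitionBy("date").mode("overwrite").parquet("s3://bucket/curated/events")

# After: partition on the columns used in WHERE clauses (keep cardinality low)
(df.write
   .partitionBy("date", "region", "event_type")  # assumed columns
   .mode("overwrite")
   .parquet("s3://bucket/curated/events_v2"))
```

Partition columns should be low-cardinality and match the filters queries actually use; over-partitioning produces many small files and can make costs worse.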

Latest Reply
BS_THE_ANALYST
Honored Contributor III
  • 3 kudos

@ck7007 no worries. I asked a question on the other thread: https://community.databricks.com/t5/data-engineering/cost/td-p/130078, I'm not sure if you're classing this thread as the duplicate or the other one so I'll repost. I didn't see you mention ...

2 More Replies
Pratikmsbsvm
by Contributor
  • 70 Views
  • 2 replies
  • 0 kudos

Read Files from Adobe and Push to Delta table ADLS Gen2

The upstream is sending 2 files with different schemas. The storage account has private endpoints; there is no public access. No Public IP (NPIP) = yes. How to design using only Databricks: 1. Databricks API to read data file from Adobe and push it to AD...
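A rough sketch of step 1 under stated assumptions: the Adobe endpoint, token, and storage paths below are placeholders, and with NPIP enabled the cluster still needs a configured egress path (NAT gateway or firewall) to reach the Adobe API:

```python
# Sketch: pull a file from Adobe over HTTPS on the driver, then land it in a
# Delta table on ADLS Gen2. URL, token, and paths are all hypothetical.
import pandas as pd
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

resp = requests.get(
    "https://adobe.example.com/export/customers.csv",  # hypothetical endpoint
    headers={"Authorization": "Bearer <token>"},
    timeout=60,
)
resp.raise_for_status()

local_path = "/tmp/customers.csv"
with open(local_path, "wb") as f:
    f.write(resp.content)

# Small files can go through pandas on the driver; large files should be
# staged to cloud storage first.
df = spark.createDataFrame(pd.read_csv(local_path))
(df.write.format("delta")
   .mode("append")
   .save("abfss://bronze@mystorage.dfs.core.windows.net/adobe/customers"))
```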

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @Pratikmsbsvm, okay, since you're going to use Databricks compute for data extraction and you wrote that your workspace is deployed with the secure connectivity cluster (NPIP) option enabled, you first need to make sure that you have a stable egre...

1 More Replies
brian999
by Contributor
  • 3356 Views
  • 5 replies
  • 2 kudos

Resolved! Managing libraries in workflows with multiple tasks - need to configure a list of libs for all tasks

I have workflows with multiple tasks, each of which needs 5 different libraries to run. When I have to update those libraries, I have to go in and make the update in each and every task. So for one workflow I have 20 different places where I have to g...

Latest Reply
brian999
Contributor
  • 2 kudos

Actually I think I found most of a solution here in one of the replies: https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters/m-p/37365/highlight/true#M245 It seems like I only have to define libs for the...
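For anyone who still needs to keep many tasks in sync, one alternative is to patch every task's library list programmatically. A sketch using the Databricks Python SDK, where the job ID and packages are placeholders:

```python
# Sketch: set the same library list on every task of a job in one shot,
# using the Databricks Python SDK. Job ID and packages are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import Library, PythonPyPiLibrary

w = WorkspaceClient()
job_id = 123456789  # hypothetical

common_libs = [
    Library(pypi=PythonPyPiLibrary(package="requests==2.32.3")),
    Library(pypi=PythonPyPiLibrary(package="pandas==2.2.2")),
]

job = w.jobs.get(job_id=job_id)
for task in job.settings.tasks:
    task.libraries = common_libs  # same list everywhere

w.jobs.update(job_id=job_id, new_settings=job.settings)
```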

4 More Replies
IONA
by New Contributor
  • 66 Views
  • 3 replies
  • 2 kudos

Getting data from the Spark query profiler

When you navigate to Compute > Select Cluster > Spark UI > JDBC/ODBC, you can see grids of session stats and SQL stats. Is there any way to get this data in a query so that I can do some analysis? Thanks

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Hi @IONA, as @BigRoux correctly suggested, there is no native way to get stats from the JDBC/ODBC Spark UI. 1. You can try the query history system table, but it has a limited number of metrics: %sql SELECT * FROM system.query.history 2. You can use /api/2....
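A minimal sketch of option 1, assuming the system schema is enabled on the workspace; available columns vary by release, so start from SELECT * and inspect:

```python
# Sketch: pull recent entries from the query history system table for
# offline analysis in a notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
history = spark.sql("""
    SELECT *
    FROM system.query.history
    LIMIT 100
""")
history.show(truncate=False)
```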

2 More Replies
der
by New Contributor III
  • 164 Views
  • 1 reply
  • 0 kudos

DBR 17.1 Spatial SQL Functions and Apache Sedona

I noticed in the DBR 17.1 release notes that ST geospatial functions are now in public preview - great news for us since this means native support in Databricks. https://docs.databricks.com/aws/en/release-notes/runtime/17.1#expanded-spatial-sql-expres...
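A minimal sketch of calling the preview functions; the names follow the standard ST_ convention from the release notes, but the exact functions and signatures available on a given runtime are an assumption:

```python
# Sketch of the DBR 17.1 spatial SQL functions (public preview).
# st_point / st_astext are assumed from the standard ST_ naming; verify
# against the release notes for your runtime.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    SELECT st_astext(st_point(11.39, 47.27)) AS wkt
""").show(truncate=False)
```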

Latest Reply
der
New Contributor III
  • 0 kudos

@mjohns @dbkent do you know more about Apache Sedona and DBR SQL Spatial Functions?

absan
by New Contributor II
  • 253 Views
  • 3 replies
  • 2 kudos

Lakeflow Connect SchemaParseException: Illegal character

Hi, I'm trying to set up Lakeflow Connect for SQL Server. The created gateway is failing with "org.apache.avro.SchemaParseException: Illegal character in: LN.FWH-ID". Unfortunately, I don't have control over the source database to change the column names. I...

Latest Reply
hippo
New Contributor II
  • 2 kudos

Okay, so there are a few options I could find: 1. Create a stored procedure that, when creating CDC tables, creates an intermediary clean table and runs CDC off of that. That table can use triggers to keep data in sync (better for lower volumes)....
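A minimal sketch of the intermediary clean-table idea, assuming pyodbc for connectivity; the connection string and all object names besides the FWH-ID column are hypothetical:

```python
# Sketch of option 1: a copy of the source table with Avro-safe column names
# (no '-'), created on the source server so CDC can run against it instead.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=user;PWD=..."  # placeholder
)
cur = conn.cursor()
cur.execute("""
    SELECT [FWH-ID] AS FWH_ID,    -- rename the column Avro rejects
           OtherCol1, OtherCol2   -- hypothetical remaining columns
    INTO dbo.LN_clean
    FROM dbo.LN
""")
conn.commit()
```

Triggers on the source table, as the reply notes, would then keep the clean copy in sync.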

2 More Replies
guilhermecs001
by Visitor
  • 52 Views
  • 1 reply
  • 2 kudos

How to work with 300 billion rows and 5 columns?

Hi guys! I'm having a problem at work where I need to process a customer data dataset with 300 billion rows and 5 columns. The transformations I need to perform are "simple," like joins to assign characteristics to customers. And at the end of the pro...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Hi @guilhermecs001, wow, that's a massive number of rows. Can you somehow preprocess this huge CSV file first? For example, read the CSV, partition by some columns that make sense (maybe the country the customer comes from), and save that data as de...
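A minimal sketch of that preprocessing, where the storage paths and the country partition column are placeholders:

```python
# Sketch: read the raw CSV once, write it out as Delta partitioned by a
# low-cardinality column, then run the joins against the Delta copy
# instead of re-parsing CSV every time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "false")  # supply an explicit schema at this scale
       .csv("abfss://raw@storage.dfs.core.windows.net/customers/"))

(raw.write
    .format("delta")
    .partitionBy("country")             # hypothetical partition column
    .mode("overwrite")
    .save("abfss://bronze@storage.dfs.core.windows.net/customers_delta/"))
```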

ManojkMohan
by Contributor III
  • 163 Views
  • 11 replies
  • 9 kudos

Ingesting 100 TB raw CSV data into the Bronze layer in Parquet + Snappy

Problem I am trying to solve: Bronze is the landing zone for immutable, raw data. At this stage, I am trying to use a columnar format (Parquet or ORC) → good compression, efficient scans, and then apply lightweight compression (e.g., Snappy) → balances...
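A minimal sketch of that Bronze write, with placeholder paths; note that Snappy is already Spark's default Parquet codec, so the explicit option mostly documents intent:

```python
# Sketch: land raw CSV in the Bronze layer as Parquet with Snappy compression.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("header", "true").csv("s3://landing/raw_csv/")

(raw.write
    .format("parquet")
    .option("compression", "snappy")  # Spark's default for Parquet
    .mode("append")
    .save("s3://lake/bronze/events/"))
```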

Latest Reply
ManojkMohan
Contributor III
  • 9 kudos

Thanks all for your suggestions; trying the optimal next steps based on these responses. Will post an update here with screenshots soon.

10 More Replies
felix4572
by New Contributor
  • 107 Views
  • 6 replies
  • 2 kudos

transformWithStateInPandas throws "Spark connect directory is not ready" error

Hello, we employ arbitrary stateful aggregations in our data processing streams on Azure Databricks, and would like to migrate from applyInPandasWithState to transformWithStateInPandas. We employ the Python API throughout our solution, and some of our...

Latest Reply
Advika
Databricks Employee
  • 2 kudos

Hello @felix4572! Could you please share the driver log, or even better, the executor log (without any sensitive details)?

5 More Replies
DataDev
by Visitor
  • 61 Views
  • 4 replies
  • 3 kudos

Schedule databricks job based on custom calendar

I want to schedule Databricks jobs based on a custom calendar, e.g., skip the job run on arbitrary days or holidays. #databricks @DataBricks @DATA

Latest Reply
Pilsner
Contributor
  • 3 kudos

Hello @DataDev Nice idea, I haven't thought about this before, but I like the suggestion. If I had to implement a custom schedule, there are two ways that come to mind. Firstly, if the schedule is relatively regular, with just an occasional day missed,...
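One common pattern (possibly one of the two alluded to above) is to schedule the job daily and skip in code; a minimal sketch, with placeholder dates:

```python
# Sketch: the first task of the job checks today's date against a custom
# calendar and exits early on skip days. Dates below are placeholders.
import datetime

SKIP_DATES = {
    datetime.date(2025, 12, 25),
    datetime.date(2026, 1, 1),
}

if datetime.date.today() in SKIP_DATES:
    # dbutils is available in Databricks notebooks without an import;
    # this ends the task successfully.
    dbutils.notebook.exit("Skipped: custom-calendar day")

# ...normal job logic continues here...
```

Note that downstream tasks would still run after a successful exit; skipping a whole multi-task job would additionally need an If/else condition or a task value check.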

3 More Replies
victorNilsson
by New Contributor II
  • 46 Views
  • 1 reply
  • 1 kudos

Read polars from recently created csv file

More and more Python packages are transitioning to Polars instead of, e.g., pandas. There is a problem with this in Databricks when trying to read a CSV file with pl.read_csv("filename.csv") when the file has been created in the same notebook cel...

Data Engineering
csv
file system
OSError
polars
Latest Reply
Pilsner
Contributor
  • 1 kudos

Hello @victorNilsson I have tried to replicate this issue on my end, but unfortunately was unsuccessful as it worked the first time for me. I have, however, still tried to search for a solution. I believe the issue you are getting could be linked to t...
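One hedged workaround, under the assumption that the OSError comes from reading a just-written file on a fuse-mounted path (/dbfs or /Workspace): keep the write/read round-trip on the driver's local disk and copy the file out afterwards:

```python
# Sketch: write and immediately re-read the CSV on driver-local disk, where
# fresh writes are visible right away, then copy to DBFS for other consumers.
# Paths are placeholders.
import shutil
import polars as pl

local_path = "/tmp/filename.csv"
pl.DataFrame({"a": [1, 2, 3]}).write_csv(local_path)

df = pl.read_csv(local_path)  # local read of the freshly written file
print(df.shape)

shutil.copy(local_path, "/dbfs/tmp/filename.csv")  # hypothetical target
```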

Sainath368
by New Contributor III
  • 80 Views
  • 1 reply
  • 0 kudos

Is Photon Acceleration Helpful for All Maintenance Tasks (OPTIMIZE, VACUUM, ANALYZE_COMPUTE_STATS)?

Hi everyone,We’re currently reviewing the performance impact of enabling Photon acceleration on our Databricks jobs, particularly those involving table maintenance tasks. Our job includes three main operations: OPTIMIZE, VACUUM, and ANALYZE_COMPUTE_S...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @Sainath368, I wouldn't use Photon for this kind of task. You should use it primarily for ETL transformations, where it shines. VACUUM and OPTIMIZE are more maintenance tasks, and using Photon for them would be a pricey overkill. According to documentatio...
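For reference, the three operations from the question as they'd run on a regular (non-Photon) job cluster; the table name is a placeholder:

```python
# Sketch: run the three maintenance operations from a notebook or job task.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table = "main.sales.transactions"  # hypothetical table

spark.sql(f"OPTIMIZE {table}")
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")  # default retention window
spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR ALL COLUMNS")
```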

merca
by Valued Contributor II
  • 12016 Views
  • 13 replies
  • 7 kudos

Value array {{QUERY_RESULT_ROWS}} in Databricks SQL alerts custom template

Please include in the documentation an example of how to incorporate the `QUERY_RESULT_ROWS` variable in a custom template.

Latest Reply
CJK053000
New Contributor III
  • 7 kudos

Databricks confirmed this was an issue on their end and it should be resolved now. It is working for me.

12 More Replies
dbdev
by New Contributor II
  • 559 Views
  • 8 replies
  • 3 kudos

Maven libraries in VNet injected, UC enabled workspace on Standard Access Mode Cluster

Hi! As the title suggests, I want to install Maven libraries on my cluster with access mode 'Standard'. Our workspace is VNet injected and has Unity Catalog enabled. The coordinates have been allowlisted by the account team according to these instructio...

Latest Reply
dbdev
New Contributor II
  • 3 kudos

@nayan_wylde @szymon_dybczak I just tried using a JAR I uploaded to an allowlisted Volume (Oracle's ojdbc8) and I get the same error. It seems like I'm able to install a JAR, but once it's installed my cluster is broken.

7 More Replies
Vamsi_S
by New Contributor
  • 55 Views
  • 1 reply
  • 0 kudos

Ingest data from SQL Server

I've been working on data ingestion from SQL Server to UC using Lakeflow Connect. Lakeflow Connect actually made the work easier when everything is right. I am trying to incorporate this with DAB, and this works fine with schema and table tags fo...

Latest Reply
Khaja_Zaffer
Contributor
  • 0 kudos

Hello @Vamsi_S Good day! Did you try preprocessing table names in CI/CD and generating YAML dynamically (recommended for dynamic, automated ingestion)? Did you contact your Databricks account manager (in case you're working with a company) for a feature request...
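A minimal sketch of the generate-YAML-in-CI/CD idea; the resource structure below is illustrative, not the exact Lakeflow Connect ingestion schema:

```python
# Sketch: build a bundle resource file from a table list discovered in CI,
# then let the DAB deploy step include the generated YAML.
import yaml

tables = ["dbo.orders", "dbo.customers", "dbo.items"]  # discovered in CI

resource = {
    "resources": {
        "pipelines": {
            "sqlserver_ingestion": {
                "name": "sqlserver-ingestion",
                # illustrative shape; adapt to the real ingestion spec
                "objects": [{"table": {"source": t}} for t in tables],
            }
        }
    }
}

with open("resources/ingestion.generated.yml", "w") as f:
    yaml.safe_dump(resource, f, sort_keys=False)
```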

