Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

xiaozy
by New Contributor
  • 1020 Views
  • 2 replies
  • 1 kudos
Latest Reply
Prabakar
Esteemed Contributor III
  • 1 kudos

Hi @xiaojun wang, please check the blog and let us know if this helps you: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

1 More Reply
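
The linked post covers window functions in Spark SQL; for quick reference, a minimal PySpark sketch of the pattern it describes (the data and column names are purely illustrative):

```
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Illustrative data: revenue per product within a category
df = spark.createDataFrame(
    [("thin", "cell phone", 6000), ("ultra thin", "cell phone", 5000),
     ("mini", "tablet", 5500), ("normal", "tablet", 1500)],
    ["product", "category", "revenue"],
)

# Rank products by revenue within each category using a window function
w = Window.partitionBy("category").orderBy(F.desc("revenue"))
df.withColumn("rank", F.dense_rank().over(w)).show()
```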
dbu_spark
by New Contributor III
  • 5024 Views
  • 10 replies
  • 6 kudos

Older Spark Version loaded into the spark notebook

I have the Databricks runtime for a job set to the latest 10.0 Beta (includes Apache Spark 3.2.0, Scala 2.12). In the notebook, when I check the Spark version, I see version 3.1.0 instead of version 3.2.0. I need Spark version 3.2 to process workloads a...

Latest Reply
jose_gonzalez
Moderator
  • 6 kudos

Hi @Dhaivat Upadhyay, good news: DBR 10 was released yesterday, October 20th. You can find more details on the release notes website.

9 More Replies
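
For reference, a quick way to confirm which Spark version the attached cluster is actually running. In a Databricks notebook the spark session is already defined; the builder call below is only there to keep the snippet self-contained:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Prints the Spark version of the current session, e.g. "3.2.0" on DBR 10.0
print(spark.version)
```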
D3nnisd
by New Contributor III
  • 11212 Views
  • 15 replies
  • 6 kudos

Resolved! BufferHolder Exceeded on Json flattening

On Databricks, we use the following code to flatten JSON in Python. The data is from a REST API: ```df = spark.read.format("json").option("header", "true").option("multiline", "true").load(SourceFileFolder + sourcetable + "*.json") df2 = df.select(psf...

Latest Reply
Dan_Z
Honored Contributor
  • 6 kudos

@Dennis D, what's happening here is that more than 2 GB (2,147,483,648 bytes) is being loaded into a single column value. This is a hard limit for serialization. This KB article addresses it. The solution would be to find some way to have this loaded ...

14 More Replies
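
For context, a minimal sketch of the flattening pattern discussed in this thread, with an illustrative source path and illustrative field names (the actual schema is not shown above; note also that the "header" option in the original snippet applies to CSV, not JSON):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative source path; the thread builds it from SourceFileFolder + sourcetable
df = (spark.read.format("json")
      .option("multiline", "true")
      .load("/mnt/source/mytable*.json"))

# Flatten nested structs and explode arrays into separate rows/columns,
# so no single column value has to carry the whole nested payload.
flat = (df
        .select("id", F.explode("items").alias("item"))
        .select("id", "item.*"))
flat.show()
```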
Erik
by Valued Contributor II
  • 1277 Views
  • 4 replies
  • 3 kudos

Feature request: It is possible to add comments to both Databricks SQL databases and tables. It would be really useful if these comments could show u...

Feature request: It is possible to add comments to both Databricks SQL databases and tables. It would be really useful if these comments could show up (if they are provided) in Power BI when one connects to the Databricks SQL endpoint, e.g. in this w...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

Nice idea!

3 More Replies
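
For reference, the comments the request refers to are set with standard DDL; a minimal sketch via spark.sql, where the table name, column names, and comment text are all illustrative. Whether they then surface in Power BI depends on the connector, which is exactly what the feature request asks for:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Comments can be attached at table and column level when creating the table
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_orders (
        order_id BIGINT COMMENT 'Primary key',
        amount   DOUBLE COMMENT 'Order total in EUR'
    )
    USING DELTA
    COMMENT 'One row per customer order'
""")
```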
tarente
by New Contributor III
  • 2298 Views
  • 6 replies
  • 5 kudos

Resolved! How to implement the where-not-exists pattern in Scala?

I have a dataframe with the following columns: Key1, Key2, Y_N_Col, Col1, Col2. For the key tuple (Key1, Key2), I have rows with Y_N_Col = "Y" and Y_N_Col = "N". I need a new dataframe with all rows with Y_N_Col = "Y" (regardless of the key tuple), plus all Y_N_...

Latest Reply
-werners-
Esteemed Contributor III
  • 5 kudos

I'd use a left-anti join. So create a df with all the Y rows, then create a df with all the N rows and do a left_anti join (on Key1 and Key2) against the df with the Y rows, then union those two.

5 More Replies
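
The thread asks for Scala, but the DataFrame API is the same; a minimal PySpark sketch of the left-anti-join-plus-union approach described above (the sample rows are illustrative):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative rows with the columns described in the question
df = spark.createDataFrame(
    [(1, "a", "Y", 10), (1, "a", "N", 20), (2, "b", "N", 30)],
    ["Key1", "Key2", "Y_N_Col", "Col1"],
)

ys = df.filter(F.col("Y_N_Col") == "Y")
ns = df.filter(F.col("Y_N_Col") == "N")

# Keep N rows only where no Y row exists for the same (Key1, Key2),
# then add back all the Y rows.
result = ys.unionByName(ns.join(ys, on=["Key1", "Key2"], how="left_anti"))
result.show()
```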
Programming_Sch
by New Contributor
  • 310 Views
  • 0 replies
  • 0 kudos


What is the future of AWS? The future of AWS is very promising. So, if you are thinking of a cloud career or want to switch your position to something related to the cloud, I would highly recommend going for AWS training. No matter what field you ...

xiaozy
by New Contributor
  • 1648 Views
  • 1 reply
  • 0 kudos
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @xiaozy! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise, I will get back to you soon. Thanks.

User16826992666
by Valued Contributor
  • 831 Views
  • 1 reply
  • 0 kudos

If data from a Delta table is cached in Databricks SQL and the table is altered in the backend, does it invalidate the cache?

Basically I'm worried about the scenario where data that gets cached on Databricks SQL endpoints becomes out of sync with the source Delta table. If that were to happen and data was read from the cache, it would be out of date/incorrect. Is this a con...

Latest Reply
mathan_pillai
Valued Contributor
  • 0 kudos

There are three types of caching: (1) Databricks SQL UI caching, (2) query result caching, and (3) Delta caching. (1) does not get invalidated; like a BI dashboard, it needs to be refreshed manually. (2) and (3) are invalidated automatically. Please check...

nlee
by New Contributor
  • 2387 Views
  • 1 reply
  • 1 kudos

Resolved! How to create a temporary file with sql

What are the commands to create a temporary file with SQL?

Latest Reply
mathan_pillai
Valued Contributor
  • 1 kudos

In Spark SQL, you could use commands like INSERT OVERWRITE DIRECTORY, which indirectly creates a temporary file with the data: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-dml-insert-overwrite-directory.html#example...

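
For reference, a minimal sketch of the INSERT OVERWRITE DIRECTORY command the reply links to, run through spark.sql (the output path and the inline VALUES are illustrative):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Writes the query result out as Parquet files under the given directory
spark.sql("""
    INSERT OVERWRITE DIRECTORY '/tmp/my_export'
    USING parquet
    SELECT * FROM VALUES (1, 'alice'), (2, 'bob') AS t(id, name)
""")
```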
Sumeet_Dora
by New Contributor II
  • 1417 Views
  • 2 replies
  • 4 kudos

Resolved! Write mode features in BigQuery using a Databricks notebook

Currently, using df.write.format("bigquery"), Databricks only supports append and overwrite modes for writing to BigQuery tables. Does Databricks have any option for executing DML like MERGE INTO BigQuery from Databricks notebooks?

Latest Reply
mathan_pillai
Valued Contributor
  • 4 kudos

@Sumeet Dora, unfortunately there is no direct "merge into" option for writing to BigQuery from a Databricks notebook. You could write to an intermediate Delta table using Delta's MERGE INTO option, then read from the Delta table and pe...

1 More Reply
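
A rough sketch of the workaround described above: MERGE INTO an intermediate Delta table, then overwrite the BigQuery table. All table and bucket names are illustrative, and the BigQuery write options follow the spark-bigquery connector's documented options; check them against your connector version:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1) Merge the incoming batch into an intermediate Delta table
spark.table("staging.incoming_orders").createOrReplaceTempView("updates")
spark.sql("""
    MERGE INTO intermediate.orders AS target
    USING updates AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# 2) Push the merged result to BigQuery with overwrite
(spark.table("intermediate.orders")
    .write.format("bigquery")
    .option("table", "my_dataset.orders")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save())
```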
gbrueckl
by Contributor II
  • 4711 Views
  • 10 replies
  • 9 kudos

Slow performance of VACUUM on Azure Data Lake Store Gen2

We need to run VACUUM on one of our biggest tables to free up storage. According to our analysis using VACUUM bigtable DRY RUN, this affects 30M+ files that need to be deleted. If we run the final VACUUM, the file listing takes up to 2h (which is OK) ...

Latest Reply
Deepak_Bhutada
Contributor III
  • 9 kudos

@Gerhard Brueckl, we have seen roughly 80k-120k file deletions per hour in Azure while running VACUUM on Delta tables; the vacuum is simply slower on Azure and S3. It might take some time, as you said, when deleting the files from the Delta path. ...

9 More Replies
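
For reference, a minimal sketch of the dry-run-then-vacuum flow discussed above, via spark.sql (the table name and retention window are illustrative):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Preview what would be deleted before committing to a multi-hour cleanup
spark.sql("VACUUM big_db.bigtable RETAIN 168 HOURS DRY RUN").show(truncate=False)

# Run the actual cleanup once the dry-run numbers look sane
spark.sql("VACUUM big_db.bigtable RETAIN 168 HOURS")
```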
Erik
by Valued Contributor II
  • 3416 Views
  • 8 replies
  • 2 kudos

Run more concurrent tasks than the number of cores.

We are using the Terraform Databricks provider, which starts a cluster and checks every mount (since there is no mount REST API!). Each mount takes 20 seconds to check, 99.9% of that time is idle waiting, and it starts a job per mount. If w...

Latest Reply
jose_gonzalez
Moderator
  • 2 kudos

Hi @Erik Parmann, it is possible to do, but you might also need to enable dynamic allocation at the cluster level to make sure your settings are applied at cluster creation. You can find more details here. As a best practice, we do not recom...

7 More Replies
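
A small sketch for inspecting the settings the reply mentions. The conf keys are standard Spark properties, but treating them as the right knobs for this use case is an assumption, and on Databricks they have to be set in the cluster's Spark config at creation time rather than from a running notebook:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cluster-level Spark config the reply refers to (set at cluster creation), e.g.:
#   spark.dynamicAllocation.enabled true
# Inspect what the running cluster currently uses:
print(spark.conf.get("spark.dynamicAllocation.enabled", "not set"))
print(spark.conf.get("spark.executor.cores", "not set"))
```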
Jon
by New Contributor II
  • 13479 Views
  • 3 replies
  • 5 kudos

How can I use custom python library in Azure Databricks?

I am trying to access functions in my coreapi.py by importing it in the main notebook, but I get the error ModuleNotFoundError: No module named 'coreapi'. I tried uploading the file into the same folder, and I tried creating a Python egg and uploading it...

Latest Reply
-werners-
Esteemed Contributor III
  • 5 kudos

There is also the possibility to use the Repos files functionality: https://databricks.com/blog/2021/10/07/databricks-repos-is-now-generally-available.html

2 More Replies
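
When the helper module lives in a Databricks Repo next to the notebook, a common pattern is to put the repo folder on sys.path and import it like any other module. A sketch with an illustrative path and a hypothetical function name:

```
import sys

# Illustrative path to the repo checkout that contains coreapi.py
sys.path.append("/Workspace/Repos/me@example.com/my-repo")

import coreapi                 # resolves once the folder is on sys.path
coreapi.some_function()        # hypothetical function inside coreapi.py
```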
RKNutalapati
by Valued Contributor
  • 2737 Views
  • 4 replies
  • 4 kudos

Reading and saving BLOB data from Oracle to S3 via Databricks is slow

I am trying to import a table from Oracle which has around 1.3 million rows, and one of the columns is a BLOB; the total size of the data on Oracle is around 250+ GB. Reading and saving to S3 as a Delta table is taking around 60 min. I tried with parallel (200 thread...

Latest Reply
User16829050420
New Contributor III
  • 4 kudos

Hello @Rama Krishna N, we will need to check the task on the Spark UI to validate whether the operation is a read from the Oracle database or a write into S3. The task should show the specific operation in the UI. Also, the active threads on the Spark UI will ...

3 More Replies
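
For context, the usual levers for speeding up a JDBC import like this are partitioned reads and a larger fetch size; a minimal sketch with illustrative connection details (the option names are standard Spark JDBC options):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative Oracle connection; tune numPartitions and fetchsize to the source DB
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
      .option("dbtable", "MYSCHEMA.BLOB_TABLE")
      .option("user", "my_user")
      .option("password", "my_password")
      .option("partitionColumn", "ID")   # numeric key to split the read on
      .option("lowerBound", "1")
      .option("upperBound", "1300000")
      .option("numPartitions", "64")
      .option("fetchsize", "1000")
      .load())

# Land it as a Delta table on S3 (path is illustrative)
df.write.format("delta").mode("overwrite").save("s3://my-bucket/blob_table_delta")
```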