Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

AP
by New Contributor III
  • 4073 Views
  • 5 replies
  • 3 kudos

Resolved! AutoOptimize, OPTIMIZE command and VACUUM command: order, production implementation best practices

So Databricks gives us a great toolkit in the form of the optimization and vacuum commands. But in terms of operationalizing them, I am really confused about the best practice. Should we enable "optimized writes" by setting the following at a workspace level? spark.conf.set...
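
A minimal sketch of how these pieces typically fit together, assuming a Delta table named events (the table name and the 168-hour retention window are placeholder assumptions):

# Sketch: enable optimized writes for the session, then compact and clean up.
# The table name `events` and the retention window are placeholders.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.sql("OPTIMIZE events")                  # compact small files
spark.sql("VACUUM events RETAIN 168 HOURS")   # drop files outside the retention window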

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@AKSHAY PALLERLA just checking in to see if you got a solution to the issue you shared above. Let us know! Thanks to @Werner Stinckens for jumping in, as always!

4 More Replies
Jayesh
by New Contributor III
  • 2653 Views
  • 5 replies
  • 3 kudos

Resolved! How can we do data copy from Databricks SQL using notebook?

Hi Team, we have a scenario where we have to connect to Databricks SQL instance 1 from another Databricks instance 2 using a notebook or Azure Data Factory. Can you please help?
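
One common pattern (a sketch, not necessarily the approach accepted in this thread) is to read from the other workspace over JDBC with the Databricks driver; the hostname, HTTP path, token, and table name below are placeholders:

# Sketch: read a table from another workspace over JDBC.
# <server-hostname>, <http-path>, <pat-token>, and the table are placeholders.
url = ("jdbc:databricks://<server-hostname>:443/default;"
       "transportMode=http;ssl=1;httpPath=<http-path>;"
       "AuthMech=3;UID=token;PWD=<pat-token>")
df = (spark.read.format("jdbc")
      .option("url", url)
      .option("dbtable", "my_schema.my_table")
      .load())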

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Thanks for jumping in to help, @Arvind Ravish, @Hubert Dudek, and @Artem Sheiko!

4 More Replies
Jeade
by New Contributor II
  • 2913 Views
  • 3 replies
  • 1 kudos

Resolved! Pulling data from Azure Boards into Databricks

Looking for best practices/examples on how to pull data (epics, features, PBIs) from Azure Boards into Databricks for analysis. Any ideas/help appreciated!

Latest Reply
artsheiko
Databricks Employee
  • 1 kudos

You can use Export to CSV (link), push the file to storage mounted to Databricks, or simply import the exported file into DBFS.
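
For the DBFS route, a minimal sketch of reading such an export (the file path is a placeholder):

# Sketch: read an Azure Boards CSV export uploaded to DBFS.
# The path is a placeholder for wherever the export lands.
df = (spark.read
      .option("header", "true")
      .csv("dbfs:/FileStore/azure_boards_export.csv"))
display(df)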

2 More Replies
cralle
by New Contributor II
  • 5843 Views
  • 7 replies
  • 2 kudos

Resolved! Cannot display DataFrame when I filter by length

I have a DataFrame that I have created based on a couple of datasets and multiple operations. The DataFrame has multiple columns, one of which is an array of strings. But when I take the DataFrame and try to filter based upon the size of this array co...
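
For reference, filtering on array length is normally done with size(); a minimal sketch with a made-up DataFrame:

from pyspark.sql import functions as F

# Sketch: keep rows whose `tags` array has more than one element.
# The DataFrame and column names are made up for illustration.
df = spark.createDataFrame([(1, ["a", "b"]), (2, ["c"])], ["id", "tags"])
df.filter(F.size("tags") > 1).show()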

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

Strange, it works fine here. What version of Databricks are you on? What you could do to identify the issue is to output the query plan (.explain). Creating a new df for each transformation could also help; that way you can check step by step where...

6 More Replies
tej1
by New Contributor III
  • 3605 Views
  • 5 replies
  • 7 kudos

Resolved! Trouble accessing `_metadata` column using cloudFiles in Delta Live Tables

We are building a Delta Live Tables pipeline where we ingest CSV files from AWS S3 using cloudFiles, and we need to access each file's modification timestamp. As documented here, we tried selecting the `_metadata` column in a task in the Delta Live p...
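
For context, selecting the column inside a DLT source looks roughly like this (a sketch; the bucket path and table name are placeholders, and it requires a runtime new enough to expose _metadata):

import dlt
from pyspark.sql.functions import col

# Sketch: surface the file modification timestamp via _metadata.
# The S3 path and table name are placeholders.
@dlt.table
def raw_events():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .load("s3://my-bucket/landing/")
            .select("*", col("_metadata.file_modification_time").alias("file_modified_at")))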

Latest Reply
tej1
New Contributor III
  • 7 kudos

Update: We were able to test the `_metadata` column feature in DLT "preview" mode (which is DBR 11.0). Databricks doesn't recommend running production workloads in "preview" mode, but nevertheless, we're glad to be using this feature in DLT.

4 More Replies
alexgv12
by New Contributor III
  • 2609 Views
  • 2 replies
  • 3 kudos

Delta table: separate gold zones for different tenants

Hello, we currently have a process that builds the bronze and silver zones with Delta tables. When data reaches gold, we must create specific zones for each client because the schema changes; for this we create separate databases and tables, but when ...

Latest Reply
Noopur_Nigam
Databricks Employee
  • 3 kudos

Hi @alexander grajales vanegas, are you creating all the databases and tables in the gold zone manually? If so, please check out DLT (https://docs.databricks.com/data-engineering/delta-live-tables/index.html); it will take care of your complete pipeline by ...
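
A common DLT pattern for per-tenant gold tables is to generate them in a loop (a sketch; the tenant list, silver table, and filter column are placeholder assumptions):

import dlt

# Sketch: one gold table per tenant from a shared silver table.
# Tenant names, the silver table, and the tenant column are placeholders.
tenants = ["acme", "globex"]

def make_gold(tenant):
    @dlt.table(name=f"gold_{tenant}")
    def gold():
        return dlt.read("silver_events").where(f"tenant = '{tenant}'")

for t in tenants:
    make_gold(t)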

1 More Replies
GKKarthi
by New Contributor
  • 5050 Views
  • 6 replies
  • 2 kudos

Resolved! Databricks - Simba SparkJDBCDriver 500550 exception

We have a Denodo big data platform hosted on Databricks. Recently we have been facing an exception with the message '[Simba][SparkJDBCDriver](500550)', which interrupts the Databricks connection after a certain time interval, usuall...

Latest Reply
PFBOLIVEIRA
New Contributor II
  • 2 kudos

Hi all, we are also experiencing the same behavior: [Simba][SimbaSparkJDBCDriver] (500550) The next rowset buffer is already marked as consumed. The fetch thread might have terminated unexpectedly. Foreground thread ID: xxxx. Background thread ID: yyyy...

5 More Replies
pankaj92
by New Contributor II
  • 4346 Views
  • 4 replies
  • 0 kudos

Extract latest files from an ADLS Gen2 mount point in Databricks using PySpark

Hi Team, I am trying to get the latest files from an ADLS mount point directory. I am not sure how to extract the latest files by last modified date using PySpark from an ADLS Gen2 storage account. Please let me know. Thanks! I am looking forward to your re...

Latest Reply
Sha_1890
New Contributor III
  • 0 kudos

Hi @pankaj92, I wrote Python code to pick the latest file from the mnt location:

import os

path = "/dbfs/mnt/xxxx"
filelist = []
for file_item in os.listdir(path):
    filelist.append(file_item)
file = len(filelist)
print(filelist[file - 1])

Thanks
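
Note that os.listdir order is not guaranteed to track modification time; a sketch that sorts by the actual timestamp (the path is a placeholder):

import os

# Sketch: pick the most recently modified file under a mount point.
# The path is a placeholder; /dbfs is the FUSE mount visible to plain Python.
path = "/dbfs/mnt/xxxx"
files = [os.path.join(path, f) for f in os.listdir(path)]
print(max(files, key=os.path.getmtime))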

3 More Replies
ivanychev
by Contributor II
  • 7463 Views
  • 5 replies
  • 2 kudos

Resolved! How to find out why the cluster is in PENDING state for so long?

I'm using Databricks on AWS. Our clusters are typically in the PENDING state for 5-8 minutes after they are created. I would like to find out why (EC2 instance provisioning? Slow Docker image download? ...?). The cluster logs are not helpful enough be...

Latest Reply
Prabakar
Databricks Employee
  • 2 kudos

Hi @Sergey Ivanychev, while the cluster is starting you can see its status on the compute page. Hover the mouse pointer over the green rotating circle to the left of the cluster name; it will give a notification of what is happening on the cluster. Wh...
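
The same events can also be pulled programmatically from the Clusters API (a sketch; host, token, and cluster ID are placeholders):

import requests

# Sketch: list recent lifecycle events for a cluster via the REST API.
# HOST, TOKEN, and the cluster ID are placeholders.
HOST = "https://<workspace-host>"
TOKEN = "<pat-token>"
resp = requests.post(
    f"{HOST}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": "<cluster-id>", "limit": 25},
)
for event in resp.json().get("events", []):
    print(event["timestamp"], event["type"])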

4 More Replies
118004
by New Contributor II
  • 1848 Views
  • 1 reply
  • 2 kudos

Resolved! Installing pdpbox plugin on cluster

Hello, we are having issues installing the pdpbox library on a fresh cluster, including trying to upload and install a whl file, and using pip in a notebook. I have attached an example of an error received. Can anybody assist with installing the...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

PDPbox is updated rarely, and it requires an older version of matplotlib (3.1.1): https://github.com/SauceCat/PDPbox. It tries to install but fails because matplotlib requires pkgconfig. The solution is to use the Machine Learning runtime. There it will...
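
If you do need it on a standard runtime, one workaround sketch is to pin the matplotlib dependency first (the pin follows the version mentioned above; not guaranteed to work on every runtime):

%pip install matplotlib==3.1.1
%pip install pdpbox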

PSY
by New Contributor III
  • 4527 Views
  • 5 replies
  • 2 kudos

Resolved! Updating git token fails

When updating an expired Azure DevOps personal access token (PAT) for Git integration, I get the error message "Failed to save. Please try again." The error persists with different tokens. Previously (months ago), updating the token did not result i...

Latest Reply
Atanu
Databricks Employee
  • 2 kudos

Is this happening for all users, @Pencho Yordanov?

4 More Replies
al_joe
by Contributor
  • 4949 Views
  • 3 replies
  • 4 kudos

Resolved! Can I use Databricks CLI with community edition?

I installed the CLI but am unable to configure it to connect to my instance, as I cannot find the "Generate Access tokens" option under the User Settings page. The documentation does not say whether this feature is disabled for Community Edition.

Latest Reply
Prabakar
Databricks Employee
  • 4 kudos

Hi @Al Jo, we understand your interest in learning Databricks. However, the Community Edition is limited in features; certain features are available only in the paid version. If you are interested in using the full features, then I would suggest you g...

2 More Replies
Ryan512
by New Contributor III
  • 1548 Views
  • 2 replies
  • 2 kudos

Autoloader (GCP) Custom PubSub Queue

I want to know if what I describe below is possible with Auto Loader on the Google Cloud Platform. Problem description: we have GCS buckets for every client/account. Inside these buckets is a path/blob for each client's instances of our platform. A clie...
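
For context, file-notification mode on GCP is switched on like this (a sketch; the bucket path is a placeholder, and whether a pre-existing custom Pub/Sub queue can be plugged in is exactly what the question asks):

# Sketch: Auto Loader in file-notification mode on GCP.
# The path is a placeholder; by default Auto Loader provisions
# its own Pub/Sub resources when notifications are enabled.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")
      .load("gs://my-bucket/client-a/"))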

Latest Reply
Noopur_Nigam
Databricks Employee
  • 2 kudos

Hello @Ryan Ebanks, please let us know if more help is needed on this.

1 More Replies
laus
by New Contributor III
  • 7986 Views
  • 6 replies
  • 3 kudos

Resolved! How to load a JSON file in PySpark with a colon character in the file name

Hi, I'm trying to load this JSON file, which contains a colon character in its name: file_name.2022-03-05_11:30:00.json, but I get the error in the screenshot below saying that there is a relative path in an absolute URI. Any idea how to read this file...
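
Hadoop treats the colon as a scheme separator in paths, so one workaround sketch is to copy the file to a colon-free name through the /dbfs FUSE mount before reading it (the paths are placeholders):

import shutil

# Sketch: copy around the colon via the /dbfs FUSE mount, then read.
# Paths are placeholders; plain Python I/O avoids Hadoop path parsing.
shutil.copy("/dbfs/mnt/data/file_name.2022-03-05_11:30:00.json",
            "/dbfs/mnt/data/file_name.2022-03-05_113000.json")
df = spark.read.json("dbfs:/mnt/data/file_name.2022-03-05_113000.json")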

Latest Reply
Noopur_Nigam
Databricks Employee
  • 3 kudos

Hi @Laura Blancarte, I hope that @Pearl Ubaru's answer helped you in resolving your issue. Please let us know if you need more help on this.

5 More Replies
AP
by New Contributor III
  • 2416 Views
  • 2 replies
  • 2 kudos

How can we connect to the Databricks-managed metastore?

Hi, I am trying to take advantage of the treasure trove of information that the metastore contains and take some actions to improve performance. In my case, the metastore is managed by Databricks; we don't use an external metastore. How can I connect to ...
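
Short of connecting to the metastore database itself, much of that metadata is already queryable from a notebook (a sketch; the table name is a placeholder, and DESCRIBE DETAIL applies to Delta tables):

# Sketch: inspect table metadata through Spark SQL.
spark.sql("SHOW TABLES IN default").show()
spark.sql("DESCRIBE DETAIL default.my_table").show(truncate=False)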

Latest Reply
Prabakar
Databricks Employee
  • 2 kudos

@AKSHAY PALLERLA you can get the JDBC/ODBC information from the cluster configuration. On the cluster configuration page, under Advanced Options, there is a JDBC/ODBC tab. Click on that tab and it should give you the details you are looking ...

1 More Replies

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.
