Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

brickster_2018
by Databricks Employee
  • 7781 Views
  • 2 replies
  • 0 kudos

Resolved! How does Delta solve the problem of large numbers of small files?

Delta creates more small files during merge and update operations.

Latest Reply
brickster_2018
Databricks Employee

Delta solves the problem of a large number of small files using the operations below, which are available for a Delta table. Optimized writes help optimize the write operation by adding an additional shuffle step and reducing the number of output files. By defau...
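For illustration, a minimal sketch of those two operations, assuming a hypothetical Delta table named events and a hypothetical ZORDER column event_date:

# Enable optimized writes (and auto compaction) on one table:
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

# Compact the small files that already exist, clustering by a common filter column:
spark.sql("OPTIMIZE events ZORDER BY (event_date)")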

1 More Replies
ranged_coop
by Valued Contributor II
  • 21434 Views
  • 22 replies
  • 28 kudos

How to install Chromium Browser and Chrome Driver on DBX runtime 10.4 and above?

Hi Team, we are wondering if there is a recommended way to install the Chromium browser and Chrome driver on Databricks Runtime 10.4 and above? I have been through the site and have come across several links to this effect, but they all seem to be ins...

Latest Reply
Kaizen
Valued Contributor

Look into Playwright instead of Selenium. I went through the same process y'all went through here (ended up writing an init script to install the drivers etc.). This is all done for you in Playwright. Refer to this post, I hope it helps!! https://communit...
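For context, a minimal hedged sketch of driving headless Chromium with Playwright's Python API on a cluster (it assumes %pip install playwright has run and that playwright install chromium / playwright install-deps were executed, e.g. via an init script; the URL is only an example):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Clusters have no display, so the browser must run headless
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()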

21 More Replies
seefoods
by New Contributor III
  • 1096 Views
  • 2 replies
  • 0 kudos

Cluster metrics collection

Hello @Debayan, how can I collect the metrics provided by cluster metrics for Databricks Runtime 13.1 or later using a bash shell script? Cordially, Aubert EMAKO

Latest Reply
Debayan
Databricks Employee

Hi, Cluster metrics is a UI tool and is available in the UI only. For reference: https://docs.databricks.com/en/compute/cluster-metrics.html

1 More Replies
chari
by Contributor
  • 7121 Views
  • 2 replies
  • 0 kudos

Writing a Spark dataframe as CSV to a repo

Hi, I wrote a Spark dataframe as CSV to a repo (synced with GitHub). But when I checked the folder, the file wasn't there. Here is my code: spark_df.write.format('csv').option('header','true').mode('overwrite').save('/Repos/abcd/mno/data') No error mes...

Latest Reply
feiyun0112
Honored Contributor

The folder 'Repos' is not your repo, it's `dbfs:/Repos`. Please check: dbutils.fs.ls('/Repos/abcd/mno/data')
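To make the distinction concrete, a hedged sketch (the out.csv file name is hypothetical, and the /Workspace path assumes a recent runtime): the write above landed in DBFS, while a file that should show up in the repo has to go through the workspace filesystem.

# Where the CSV most likely ended up (DBFS, not the workspace repo):
display(dbutils.fs.ls('dbfs:/Repos/abcd/mno/data'))

# One way to place a small CSV inside the repo folder instead:
spark_df.toPandas().to_csv('/Workspace/Repos/abcd/mno/data/out.csv', index=False)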

1 More Replies
Salman1
by New Contributor
  • 1043 Views
  • 0 replies
  • 0 kudos

Cannot find UDF on subsequent job runs on same cluster.

Hello, I am trying to run jobs with a JAR task type using Databricks on AWS on an all-purpose cluster. The issue I'm facing is that the job completes its first run successfully, but any subsequent runs fail. I have to restart my cluste...

chari
by Contributor
  • 3403 Views
  • 2 replies
  • 0 kudos

Fatal error when writing a big pandas DataFrame

Hello DB community, I was trying to write a pandas dataframe containing 100000 rows as Excel. Moments into the execution I received a fatal error: "Python kernel is unresponsive." However, I am constrained from increasing the number of clusters or other...

Labels: Data Engineering, Databricks, excel, python
Latest Reply
Ayushi_Suthar
Databricks Employee

Hi @chari, thanks for bringing up your concerns, always happy to help. We understand that you are facing the following error while writing a pandas dataframe containing 100000 rows to Excel. As per the error >>> Fatal error: The Python kernel ...
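One common workaround, shown here as a sketch rather than as the (truncated) official answer, is to stream rows to disk instead of building the whole workbook in memory; it assumes the xlsxwriter package is installed and that df holds the 100000 rows:

import pandas as pd

# constant_memory makes xlsxwriter write row by row instead of buffering the sheet
with pd.ExcelWriter(
    "/tmp/output.xlsx",
    engine="xlsxwriter",
    engine_kwargs={"options": {"constant_memory": True}},
) as writer:
    df.to_excel(writer, sheet_name="data", index=False)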

1 More Replies
Yaacoub
by New Contributor
  • 9282 Views
  • 2 replies
  • 1 kudos

[UDF_MAX_COUNT_EXCEEDED] Exceeded query-wide UDF limit of 5 UDFs

In my project I defined a UDF:

@udf(returnType=IntegerType())
def ends_with_one(value, bit_position):
    if bit_position + len(value) < 0:
        return 0
    else:
        return int(value[bit_position] == '1')

spark.udf.register("ends_with_one"...

Latest Reply
jose_gonzalez
Databricks Employee

Hi @Yaacoub, Just a friendly follow-up. Have you had a chance to review my colleague's reply? Please inform us if it contributes to resolving your query.

1 More Replies
abelian-grape
by New Contributor II
  • 7620 Views
  • 4 replies
  • 0 kudos

Intermittent error: Databricks job kept running

Hi, I have the following error, but the job kept running. Is that normal?

{
  "message": "The service at /api/2.0/jobs/runs/get?run_id=899157004942769 is temporarily unavailable. Please try again later. [TraceId: -]",
  "error_code": "TEMPORARILY_U...

Latest Reply
abelian-grape
New Contributor II

@Ayushi_Suthar Also, whenever it happens the job status does not change to "failed"; it keeps running. Is that normal?

3 More Replies
joao_vnb
by New Contributor III
  • 58783 Views
  • 7 replies
  • 11 kudos

Resolved! Automate the Databricks workflow deployment

Hi everyone, do you guys know if it's possible to automate the Databricks workflow deployment through Azure DevOps (like what we do with the deployment of notebooks)?

Latest Reply
asingamaneni
New Contributor II

Did you get a chance to try Brickflow? https://github.com/Nike-Inc/brickflow You can find the documentation here: https://engineering.nike.com/brickflow/v0.11.2/ Brickflow uses Databricks Asset Bundles (DAB) under the hood but provides a Pythonic w...

6 More Replies
isaac_gritz
by Databricks Employee
  • 8121 Views
  • 1 reply
  • 2 kudos

Change Data Capture with Databricks

How to leverage Change Data Capture (CDC) from your databases to Databricks. Change Data Capture allows you to ingest and process only changed records from database systems to dramatically reduce data processing costs and enable real-time use cases suc...
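As a concrete illustration of the pattern, a hedged sketch of applying a batch of captured changes to a Delta table with the Python Delta Lake API (the table name, join key, and op column are hypothetical):

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.customers")
(target.alias("t")
    .merge(changes_df.alias("c"), "t.id = c.id")   # changes_df: captured CDC rows
    .whenMatchedDelete(condition="c.op = 'DELETE'")
    .whenMatchedUpdateAll(condition="c.op = 'UPDATE'")
    .whenNotMatchedInsertAll()
    .execute())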

Latest Reply
prasad95
New Contributor III

Hi @isaac_gritz, can you provide any reference resource for achieving AWS DynamoDB CDC to Delta tables? Thank you.

DatBoi
by Contributor
  • 5720 Views
  • 2 replies
  • 1 kudos

Resolved! What happens to table created with CTAS statement when data in source table has changed

Hey all - I am sure this has been documented / answered before but what happens to a table created with a CTAS statement when data in the source table has changed? Does the sink table reflect the changes? Or is the data stored when the table is defin...

Latest Reply
SergeRielau
Databricks Employee

CREATE TABLE AS (CTAS) is a "one and done" kind of statement. The new table retains no memory of how it came to be, so it is oblivious to changes in the source. Views, as you say, are stored queries; no data is persisted. And therefore the query...
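A small illustration of that difference (the table, view, and source names are hypothetical, and source_tbl is assumed to exist):

spark.sql("CREATE TABLE t_snapshot AS SELECT * FROM source_tbl")  # copies the data once
spark.sql("CREATE VIEW v_live AS SELECT * FROM source_tbl")       # stores only the query

spark.sql("INSERT INTO source_tbl VALUES (42)")

spark.sql("SELECT COUNT(*) FROM t_snapshot").show()  # unchanged: no link to the source
spark.sql("SELECT COUNT(*) FROM v_live").show()      # includes the new row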

1 More Replies
Dhruv-22
by New Contributor III
  • 10046 Views
  • 4 replies
  • 1 kudos

Resolved! Managed table overwrites existing location for delta but not for oth

I am working on Azure Databricks, with the Databricks Runtime version being 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12). I am facing the following issue. Suppose I have a view named v1 and a database f1_processed created from the following comman...

Latest Reply
Red_blue_green
New Contributor III

Hi, this is how the Delta format works. With overwrite you are not deleting the files in the folder or replacing them; Delta creates new files with the overwritten schema and data. This way you are also able to return to former versions of the del...
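A hedged sketch of what that looks like in practice (the table name results is hypothetical; the database f1_processed is from the thread):

# Overwrite writes new data files; the old ones are only logically removed
df.write.format("delta").mode("overwrite").saveAsTable("f1_processed.results")

# So earlier versions remain queryable through the transaction log:
spark.sql("DESCRIBE HISTORY f1_processed.results").show()
spark.sql("SELECT * FROM f1_processed.results VERSION AS OF 0").show()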

3 More Replies
sanjay
by Valued Contributor II
  • 12494 Views
  • 1 reply
  • 0 kudos

pyspark dropDuplicates performance issue

Hi, I am trying to delete duplicate records found by key, but it's very slow. It's a continuously running pipeline, so the data is not that huge, but it still takes time to execute this command: df = df.dropDuplicates(["fileName"]) Is there any better approach to d...
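One commonly suggested alternative, sketched here under stated assumptions (fileName is the key from the post; the ordering column ingest_time is hypothetical): rank rows per key with a window function, or, if this is a streaming pipeline, bound the dedup state with a watermark.

from pyspark.sql import functions as F, Window

# Batch: keep the newest row per fileName instead of an unordered dropDuplicates
w = Window.partitionBy("fileName").orderBy(F.col("ingest_time").desc())
dedup = (df.withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# Streaming: a watermark keeps the dedup state from growing without bound
# dedup = df.withWatermark("ingest_time", "1 hour").dropDuplicates(["fileName"])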

