Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

GuidoPereyra_
by New Contributor II
  • 7979 Views
  • 2 replies
  • 0 kudos

Databricks Delta - UPDATE error

Hi, we got the following error when we tried to UPDATE a Delta table from concurrent notebooks that all end with an update to the same table: "com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added matching 'true' by a ...

Latest Reply
GuidoPereyra_
New Contributor II
  • 0 kudos

Hi @matt@direction.consulting, I just found the following doc: https://docs.azuredatabricks.net/delta/isolation-level.html#set-the-isolation-level. In my case, I could fix it by partitioning the table, and I think that is the only way for concurrent update in t...

1 More Replies
kali_tummala
by New Contributor II
  • 11065 Views
  • 5 replies
  • 0 kudos

Why is Databricks Spark faster than AWS EMR Spark?

https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html Hi all, just wondering why Databricks Spark is a lot faster on S3 compared with AWS EMR Spark. Both systems are on Spark version 2.4. Does Databricks have ...

Latest Reply
RafiKurlansik
Databricks Employee
  • 0 kudos

I think you can get some pretty good insight into the optimizations on Databricks here: https://docs.databricks.com/delta/delta-on-databricks.html Specifically, check out the sections on caching, z-ordering, and join optimization. There's also a grea...

4 More Replies
DanielAnderson
by New Contributor
  • 7073 Views
  • 1 reply
  • 0 kudos

"AmazonS3Exception: The bucket is in this region" error

I have read access to an S3 bucket in an AWS account that is not mine. For more than a year I've had a job successfully reading from that bucket using dbutils.fs.mount(...) and sqlContext.read.json(...). Recently the job started failing with the exc...

Latest Reply
Chandan
New Contributor II
  • 0 kudos

@andersource Looks like the bucket is in us-east-1 but you've configured your Amazon S3 client with us-west-2. Can you try configuring the client to use us-east-1? I hope it will work for you. Thank you.

User16301465121
by New Contributor
  • 11412 Views
  • 3 replies
  • 0 kudos

How can I exit from a Notebook which is used as a job?

How can I quit from a notebook in the middle of an execution based on some condition?

Latest Reply
SamsonXia
New Contributor II
  • 0 kudos

exit(value: String): void. Calling dbutils.notebook.exit in a job causes the notebook to complete successfully. If you want to cause the job to fail, throw an exception.

2 More Replies
_not_provid1755
by New Contributor
  • 7477 Views
  • 3 replies
  • 0 kudos

Write empty dataframe into csv

I'm writing my output (entity) data frame into a CSV file. The statement below works well when the data frame is non-empty. entity.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", "true").save(tempLocation) It's not working wh...

Latest Reply
mrnov
New Contributor II
  • 0 kudos

The same problem here (similar code and the same behavior with Spark 2.4.0, running with spark-submit on Windows and on Linux): dataset.coalesce(1) .write() .option("charset", "UTF-8") .option("header", "true") .mode(SaveMod...

2 More Replies
rishigc
by New Contributor
  • 18519 Views
  • 1 reply
  • 0 kudos

Split a row into multiple rows based on a column value in Spark SQL

Hi, I am trying to split a record in a table to 2 records based on a column value. Please refer to the sample below. The input table displays the 3 types of Product and their price. Notice that for a specific Product (row) only its corresponding col...

Latest Reply
mathan_pillai
Databricks Employee
  • 0 kudos

Hi @rishigc, you can use something like the below. SELECT explode(arrays_zip(split(Product, '+'), split(Price, '+'))) as product_and_price from df or df.withColumn("product_and_price", explode(arrays_zip(split(Product, '+'), split(Price, '+')))).select( ...
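What the query in this reply is meant to compute can be sketched in plain Python, which runs without a cluster. The sample values and the function name `split_row` are hypothetical, since the original table is not shown in the preview. One caveat worth checking: Spark's `split` interprets its pattern argument as a regular expression, and `+` is a regex quantifier, so a literal plus sign probably needs to be written as `'[+]'` or escaped. The sketch makes the same point with Python's `re.escape`.

```python
import re
from itertools import zip_longest


def split_row(product: str, price: str, delimiter: str = "+"):
    """Turn one delimited row into a list of (product, price) pairs."""
    # "+" means "one or more" in a regex, so escape it to match it literally.
    pattern = re.escape(delimiter)
    products = re.split(pattern, product)
    prices = re.split(pattern, price)
    # Pair the N-th product with the N-th price; if one side is shorter,
    # the missing value becomes None in this sketch.
    return list(zip_longest(products, prices))


# Hypothetical sample row: two products and their prices in one record.
rows = split_row("A+B", "100+200")
```

Each tuple in `rows` corresponds to one output record, which is the role `explode` plays in the SQL version.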

siddhu308
by New Contributor II
  • 7030 Views
  • 2 replies
  • 0 kudos

column wise sum in PySpark dataframe

I have a DataFrame of 18,000,000 rows and 1,322 columns with '0' and '1' values. I want to find how many '1's are in every column. Below is the dataset: se_00001 se_00007 se_00036 se_00100 se_0010p se_00250

Latest Reply
mathan_pillai
Databricks Employee
  • 0 kudos

Hi Siddhu, you can use df.select(sum("col1"), sum("col2"), sum("col3")), where col1, col2, and col3 are the names of the columns you would like to sum. Please let us know if it answers your question. Thanks.
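With 1,322 columns, typing each `sum(...)` by hand is impractical, so the select list would normally be generated from the column names. The sketch below builds an equivalent SQL statement as a string. The function name `build_column_sum_query` and the view name `dataset` are assumptions for illustration, and the `CAST` is there only because the question quotes the values as '0' and '1', which suggests they may be stored as strings.

```python
def build_column_sum_query(columns, table):
    """Build one SELECT that sums every listed column of `table`."""
    if not columns:
        raise ValueError("at least one column name is required")
    # Backticks quote the column names; the cast covers string-typed 0/1 values.
    select_list = ", ".join(
        f"SUM(CAST(`{name}` AS BIGINT)) AS `{name}`" for name in columns
    )
    return f"SELECT {select_list} FROM {table}"


# Column names taken from the question; "dataset" is a hypothetical view name.
query = build_column_sum_query(["se_00001", "se_00007", "se_00036"], "dataset")
```

In a notebook the column list would presumably come from `df.columns`, and the resulting string could be run with `spark.sql(query)` after registering the DataFrame as a temporary view.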

1 More Replies
Pascalvan_Belle
by New Contributor
  • 9457 Views
  • 1 reply
  • 0 kudos

How to create a surrogate key sequence which I can use in SCD cases?

Hi Community, I would like to know if there is an option to create an integer sequence which persists even if the cluster is shut down. My target is to use this integer value as a surrogate key to join different tables or do Slowly changing dimensio...

Latest Reply
girivaratharaja
New Contributor III
  • 0 kudos

Hi @pascalvanbellen, there is no concept of FK, PK, SK in Spark. But Databricks Delta automatically takes care of SCD type scenarios. https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html#slowly-changing-data-scd-type-2 ...

srchella
by New Contributor
  • 4001 Views
  • 1 reply
  • 0 kudos

How to take distinct of multiple columns (more than 2 columns) in a PySpark dataframe?

I have 10+ columns and want to take distinct rows, taking multiple columns into consideration. How can I achieve this using PySpark dataframe functions?

Latest Reply
Sandeep
Contributor III
  • 0 kudos

You can use dropDuplicates https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=distinct#pyspark.sql.DataFrame.dropDuplicates

cfregly
by Contributor
  • 21494 Views
  • 15 replies
  • 0 kudos
Latest Reply
wildhogg
New Contributor II
  • 0 kudos

Well, after a little bit of research, I found the post below. Hopefully this will help. "registerTempTable(): registerTempTable() creates an in-memory table that is scoped to the cluster in which it was created. The data is stored using Hive's high...

14 More Replies
DavidWrench
by New Contributor II
  • 20356 Views
  • 4 replies
  • 0 kudos

Displaying HTML Output

I am trying to display the HTML output, or read in an HTML file to display, in a Databricks notebook from pandas-profiling. import pandas as pd import pandas_profiling df = pd.read_csv("/dbfs/FileStore/tables/my_data.csv", header='infer', parse_dates=Tru...

Latest Reply
Bendu_Preez
New Contributor II
  • 0 kudos

What eventually worked for me was displayHTML(profile.to_html()) for the pandas_profiling and displayHTML(profile.html) for the spark_profiling.

3 More Replies
AdamArold
by New Contributor
  • 6467 Views
  • 4 replies
  • 0 kudos

How can I integrate Databricks into PyCharm?

Editing notebooks on Databricks is rather cumbersome because it lacks a lot of the features that IDEs like PyCharm have. Another problem is that a Databricks notebook comes with some local state which is not present on my computer. How can I edit notebooks ...

Latest Reply
SimonD_Morias
New Contributor II
  • 0 kudos

The docs are out for databricks-connect: https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html I've also written up a few limitations I have found, some with workarounds: https://datathirst.net/blog/2019/3/7/databricks-co...

3 More Replies
microamp
by New Contributor II
  • 14627 Views
  • 12 replies
  • 0 kudos

Azure Data Lake Config Issue: No value for dfs.adls.oauth2.access.token.provider found in conf file.

Hi, I have files hosted on an Azure Data Lake Store which I can connect to from Azure Databricks, configured as per the instructions here. I can read JSON files fine; however, I'm getting the following error when I try to read an Avro file. spark.read.format("c...

Latest Reply
User16301467523
New Contributor II
  • 0 kudos

Taras's answer is correct. Because spark-avro is based on the RDD APIs, the properties must be set in the hadoopConfiguration options. Please note these docs for configuration using the RDD API: https://docs.azuredatabricks.net/spark/latest/data-sou...

11 More Replies
PranjalThapar
by New Contributor
  • 8817 Views
  • 4 replies
  • 0 kudos

Splitting Date into Year, Month and Day, with inconsistent delimiters

I am trying to split my Date column, which is a String type right now, into 3 columns: Year, Month and Day. I use (PySpark): split_date=pyspark.sql.functions.split(df['Date'], '-') df= df.withColumn('Year', split_date.getItem(0)) df= df.wit...

Latest Reply
youssefassouli
New Contributor II
  • 0 kudos

Thank you so much, that was helpful.
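For the question in this thread, one common approach to inconsistent delimiters is to split on a regular-expression character class instead of a single character. The sketch below shows the idea in plain Python. The delimiter set `-`, `/`, `.` and the helper name `split_date` are assumptions, because the preview does not show which delimiters occur in the data. Since `pyspark.sql.functions.split` also takes a regular expression as its pattern, the same character class could likely be passed there in place of `'-'`.

```python
import re

# Assumed set of delimiters; adjust to whatever actually occurs in the data.
DELIMITERS = r"[-/.]"


def split_date(value: str):
    """Split a year-month-day string on any of the assumed delimiters."""
    parts = re.split(DELIMITERS, value.strip())
    if len(parts) != 3:
        raise ValueError(f"expected three date parts, got {parts!r}")
    year, month, day = parts
    return year, month, day


# Hypothetical sample values with different delimiters.
examples = [split_date(s) for s in ("2019-03-07", "2019/03/07", "2019.3.7")]
```

The parts stay as strings here, matching the question's `getItem(0)` approach, and could be cast to integers afterwards if numeric columns are wanted.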

3 More Replies
dan11
by New Contributor II
  • 5031 Views
  • 4 replies
  • 1 kudos

sql delete?

Hello Databricks people, I started working with Databricks today. I have a SQL script which I developed with sqlite3 on a laptop. I want to port the script to Databricks. I started with two SQL statements: select count(prop_id) from prop0; del...

Latest Reply
Bill_Chambers
Contributor II
  • 1 kudos

Hey Dan, good to hear you're getting started with Databricks. This is not a limitation of Databricks; it's a restriction built into Spark itself. Spark is not a data store; it's a distributed computation framework. Therefore deleting data would be un...

3 More Replies
