Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by PradeepRavi, New Contributor III
  • 34321 Views
  • 6 replies
  • 10 kudos

How do I prevent _success and _committed files in my write output?

Is there a way to prevent the _success and _committed files in my output? It's a tedious task to navigate to all the partitions and delete the files. Note: the final output is stored in Azure ADLS.

Latest Reply
shan_chandra
Databricks Employee
  • 10 kudos

Please find the below steps to remove the _SUCCESS, _committed, and _started files: set spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false") to remove the _SUCCESS file, then run the VACUUM command multiple times until the _committed and _started files...
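
For illustration, a minimal sketch of the steps described in the reply, assuming a Databricks notebook (where spark is predefined) and a hypothetical Delta output path on ADLS:

    # Suppress the _SUCCESS marker file for directory commits.
    spark.conf.set("spark.databricks.io.directoryCommit.createSuccessFile", "false")

    df = spark.range(10)  # stand-in DataFrame
    df.write.format("delta").mode("overwrite").save("/mnt/adls/output")  # hypothetical path

    # Per the reply, VACUUM may need several runs before the
    # _committed and _started files disappear.
    spark.sql("VACUUM delta.`/mnt/adls/output`")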

5 More Replies
by Jeff1, Contributor II
  • 2413 Views
  • 3 replies
  • 5 kudos

Resolved! Understanding Spark DataFrames versus R DataFrames

Community, I’ve been struggling with utilizing the R language in Databricks, and after reading “Mastering Spark with R,” I believe my initial problems stemmed from not understanding the difference between Spark DataFrames and R DataFrames within the Databric...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 5 kudos

Since Spark DataFrames are handled in a distributed way on the workers, it is better to just use Spark DataFrames. Additionally, collect() is executed on the driver and takes the whole dataset into memory, so it shouldn't be used in production.
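
To make the distinction concrete, a small PySpark sketch (the data is a stand-in) showing where the work happens:

    # Runs in a Databricks notebook, where `spark` is predefined.
    sdf = spark.range(1_000_000)            # Spark DataFrame, partitioned across workers

    evens = sdf.filter(sdf.id % 2 == 0)     # transformation: stays distributed, lazy
    print(evens.count())                    # action: aggregation runs on the workers

    local_rows = evens.limit(10).collect()  # collect() ships rows to the driver;
                                            # on the full dataset this can exhaust driver memory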

2 More Replies
by ninjadev999, New Contributor II
  • 6746 Views
  • 7 replies
  • 1 kudos

Resolved! Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks

I'm reading a huge CSV file with 39,795,158 records and writing it into a MSSQL server, on Azure Databricks. The Databricks notebook is running on a cluster with 56 GB memory, 16 cores, and 12 workers. This is my code in Python and PySpark: from ...
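
As a hedged sketch of a batched JDBC write (the connection details, table name, and file path below are hypothetical; batchsize and numPartitions are standard Spark JDBC options that often help with very large writes):

    df = spark.read.option("header", "true").csv("/mnt/data/huge.csv")  # hypothetical path

    (df.write
       .format("jdbc")
       .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
       .option("dbtable", "dbo.target_table")
       .option("user", "my_user")
       .option("password", "my_password")
       .option("batchsize", 10000)     # rows per JDBC batch insert
       .option("numPartitions", 12)    # parallel connections to the server
       .mode("append")
       .save())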

Latest Reply
User16764241763
Honored Contributor
  • 1 kudos

Hi, if you are using an Azure SQL DB Managed Instance, could you please file a support request with the Azure team? This is to review any timeouts or perf issues on the backend. Also, it seems like the timeout is coming from SQL Server, which is closing the conn...

6 More Replies
by Anonymous, Not applicable
  • 24477 Views
  • 4 replies
  • 4 kudos

Resolved! Spark is not able to resolve the columns correctly when joining data frames

Hello all, I'm using PySpark (Python 3.8) over Spark 3.0 on Databricks. When running this DataFrame join: next_df = days_currencies_matrix.alias('a').join(data_to_merge.alias('b'), [days_currencies_matrix.dt == data_to_merge.RATE_DATE, days...
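
A minimal sketch of the usual fix: reference the join columns through the aliases rather than the parent DataFrame objects (the schemas below are hypothetical stand-ins for the two DataFrames in the question):

    from pyspark.sql import functions as F

    days_currencies_matrix = spark.createDataFrame(
        [("2021-01-01", "EUR")], ["dt", "currency"])  # hypothetical schema
    data_to_merge = spark.createDataFrame(
        [("2021-01-01", "EUR", 1.1)], ["RATE_DATE", "CURRENCY_CODE", "RATE"])

    # F.col('a.dt') resolves via the alias, avoiding ambiguous-column errors.
    next_df = (days_currencies_matrix.alias("a")
               .join(data_to_merge.alias("b"),
                     (F.col("a.dt") == F.col("b.RATE_DATE")) &
                     (F.col("a.currency") == F.col("b.CURRENCY_CODE")),
                     "left"))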

Latest Reply
Anonymous
Not applicable
  • 4 kudos

@Alessio Palma - Howdy! My name is Piper, and I'm a moderator for the community. Would you be happy to mark whichever answer solved your issue so other members may find the solution more quickly?

3 More Replies
by alexraj84, New Contributor
  • 11426 Views
  • 2 replies
  • 0 kudos

How to read a fixed length file in Spark using DataFrame API and SCALA

I have a fixed length file (a sample is shown below) and I want to read this file using the DataFrames API in Spark, using Scala (not Python or Java). Using the DataFrames API there are ways to read text files, JSON files, and so on, but I'm not sure if there is a wa...
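
One common approach, sketched in PySpark (the question asks for Scala, but the same DataFrame calls exist there): read each line as a single string column, then slice the fields out with substring(). The path and column offsets below are hypothetical:

    from pyspark.sql import functions as F

    raw = spark.read.text("/mnt/data/fixed_width.txt")  # one string column named 'value'

    # substring(col, pos, len) is 1-based; offsets match a hypothetical layout.
    parsed = raw.select(
        F.substring("value", 1, 6).alias("emp_id"),
        F.substring("value", 7, 24).alias("first_name"),
        F.substring("value", 31, 24).alias("last_name"),
    )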

Latest Reply
Nagendra
New Contributor II
  • 0 kudos

Find the below solution which can be used. Let us consider this is the data in the file (fixed-width columns: EMP ID, First Name, Last Name): 1 Chris M ... 2 John ...

1 More Replies
by MikeBrewer, New Contributor II
  • 18888 Views
  • 3 replies
  • 0 kudos

Am trying to use SQL, but createOrReplaceTempView("myDataView") fails

Am trying to use SQL, but createOrReplaceTempView("myDataView") fails. I can create and display a DataFrame fine: import pandas as pd; df = pd.DataFrame(['$3,000,000.00', '$3,000.00', '$200.5', '$5.5'], columns=['Amount']); df. I add another cell, ...
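
The likely cause: createOrReplaceTempView is a method on Spark DataFrames, not pandas ones. A minimal sketch of the conversion (assumes a Databricks notebook, where spark and display are predefined):

    import pandas as pd

    pdf = pd.DataFrame(['$3,000,000.00', '$3,000.00', '$200.5', '$5.5'],
                       columns=['Amount'])

    sdf = spark.createDataFrame(pdf)           # convert pandas -> Spark
    sdf.createOrReplaceTempView("myDataView")  # now visible to SQL

    display(spark.sql("SELECT * FROM myDataView"))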

Latest Reply
sachinthana
New Contributor II
  • 0 kudos

This worked for me. Thank you @acorson

2 More Replies
by User15787040559, Databricks Employee
  • 4022 Views
  • 2 replies
  • 0 kudos

How to do a unionAll() when the number and the name of columns are different?

Looking at the API for DataFrame.unionAll(): when you have two DataFrames with a different number of columns and different names, unionAll() doesn't work. How can you do it? One possible solution is using the following function, which performs the union of tw...
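
On Spark 3.1+, unionByName with allowMissingColumns handles the different-columns case directly, filling missing columns with nulls; a sketch with hypothetical stand-in DataFrames:

    df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
    df2 = spark.createDataFrame([(2, 99)], ["id", "score"])

    # Columns are matched by name; 'name' and 'score' are null where absent.
    merged = df1.unionByName(df2, allowMissingColumns=True)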

Latest Reply
sean_owen
Databricks Employee
  • 0 kudos

I'm not sure union is the right tool if the DataFrames have fundamentally different information in them. If the difference is merely column names, yes, rename. If they don't, then the 'union' contemplated here is really a union of columns as well as ...

1 More Replies
by NEERAJRATHORE19, New Contributor
  • 10554 Views
  • 3 replies
  • 1 kudos

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange SinglePartition : Error

I am creating a dataframe using SQL in which all the underlying tables are actually temp views based on dataframes. I am getting the below error every time. Can anyone help me understand the issue here? Thanks in advance. An error occurred while calling o183....

Latest Reply
htinhk
New Contributor II
  • 1 kudos

I also encountered the same problem... It's weird that I can do the query but not the count.

2 More Replies
by RaymondXie, New Contributor
  • 8889 Views
  • 1 reply
  • 0 kudos

How to union multiple dataframe in pyspark within Databricks notebook

I have 4 DFs: Avg_OpenBy_Year, AvgHighBy_Year, AvgLowBy_Year, and AvgClose_By_Year; all of them have a common column, 'Year'. I want to join them together to get a final df like `Year, Open, High, Low, Close`. At the moment I have to use the ugly...

Latest Reply
thiago_matos
New Contributor II
  • 0 kudos

Import the reduce function this way: from functools import reduce
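
Putting the reply together with the question, a sketch of folding the four DataFrames into one with reduce, joining on the shared 'Year' column (the DataFrames below are hypothetical stand-ins):

    from functools import reduce

    avg_open = spark.createDataFrame([(2020, 10.0)], ["Year", "Open"])
    avg_high = spark.createDataFrame([(2020, 12.0)], ["Year", "High"])
    avg_low = spark.createDataFrame([(2020, 9.0)], ["Year", "Low"])
    avg_close = spark.createDataFrame([(2020, 11.0)], ["Year", "Close"])

    # reduce joins pairwise: join(join(join(open, high), low), close)
    final_df = reduce(lambda a, b: a.join(b, "Year"),
                      [avg_open, avg_high, avg_low, avg_close])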

by rlgarris, Databricks Employee
  • 16314 Views
  • 12 replies
  • 0 kudos

Resolved! How do I create a single CSV file from multiple partitions in Databricks / Spark?

Using spark-csv to write data to DBFS, which I plan to move to my laptop via standard S3 copy commands. The default for spark-csv is to write output into partitions. I can force it to a single partition, but would really like to know if there is a ge...
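
A common workaround, sketched with dbutils (the paths and names below are hypothetical): coalesce to one partition, write, then copy the single part file to a stable name:

    tmp_dir = "/mnt/out/_tmp_csv"
    df = spark.range(100)  # stand-in DataFrame

    df.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_dir)

    # Exactly one part file exists after coalesce(1); promote it and clean up.
    part = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]
    dbutils.fs.cp(part, "/mnt/out/result.csv")
    dbutils.fs.rm(tmp_dir, True)  # recursive delete of the temp directory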

Latest Reply
ChristianHomber
New Contributor II
  • 0 kudos

Without access to bash, it would be highly appreciated if an option within Databricks (e.g. via dbfsutils) existed.

11 More Replies
by SatheeshSathees, New Contributor
  • 6999 Views
  • 1 reply
  • 0 kudos

how to dynamically explode array type column in pyspark or scala

Hi, I have a parquet file with complex column types, with nested structs and arrays. I am using the script from the below link to flatten my parquet file. https://docs.microsoft.com/en-us/azure/synapse-analytics/how-to-analyze-complex-schema I am able ...
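
A sketch of one flattening pass in PySpark (the input path and schema are hypothetical; deeply nested schemas may need this repeated until no array or struct columns remain):

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StructType

    df = spark.read.parquet("/mnt/data/complex.parquet")  # hypothetical path

    for field in df.schema.fields:  # snapshot of the top-level schema
        if isinstance(field.dataType, ArrayType):
            # explode_outer keeps rows whose array is null or empty
            df = df.withColumn(field.name, F.explode_outer(field.name))
        elif isinstance(field.dataType, StructType):
            # expand struct fields to top level (watch for name collisions)
            df = df.select("*", F.col(field.name + ".*")).drop(field.name)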

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hello, please check out the below docs and notebook, which have similar examples: https://docs.microsoft.com/en-us/azure/synapse-analytics/how-to-analyze-complex-schema and https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/transform-comple...

by Nik, New Contributor III
  • 13529 Views
  • 19 replies
  • 0 kudos

write from a Dataframe to a CSV file, CSV file is blank

Hi, I am reading from a text file from a blob: val sparkDF = spark.read.format(file_type).option("header", "true").option("inferSchema", "true").option("delimiter", file_delimiter).load(wasbs_string + "/" + PR_FileName). Then I test my DataFra...

Latest Reply
nl09
New Contributor II
  • 0 kudos

Create a temp folder inside the output folder. Copy the part-00000* file with the desired file name to the output folder. Delete the temp folder. Python code snippet to do the same: fpath = output + '/' + 'temp'; def file_exists(path): try: dbutils.fs.ls(path) return...

18 More Replies
by BingQian, New Contributor II
  • 12983 Views
  • 2 replies
  • 0 kudos

Resolved! Error of "name 'IntegerType' is not defined" in attempting to convert a DF column to IntegerType

initialDF.withColumn("OriginalCol", initialDF.OriginalCol.cast(IntegerType)) or initialDF.withColumn("OriginalCol", initialDF.OriginalCol.cast(IntegerType())) However, it always failed with this error: NameError: name 'IntegerType' is not defined ...
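
A likely fix, sketched below: IntegerType lives in pyspark.sql.types and must be both imported and instantiated (with parentheses) when passed to cast(); the DataFrame is a hypothetical stand-in:

    from pyspark.sql.types import IntegerType

    initialDF = spark.createDataFrame([("42",)], ["OriginalCol"])  # stand-in

    converted = initialDF.withColumn(
        "OriginalCol", initialDF.OriginalCol.cast(IntegerType()))  # note the ()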

Latest Reply
BingQian
New Contributor II
  • 0 kudos

Thank you @Kristo Raun!

1 More Replies
by cfregly, Contributor
  • 5427 Views
  • 5 replies
  • 0 kudos
Latest Reply
srisre111
New Contributor II
  • 0 kudos

I am trying to store a dataframe as a table in Databricks and am encountering the following error; can someone help? "TypeError: field date: can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>"
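
That error typically means schema inference saw both strings and doubles in one field; one hedged workaround is to normalize the column and pass an explicit schema (the field and table names below are hypothetical):

    import pandas as pd
    from pyspark.sql.types import StructType, StructField, StringType

    pdf = pd.DataFrame({"date": ["2021-01-01", 20210102.0]})  # mixed types trigger the error

    schema = StructType([StructField("date", StringType(), True)])
    sdf = spark.createDataFrame(pdf.astype({"date": str}), schema=schema)
    sdf.write.saveAsTable("my_table")  # hypothetical table name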

4 More Replies