Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Anonymous
by Not applicable
  • 11595 Views
  • 9 replies
  • 8 kudos

Resolved! Data frame takes an unusually long time to write for small data sets

We have configured a workspace with our own VPC. We need to extract data from DB2 and write it in Delta format. We tried this for 550k records with 230 columns, and it took 50 minutes to complete the task. 15 million records take more than 18 hours. Not sure why this takes suc...

Latest Reply
Sown7
New Contributor II
  • 8 kudos

Facing the same issue: I have ~700k rows and I am trying to write this table, but it takes forever. Previously it once took only about 5 seconds to write, but since then, whenever we update the analysis and rewrite the table, it takes very long an...

  • 8 kudos
8 More Replies
MartinB
by Contributor III
  • 15025 Views
  • 5 replies
  • 3 kudos

Resolved! Interoperability Spark ↔ Pandas: can't convert Spark dataframe to Pandas dataframe via df.toPandas() when it contains datetime value in distant future

Hi, I have multiple datasets in my data lake that feature valid_from and valid_to columns indicating the validity of rows. If a row is currently valid, this is indicated by valid_to=9999-12-31 00:00:00. Example: Loading this into a Spark dataframe works fine...

Latest Reply
ThePhil
New Contributor II
  • 3 kudos

Be aware that in Databricks 15.2 LTS this behavior is broken. I cannot find the code, but it is most likely related to the following option: https://github.com/apache/spark/commit/c1c710e7da75b989f4d14e84e85f336bc10920e0#diff-f9ddcc6cba651c6ebfd34e29ef049c3...

  • 3 kudos
4 More Replies
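
A likely cause of this failure, assuming the conversion targets pandas' default nanosecond-resolution timestamps, is that 9999-12-31 lies beyond what a signed 64-bit count of nanoseconds since 1970 can hold. The sketch below uses only the standard library to work out that upper bound and to cap a sentinel date to it. The names `NS_MAX` and `clamp_for_ns` are hypothetical, and whether capping is acceptable depends on how valid_to is used downstream.

```python
from datetime import datetime, timedelta

# Largest instant representable as signed 64-bit nanoseconds since the Unix epoch,
# truncated to microseconds (the resolution of Python's datetime).
NS_MAX = datetime(1970, 1, 1) + timedelta(microseconds=(2**63 - 1) // 1000)


def clamp_for_ns(ts: datetime) -> datetime:
    """Cap a timestamp so it fits in a nanosecond-resolution timestamp column."""
    return min(ts, NS_MAX)


open_ended = datetime(9999, 12, 31)  # sentinel meaning "currently valid"
print(NS_MAX.date())                 # 2262-04-11
print(clamp_for_ns(open_ended) == NS_MAX)
```

Applying the same cap (or casting the column to a string) on the Spark side before calling df.toPandas() is one possible workaround; it is an assumption here, not something the thread confirms.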
databicky
by Contributor II
  • 22737 Views
  • 13 replies
  • 4 kudos
Latest Reply
FerArribas
Contributor
  • 4 kudos

Hi @Hubert Dudek, the pandas API doesn't support the abfss protocol. You have three options:
  • If you need to use pandas, you can write the Excel file to the local file system (dbfs) and then move it to ABFSS (for example with dbutils).
  • Write as CSV directly in abfss...

  • 4 kudos
12 More Replies
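
The first option in the reply above (write locally, then move) can be sketched with the standard library alone. This is only an illustration of the two-step pattern: the `csv` module stands in for the Excel writer, `shutil.copy` stands in for the dbutils copy step, and `write_then_move` plus all paths are hypothetical.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def write_then_move(rows, header, local_dir: Path, target_dir: Path, name: str) -> Path:
    """Write a file to local storage first, then copy it to the target location."""
    local_path = local_dir / name
    with open(local_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Stand-in for the copy step; on Databricks this is where dbutils would be used.
    return Path(shutil.copy(local_path, target_dir / name))


with tempfile.TemporaryDirectory() as tmp:
    local, target = Path(tmp) / "local", Path(tmp) / "target"
    local.mkdir()
    out = write_then_move([(1, "a"), (2, "b")], ["id", "val"], local, target, "report.csv")
    copied_text = out.read_text()
```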
amitdatabricksc
by New Contributor II
  • 13198 Views
  • 4 replies
  • 2 kudos

How to zip a dataframe

How can I zip a dataframe so that I get a zipped CSV output file? Please share the command. Only one dataframe is involved, not multiple.

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

Writing to a local directory does not work. See this topic: https://community.databricks.com/s/feed/0D53f00001M7hNlCAJ

  • 2 kudos
3 More Replies
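
For the zipped-CSV question above, a minimal sketch of the idea using only the standard library: serialize the rows as CSV and store them as a single member of a ZIP archive built in memory, so nothing is written to a local directory. A list of dicts stands in for the dataframe here, and `rows_to_zipped_csv` and `data.csv` are hypothetical names. It assumes the data is small enough to hold in memory on one machine.

```python
import csv
import io
import zipfile


def rows_to_zipped_csv(rows, fieldnames, member_name="data.csv") -> bytes:
    """Serialize rows as CSV and return the bytes of a ZIP archive containing it."""
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, text.getvalue())
    return buf.getvalue()


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
payload = rows_to_zipped_csv(rows, ["id", "name"])
```

The resulting `payload` bytes can then be handed to whatever storage client is in use.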
Rani
by New Contributor
  • 10465 Views
  • 2 replies
  • 0 kudos

Divide a dataframe into multiple smaller dataframes based on values in multiple columns in Scala

I have to divide a dataframe into multiple smaller dataframes based on values in columns like gender and state. The end goal is to pick up random samples from each dataframe. I am trying to implement a sample as explained below, I am quite new to th...

Latest Reply
subham0611
New Contributor II
  • 0 kudos

@raela I also have a similar use case. I am writing data to different Databricks tables based on a column value, but I am getting an insufficient disk space error and the driver is getting killed. I suspect the df.select(colName).distinct().collect() step is taki...

  • 0 kudos
1 More Replies
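
The splitting-and-sampling goal described in this thread can be illustrated in plain Python. This is a sketch of the logic only, not Spark or Scala code: it groups rows by the key columns in a single pass and draws up to `n` random rows per group, rather than first collecting the distinct values and filtering once per value. `sample_per_group` and the sample rows are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_group(rows, keys, n, seed=0):
    """Group rows by the given key columns and draw up to n random rows per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return {g: rng.sample(members, min(n, len(members))) for g, members in groups.items()}


pairs = [("F", "CA"), ("F", "CA"), ("F", "CA"), ("M", "CA"), ("M", "NY"), ("M", "NY")]
rows = [{"id": i, "gender": g, "state": s} for i, (g, s) in enumerate(pairs)]
samples = sample_per_group(rows, ["gender", "state"], n=2)
```

On a cluster the rows would not sit in driver memory like this; the point is only that one grouped pass replaces a loop of per-value filters.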
alexkit
by New Contributor II
  • 3804 Views
  • 4 replies
  • 3 kudos

ASP1.2 Error create database in Spark Programming with Databricks training

I'm on the Demo and Lab in the Dataframes section. I've imported the dbc into my company cluster and have run "%run ./Includes/Classroom-Setup" successfully. When I run the first SQL command %sql CREATE TABLE IF NOT EXISTS events USING parquet OPTIONS (path "/m...

Latest Reply
KDOCKX
New Contributor II
  • 3 kudos

I had the same issue and solved it like this: In the Includes folder, there is a reset notebook. Run the first command; this unmounts all mounted databases. Go back to the ASP 1.2 notebook and run the %run ./Includes/Classroom-Setup code block. Then run ...

  • 3 kudos
3 More Replies
Ram443
by New Contributor III
  • 46786 Views
  • 9 replies
  • 5 kudos

Resolved! I created a data frame but was not able to see the data

Code to create a data frame:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("oracle_queries").master("local[4]")\
  .config("spark.sql.warehouse.dir", "C:\\softwares\\git\\pyspark\\hive").getOrCreate()
from pyspark.sql.functions ...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 5 kudos

@ramanjaneyulu kancharla Can you please select my answer as the best answer?

  • 5 kudos
8 More Replies
pcriado
by New Contributor III
  • 8737 Views
  • 2 replies
  • 1 kudos

Resolved! Requested array size exceeds VM limit when saving to feature table

Hi, I'm trying to process a small dataset (less than 300 MB) composed of five queries that run with Spark. The end result of those queries is parsed using Python and merged into a data frame. Then I try to write this to a Delta Lake table using featu...

Latest Reply
pcriado
New Contributor III
  • 1 kudos

Hello, we have recently found that it's my user in particular that causes the memory issue. Two other users in my organization can run the same notebook without problems, but my user consistently consumes all available RAM and crashes the cluster... a...

  • 1 kudos
1 More Replies
etsyal1e2r3
by Honored Contributor
  • 11112 Views
  • 1 replies
  • 2 kudos

Resolved! Compiling Flattened Dataframe back to Struct Columns

I have a dataframe with this format of columns:[`first.second.third` , `alpha.bravo.test1` , `alpha.bravo.test2`]I'd like to get an output dataframe of this:[ `first` | `alpha` ] ---------------...

Latest Reply
etsyal1e2r3
Honored Contributor
  • 2 kudos

I have figured out the solution.

  • 2 kudos
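
The thread does not show the accepted solution, so the following is only a sketch of the underlying idea in plain Python: split each dotted column name on "." and rebuild the nesting level by level. The function name `nest` and the sample values are hypothetical. A Spark version would apply the same grouping to column names and build struct columns from the result.

```python
def nest(flat: dict) -> dict:
    """Turn {'a.b.c': v} style keys into nested dictionaries {'a': {'b': {'c': v}}}."""
    nested: dict = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for part in parents:
            # Reuse the existing level if another column already created it.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


row = {"first.second.third": 1, "alpha.bravo.test1": 2, "alpha.bravo.test2": 3}
structured = nest(row)
```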
konda1
by New Contributor
  • 1456 Views
  • 0 replies
  • 0 kudos

Getting Executor lost due to stage failure error on writing data frame to a delta table or any file like parquet or csv or avro

We are working on multiline nested (multilevel). The file is read and flattened using PySpark, and the data frame shows data using the display() method. When saving the same dataframe, it gives an executor lost failure error. For some files it is givi...

Neil
by New Contributor
  • 6691 Views
  • 1 replies
  • 0 kudos

Saving the Spark dataframe to a Delta table is taking too long

While working on a video analytics task, I need to save the image bytes, extracted earlier into the Spark dataframe, to the Delta table. I want to overwrite the same Delta table over the course of the complete task, and the size of the input data also differs...

Latest Reply
-werners-
Esteemed Contributor III
  • 0 kudos

Can you check the Spark UI to see where the time is spent? It can be a join, UDF, ...

  • 0 kudos
kll
by New Contributor III
  • 1313 Views
  • 0 replies
  • 0 kudos

Spark DataFrame apply Databricks geospatial indexing functions

I have a Spark DataFrame with `h3` hex ids and I am trying to obtain the polygon geometries.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr
from pyspark.databricks.sql.functions import *
from mosaic import enable_m...

Vishal09k
by New Contributor II
  • 3612 Views
  • 1 replies
  • 3 kudos

Display Command Not showing the Result, Rather giving the Dataframe Schema

Display Command Not showing the Result, Rather giving the Dataframe Schema 

Latest Reply
Rishabh-Pandey
Databricks MVP
  • 3 kudos

Hey, can you try your SQL query with this method: select * from (your sql query)

  • 3 kudos
arw1070
by New Contributor II
  • 3236 Views
  • 2 replies
  • 0 kudos

Downstream delta live table is unable to read data frame from upstream table

I have been trying to work on implementing delta live tables to a pre-existing workflow. Currently trying to create two tables: appointments_raw and notes_raw, where notes_raw is "downstream" of appointments_raw. Following this as a reference, I'm at...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Anna Wuest: Could you please send me the code snippet here? Thanks.

  • 0 kudos
1 More Replies
afzi
by New Contributor II
  • 3988 Views
  • 1 replies
  • 1 kudos

Pandas DataFrame error when using to_csv

Hi everyone, I would like to write a pandas DataFrame to /dbfs/FileStore/ using the to_csv method. Usually it would just write the DataFrame to the path described, but it has been giving me "FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/FileStor...

Latest Reply
Avinash_94
Databricks Employee
  • 1 kudos

f = open("/dbfs/mnt/blob/myNames.txt", "r")

  • 1 kudos
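
One common cause of a FileNotFoundError on write, though not necessarily the one in this thread, is that a parent directory of the target path does not exist yet. The sketch below creates the parent directories before writing. It uses the standard `csv` module in place of pandas and a temporary directory in place of /dbfs, and `write_csv` and all paths are hypothetical.

```python
import csv
import os
import tempfile


def write_csv(path, header, rows):
    """Create any missing parent directories, then write the CSV file."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    # The nested folders do not exist yet; write_csv creates them first.
    target = os.path.join(tmp, "FileStore", "exports", "names.csv")
    write_csv(target, ["name"], [["alice"]])
    with open(target) as f:
        written = f.read().splitlines()
```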