Data Engineering

Forum Posts

by BingQian • New Contributor II

05-23-2020 10:00:55 PM

16513 Views
2 replies
0 kudos

Resolved! Error of "name 'IntegerType' is not defined" in attempting to convert a DF column to IntegerType

initialDF.withColumn("OriginalCol", initialDF.OriginalCol.cast(IntegerType)) or initialDF.withColumn("OriginalCol", initialDF.OriginalCol.cast(IntegerType())). However, both always fail with this error: NameError: name 'IntegerType' is not defined ...
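
A NameError like the one quoted above is what Python raises when `IntegerType` was never imported. The usual fix is `from pyspark.sql.types import IntegerType`, then passing an instance, `IntegerType()`, to `cast`. As a minimal sketch, the helper `cast_column` below is hypothetical (it does not appear in the thread) and relies on the other form that `Column.cast` accepts, a type name string such as `"int"`, which needs no import.

```python
def cast_column(df, column_name, data_type="int"):
    """Return a copy of df with column_name cast to data_type.

    cast_column is a hypothetical helper for illustration. data_type may be
    a type name string such as "int", or a DataType instance such as
    IntegerType() imported from pyspark.sql.types.
    """
    return df.withColumn(column_name, df[column_name].cast(data_type))


# Usage in a notebook, with the DataFrame from the question:
#   from pyspark.sql.types import IntegerType  # the import the NameError points to
#   fixedDF = cast_column(initialDF, "OriginalCol", IntegerType())
#   fixedDF = cast_column(initialDF, "OriginalCol")  # same cast via the string "int"
```

The string form avoids the import entirely; the `IntegerType()` form is equivalent once the import is in place.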

Latest Reply

BingQian
New Contributor II

05-24-2020 6:50:09 PM

0 kudos

Thank you @Kristo Raun !

1 More Replies

by prakharjain • New Contributor

03-02-2020 10:34:16 AM

29091 Views
2 replies
0 kudos

Resolved! I need to edit my parquet files, and change field name, replacing space by underscore

Hello, I am running into the problem described in the following Stack Overflow topics: https://stackoverflow.com/questions/45804534/pyspark-org-apache-spark-sql-analysisexception-attribute-name-contains-inv https://stackoverflow.com/questions/38191157/spark-...

Latest Reply

DimitriBlyumin
New Contributor III

05-21-2020 4:48:22 AM

0 kudos

One option is to use something other than Spark to read the problematic file, e.g. Pandas, if your file is small enough to fit on the driver node (Pandas will only run on the driver). If you have multiple files - you can loop through them and fix on...
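
The renaming step itself can be sketched independently of how the file is read. The helper names `clean_column_name` and `clean_column_names` and the paths in the usage comment are assumptions for illustration, not taken from the thread; the sketch only replaces spaces with underscores, as the question asks.

```python
def clean_column_name(name):
    """Replace every space in a column name with an underscore."""
    return name.replace(" ", "_")


def clean_column_names(names):
    """Apply clean_column_name to each name, keeping the original order."""
    return [clean_column_name(n) for n in names]


# Usage with pandas on the driver (paths are placeholders):
#   import pandas as pd
#   pdf = pd.read_parquet("/dbfs/path/to/input.parquet")
#   pdf.columns = clean_column_names(pdf.columns)
#   pdf.to_parquet("/dbfs/path/to/output.parquet")
```

If two source names differ only by a space versus an underscore, they would collide after cleaning, so it is worth checking the result for duplicates before writing.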

1 More Replies

by ChristianHofste • New Contributor II

05-07-2020 4:21:37 AM

12416 Views
1 reply
0 kudos

Drop duplicates in Table

Hi, there is a function to delete data from a Delta Table: deltaTable = DeltaTable.forPath(spark, "/data/events/") deltaTable.delete(col("date") < "2017-01-01") But is there also a way to drop duplicates somehow? Like deltaTable.dropDuplicates()......

Latest Reply

shyam_9
Databricks Employee

05-19-2020 2:55:34 PM

0 kudos

Hi @Christian Hofstetter, you can find information on this here: https://docs.delta.io/0.4.0/delta-update.html#data-deduplication-when-writing-into-delta-tables
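
As a rough sketch of two ways to approach this: the first function follows the insert-only merge pattern that the linked page describes, which keeps new duplicates out at write time; the second removes duplicates that already exist by rewriting the table, which is an assumption on my part and only sensible for tables small enough to rewrite. Both function names and the key column `eventId` in the usage comment are hypothetical.

```python
def append_without_duplicates(delta_table, new_rows, key_column):
    """Insert only those rows of new_rows whose key is not already in the table.

    delta_table is a DeltaTable, new_rows a DataFrame. Rows that match an
    existing key are left untouched (insert-only merge).
    """
    condition = f"existing.{key_column} = incoming.{key_column}"
    (
        delta_table.alias("existing")
        .merge(new_rows.alias("incoming"), condition)
        .whenNotMatchedInsertAll()
        .execute()
    )


def rewrite_without_duplicates(spark, path, key_columns=None):
    """Read the Delta table at path, drop duplicate rows, and overwrite it.

    key_columns is an optional list of column names; None compares all columns.
    """
    deduplicated = spark.read.format("delta").load(path).dropDuplicates(key_columns)
    deduplicated.write.format("delta").mode("overwrite").save(path)


# Usage in a notebook (eventId is a placeholder key column):
#   from delta.tables import DeltaTable
#   deltaTable = DeltaTable.forPath(spark, "/data/events/")
#   append_without_duplicates(deltaTable, newEvents, "eventId")
#   rewrite_without_duplicates(spark, "/data/events/", ["eventId"])
```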

by JigaoLuo • New Contributor

12-25-2019 4:01:36 AM

7811 Views
3 replies
0 kudos

OPTIMIZE error: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'OPTIMIZE'

Hi everyone. I am trying to learn the OPTIMIZE keyword from this blog using Scala: https://docs.databricks.com/delta/optimizations/optimization-examples.html#delta-lake-on-databricks-optimizations-scala-notebook. But my local Spark seems not able t...

Latest Reply

Anonymous
Not applicable

05-13-2020 2:30:18 PM

0 kudos

Hi Jigao, OPTIMIZE isn't in the open source Delta API, so it won't run on your local Spark instance: https://docs.delta.io/latest/api/scala/io/delta/tables/index.html?search=optimize

2 More Replies

by EricThomas • New Contributor

04-24-2020 4:44:25 PM

14605 Views
2 replies
0 kudos

!pip install vs. dbutils.library.installPyPI()

Hello, Scenario: Trying to install some Python modules into a notebook (scoped to just the notebook) using...
```
dbutils.library.installPyPI("azure-identity")
dbutils.library.installPyPI("azure-storage-blob")
dbutils.library.restartPython()
```
...ge...

Latest Reply

eishbis
New Contributor II

04-27-2020 8:13:59 PM

0 kudos

Hi @ericOnline, I faced the same issue and eventually found that upgrading the Databricks Runtime version from my current "5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)" to "6.5 (Scala 2.11, Spark 2.4.5)" resolved it. Though the offic...

1 More Replies

by RaghuMundru • New Contributor III

10-15-2015 7:11:03 AM

48522 Views
15 replies
0 kudos

Resolved! I am running simple count and I am getting an error

Here is the error that I get when I run the following query: statement = sqlContext.sql("SELECT count(*) FROM ARDATA_2015_09_01").show() Py4JJavaError Traceback (most rec...

14 More Replies

by Anbazhagananbut • New Contributor II

04-16-2020 11:49:21 AM

8627 Views
1 reply
0 kudos

Get Size of a column in Bytes for a Pyspark Data frame

Hello all, I have a column in a DataFrame which is of struct type. I want to find the size of the column in bytes. It is failing while loading into Snowflake. I can see size functions available to get the length. How to calculate the size in bytes fo...

Latest Reply

sean_owen
Databricks Employee

04-17-2020 2:16:43 PM

0 kudos

There isn't one size for a column; it takes some amount of bytes in memory, but a different amount potentially when serialized on disk or stored in Parquet. You can work out the size in memory from its data type; an array of 100 bytes takes 100 byte...

by ubsingh • New Contributor II

11-07-2019 3:44:50 AM

13643 Views
3 replies
1 kudos

Resolved! I want to create a function in an Azure Databricks notebook to send an email, based on a filter. Any leads are appreciated.

I have no idea where to start.

Latest Reply

ubsingh
New Contributor II

11-13-2019 1:05:26 AM

1 kudos

Thanks for your help @leedabee. I will go with the second option; the first one is not applicable in my case.

2 More Replies

by Anbazhagananbut • New Contributor II

04-07-2020 11:14:28 PM

14146 Views
1 reply
1 kudos

How to handle Blank values in Array of struct elements in pyspark

Hello all, we have data in a column of a PySpark DataFrame that is an array of struct type with multiple nested fields. If the value is not blank, it saves the data in the same array of struct type in a Spark Delta table. Please advise on the bel...

Latest Reply

shyam_9
Databricks Employee

04-15-2020 12:05:23 PM

1 kudos

Hi @Anbazhagan anbutech17, can you please try the approach in the answers here: https://stackoverflow.com/questions/56942683/how-to-add-null-columns-to-complex-array-struct-in-spark-with-a-udf

by Juan_MiguelTrin • Databricks Partner

03-23-2020 3:23:16 PM

8195 Views
1 reply
0 kudos

How to resolve out of memory error?

I have a Databricks notebook hosted on Azure. I am having this problem when doing an INNER JOIN. I tried creating a much larger cluster configuration, but it still raises an out-of-memory error: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquir...

Latest Reply

shyam_9
Databricks Employee

03-30-2020 1:30:56 PM

0 kudos

Hi @Juan Miguel Trinidad, can you please try the suggestions below: http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-OutOfMemoryError-Unable-to-acquire-bytes-of-memory-td16773.html

by SohelKhan • New Contributor II

02-21-2016 10:27:36 PM

18492 Views
3 replies
0 kudos

PySpark DataFrame: Select all but one or a set of columns

In some SQL implementations, we can write select -col_A to select all columns except col_A. I tried it in Spark 1.6.0 as follows: for a DataFrame df with three columns col_A, col_B, col_C: df.select('col_B', 'col_C') # it works df....

Latest Reply

NavitaJain
New Contributor II

03-25-2020 4:21:12 PM

0 kudos

@sk777, @zjffdu, @Lejla Metohajrova if your columns are time-series ordered OR you want to maintain their original order... use cols = [c for c in df.columns if c != 'col_A'] df[cols]
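
The list comprehension in this reply generalizes to a set of excluded columns. A minimal sketch follows; `columns_except` is a hypothetical helper name, and it works on plain lists of column names, so it can be fed `df.columns` directly.

```python
def columns_except(all_columns, excluded):
    """Return all_columns without the excluded names, preserving order."""
    excluded = set(excluded)
    return [c for c in all_columns if c not in excluded]


# Usage with a DataFrame df that has columns col_A, col_B, col_C:
#   df.select(columns_except(df.columns, ["col_A"]))
#   df.drop("col_A")  # built-in alternative
```

`DataFrame.drop` gives the same result for the simple case; the helper is mainly useful when the remaining column list is needed for something other than a select.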

2 More Replies

by AmitSukralia • New Contributor

06-02-2019 4:22:04 AM

36942 Views
5 replies
0 kudos

Listing all files under an Azure Data Lake Gen2 container

I am trying to find a way to list all files in an Azure Data Lake Gen2 container. I have mounted the storage account and can see the list of files in a folder (a container can have multiple levels of folder hierarchy) if I know the exact path of th...
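
One way to approach this, sketched under the assumption that the container is mounted and readable through `dbutils.fs.ls`: walk the folder tree and collect every entry that is not a directory. The function name `list_files_recursively` and the mount point in the usage comment are hypothetical; the listing function is passed in as a parameter so the walk does not depend on the notebook's `dbutils` global.

```python
def list_files_recursively(ls, root):
    """Return the paths of all files under root, descending into folders.

    ls is any callable shaped like dbutils.fs.ls: it takes a path and
    returns entries that expose .path and .isDir().
    """
    files = []
    pending = [root]  # folders still to be listed
    while pending:
        current = pending.pop()
        for entry in ls(current):
            if entry.isDir():
                pending.append(entry.path)
            else:
                files.append(entry.path)
    return files


# Usage in a Databricks notebook (the mount point is a placeholder):
#   all_files = list_files_recursively(dbutils.fs.ls, "/mnt/my-container/")
```

The explicit `pending` stack avoids hitting Python's recursion limit on deep hierarchies; each folder still costs one listing call, so very large containers may take a while.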

Latest Reply

Balaji_su
New Contributor II

03-22-2020 10:04:37 AM

0 kudos

stackoverflow.png, files.txt

4 More Replies

by cfregly • Contributor

04-28-2015 1:03:12 PM

8916 Views
5 replies
0 kudos

How do I cast using a DataFrame?

Latest Reply

srisre111
New Contributor II

03-19-2020 7:24:38 AM

0 kudos

I am trying to store a DataFrame as a table in Databricks and am encountering the following error. Can someone help? "TypeError: field date: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>"

4 More Replies

by dhanunjaya • New Contributor II

09-20-2018 12:33:32 AM

11919 Views
6 replies
0 kudos

How to remove empty rows from the data frame

Let's assume I have 10 columns in a data frame, and all 10 columns have empty values for 100 rows out of 200. How can I skip the empty rows?

Latest Reply

GaryDiaz
New Contributor II

03-18-2020 10:44:03 AM

0 kudos

You can try this: df.na.drop(how = "all"). This removes a row only if all of the values in that row are null or NaN.

5 More Replies

by AlaQabaja • New Contributor II

09-19-2019 10:02:46 AM

8192 Views
3 replies
0 kudos

Get last modified date or create date for azure blob container

Hi everyone, I am trying to implement a way in Python to only read files that weren't loaded since the last run of my notebook. The way I am thinking of implementing this is to keep track of the last time my notebook finished in a database table. Nex...

2 More Replies

Databricks Community
