Data Engineering

Forum Posts

AlexRomano
by New Contributor
  • 5411 Views
  • 1 reply
  • 0 kudos

PicklingError: Could not pickle the task to send it to the workers.

I am using sklearn in a Databricks notebook to fit an estimator in parallel. Sklearn uses joblib with the loky backend to do this. Now, I have a file in Databricks which I can import my custom Classifier from, and everything works fine. However, if I lite...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi aromano, I know this issue was opened almost a year ago, but I faced the same problem and I was able to solve it. So, I'm sharing the solution in order to help others. Probably, you're using SparkTrials to optimize the model's hyperparameters ...
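
A minimal sketch of the kind of change this reply seems to describe, assuming the pickling error comes from hyperopt's SparkTrials trying to ship a notebook-defined estimator to Spark workers; swapping in the driver-local Trials avoids that serialization. The dataset, estimator, and search space below are illustrative only:

```python
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # toy data, just for illustration

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]))
    score = cross_val_score(clf, X, y, cv=3).mean()
    return {"loss": -score, "status": STATUS_OK}

best = fmin(
    fn=objective,
    space={"n_estimators": hp.quniform("n_estimators", 10, 100, 10)},
    algo=tpe.suggest,
    max_evals=10,
    trials=Trials(),  # driver-local; SparkTrials() would pickle the objective and estimator to workers
)
```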

Mir_SakhawatHos
by New Contributor II
  • 28082 Views
  • 2 replies
  • 3 kudos

How can I delete folders from my DBFS?

I want to delete a folder I created in DBFS, but how? And how can I download files from there?

Latest Reply
IA
New Contributor II
  • 3 kudos

Hello, Max's answer focuses on the CLI. Instead, using the Community Edition platform, proceed as follows: # You must first delete all files in your folder. 1. import org.apache.hadoop.fs.{Path, FileSystem}  2. dbutils.fs.rm("/FileStore/tables/file.cs...
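
A minimal sketch of the dbutils approach this reply starts to describe; the paths below are placeholders, and the second argument to rm enables recursive deletion:

```python
# Hypothetical folder path; replace with your own DBFS location.
dbutils.fs.rm("/FileStore/tables/my_folder", True)  # True = recurse: delete the folder and its contents

# To get a file out of DBFS, one option is to copy it under /FileStore;
# files there can typically be downloaded via the workspace's /files/ URL.
dbutils.fs.cp("/FileStore/tables/my_folder/file.csv", "/FileStore/downloads/file.csv")
```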

1 More Replies
bhaumikg
by New Contributor II
  • 11988 Views
  • 7 replies
  • 2 kudos

Databricks throwing error "SQL DW failed to execute the JDBC query produced by the connector." while pushing the column with string length more than 255

I am using Databricks to transform the data and then pushing it into the data lake. The data gets pushed in if the length of the string field is 255 or less, but it throws the following error if it is longer than that: "SQL DW failed to execute the JDB...

Latest Reply
bhaumikg
New Contributor II
  • 2 kudos

As suggested by ZAIvR, please use append and provide maxlength while pushing the data. Overwrite may not work with this unless the Databricks team has fixed the issue.
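
A hedged sketch of that approach using the SQL DW connector's maxStrLength option; the URL, temp directory, and table name below are placeholders, not values from the thread:

```python
# All connection details and table names here are placeholders.
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
   .option("tempDir", "wasbs://<container>@<account>.blob.core.windows.net/tempdir")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.my_table")
   .option("maxStrLength", "4000")   # widen string columns beyond the connector's default NVARCHAR length
   .mode("append")                   # append, as suggested; overwrite reportedly still failed
   .save())
```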

6 More Replies
Nik
by New Contributor III
  • 7308 Views
  • 19 replies
  • 0 kudos

write from a Dataframe to a CSV file, CSV file is blank

Hi, I am reading a text file from a blob: val sparkDF = spark.read.format(file_type) .option("header", "true") .option("inferSchema", "true") .option("delimiter", file_delimiter) .load(wasbs_string + "/" + PR_FileName) Then I test my Datafra...

Latest Reply
nl09
New Contributor II
  • 0 kudos

Create a temp folder inside the output folder, copy the part-00000* file into the output folder under the desired file name, then delete the temp folder. Python code snippet to do the same: fpath=output+'/'+'temp' def file_exists(path): try: dbutils.fs.ls(path) return...
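
A minimal, hedged sketch of that temp-folder pattern; the destination path, file name, and the sparkDF variable are placeholders following the question above:

```python
output = "/mnt/out/report"      # hypothetical destination folder
tmp = output + "/temp"

# Write to a temp folder as a single partition so only one part-00000 file is produced.
(sparkDF.coalesce(1)
        .write.option("header", "true")
        .mode("overwrite")
        .csv(tmp))

# Copy the part file to the output folder under a friendly name, then drop the temp folder.
part = [f.path for f in dbutils.fs.ls(tmp) if f.name.startswith("part-")][0]
dbutils.fs.cp(part, output + "/result.csv")
dbutils.fs.rm(tmp, True)        # recursive delete of the temp folder
```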

18 More Replies
pmezentsev
by New Contributor
  • 5672 Views
  • 7 replies
  • 0 kudos

Pyspark. How to get best params in grid search

Hello! I am using Spark 2.1.1 in Python (Python 2.7, executed in a Jupyter notebook) and trying to run a grid search for linear regression parameters. My code looks like this: from pyspark.ml.tuning import CrossValidator, ParamGridBuilder from pyspark.ml impo...
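
For reference, a hedged sketch of how best parameters are usually read back from a fitted cross-validation run; `paramGrid` (the list built with ParamGridBuilder) and `cvModel` (the result of CrossValidator(...).fit(...)) are assumed names, and argmin assumes a lower-is-better metric such as RMSE:

```python
import numpy as np

best_idx = int(np.argmin(cvModel.avgMetrics))   # avgMetrics has one entry per parameter combination
best_params = paramGrid[best_idx]               # the param map that produced the best metric

for param, value in best_params.items():
    print(param.name, "=", value)

best_model = cvModel.bestModel                  # the model refit with those parameters
```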

Latest Reply
phamyen
New Contributor II
  • 0 kudos

This is a great article. It gave me a lot of useful information. thank you very much download app

6 More Replies
BingQian
by New Contributor II
  • 9825 Views
  • 2 replies
  • 0 kudos

Resolved! Error of "name 'IntegerType' is not defined" in attempting to convert a DF column to IntegerType

initialDF .withColumn("OriginalCol", initialDF.OriginalCol.cast(IntegerType)) Or initialDF .withColumn("OriginalCol", initialDF.OriginalCol.cast(IntegerType())) However, it always failed with this error: NameError: name 'IntegerType' is not defined ...
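
For context, a minimal sketch of the usual fix: IntegerType lives in pyspark.sql.types and needs to be imported (and instantiated) before the cast:

```python
from pyspark.sql.types import IntegerType

fixedDF = initialDF.withColumn("OriginalCol", initialDF.OriginalCol.cast(IntegerType()))
# Equivalent shorthand that avoids the import entirely:
# fixedDF = initialDF.withColumn("OriginalCol", initialDF.OriginalCol.cast("int"))
```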

Latest Reply
BingQian
New Contributor II
  • 0 kudos

Thank you @Kristo Raun​  !

1 More Replies
prakharjain
by New Contributor
  • 12141 Views
  • 2 replies
  • 0 kudos

Resolved! I need to edit my parquet files and change field names, replacing spaces with underscores

Hello, I am facing the trouble described in the following Stack Overflow topics: https://stackoverflow.com/questions/45804534/pyspark-org-apache-spark-sql-analysisexception-attribute-name-contains-inv https://stackoverflow.com/questions/38191157/spark-...

Latest Reply
DimitriBlyumin
New Contributor III
  • 0 kudos

One option is to use something other than Spark to read the problematic file, e.g. Pandas, if your file is small enough to fit on the driver node (Pandas will only run on the driver). If you have multiple files - you can loop through them and fix on...
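
A hedged sketch of that Pandas workaround; the file paths are placeholders, and pandas.read_parquet requires pyarrow or fastparquet to be installed:

```python
import pandas as pd

pdf = pd.read_parquet("/dbfs/mnt/raw/problem_file.parquet")          # hypothetical input path
pdf.columns = [c.replace(" ", "_") for c in pdf.columns]             # space -> underscore in field names
pdf.to_parquet("/dbfs/mnt/clean/problem_file.parquet", index=False)  # hypothetical output path
```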

1 More Replies
ChristianHofste
by New Contributor II
  • 10449 Views
  • 1 reply
  • 0 kudos

Drop duplicates in Table

Hi, there is a function to delete data from a Delta Table: deltaTable = DeltaTable.forPath(spark, "/data/events/") deltaTable.delete(col("date") < "2017-01-01") But is there also a way to drop duplicates somehow? Like deltaTable.dropDuplicates()......

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Christian Hofstetter, you can check here for info on the same: https://docs.delta.io/0.4.0/delta-update.html#data-deduplication-when-writing-into-delta-tables
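
Following the linked page, a hedged sketch of the insert-only MERGE pattern that deduplicates while writing into a Delta table; the table path, the newEvents source DataFrame, and the eventId key column are placeholders:

```python
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/data/events/")
(deltaTable.alias("events")
    .merge(newEvents.alias("updates"), "events.eventId = updates.eventId")  # assumed key column
    .whenNotMatchedInsertAll()   # insert only rows whose key is not already in the table
    .execute())
```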

JigaoLuo
by New Contributor
  • 4053 Views
  • 3 replies
  • 0 kudos

OPTIMIZE error: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'OPTIMIZE'

Hi everyone. I am trying to learn the keyword OPTIMIZE from this blog using Scala: https://docs.databricks.com/delta/optimizations/optimization-examples.html#delta-lake-on-databricks-optimizations-scala-notebook. But my local Spark seems not able t...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi Jigao, OPTIMIZE isn't in the open source Delta API, so it won't run on your local Spark instance: https://docs.delta.io/latest/api/scala/io/delta/tables/index.html?search=optimize

2 More Replies
EricThomas
by New Contributor
  • 9470 Views
  • 2 replies
  • 0 kudos

!pip install vs. dbutils.library.installPyPI()

Hello, Scenario: Trying to install some python modules into a notebook (scoped to just the notebook) using...``` dbutils.library.installPyPI("azure-identity") dbutils.library.installPyPI("azure-storage-blob") dbutils.library.restartPython()``` ...ge...

Latest Reply
eishbis
New Contributor II
  • 0 kudos

Hi @ericOnline, I also faced the same issue and I eventually found that upgrading the Databricks runtime version from my current "5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)" to "6.5 (Scala 2.11, Spark 2.4.5)" resolved this issue. Though the offic...

1 More Replies
RaghuMundru
by New Contributor III
  • 26289 Views
  • 15 replies
  • 0 kudos

Resolved! I am running simple count and I am getting an error

Here is the error that I am getting when I run the following query: statement=sqlContext.sql("SELECT count(*) FROM ARDATA_2015_09_01").show() --------------------------------------------------------------------------- Py4JJavaError Traceback (most rec...

Latest Reply
muchave
New Contributor II
  • 0 kudos

192.168.o.1 is a private IP address used to login the admin panel of a router. 192.168.l.l is the host address to change default router settings.

14 More Replies
Anbazhagananbut
by New Contributor II
  • 5851 Views
  • 1 reply
  • 0 kudos

Get Size of a column in Bytes for a Pyspark Data frame

Hello All, I have a column in a dataframe which is of struct type. I want to find the size of the column in bytes. It is failing while loading into Snowflake. I could see size functions available to get the length. How to calculate the size in bytes fo...

Latest Reply
sean_owen
Honored Contributor II
  • 0 kudos

There isn't one size for a column; it takes some amount of bytes in memory, but a different amount potentially when serialized on disk or stored in Parquet. You can work out the size in memory from its data type; an array of 100 bytes takes 100 byte...
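
As a rough, hedged illustration of the "work it out" advice above, one way to gauge a per-row serialized size is to measure the byte length of the struct rendered as JSON. The column name structCol is a placeholder, this assumes a Spark runtime with the octet_length SQL function, and the result approximates a serialized footprint rather than the in-memory or Parquet size:

```python
from pyspark.sql import functions as F

# Approximate per-row byte size of the struct column when serialized as JSON.
sized = df.withColumn("approx_bytes", F.expr("octet_length(to_json(structCol))"))
sized.agg(F.sum("approx_bytes").alias("total_approx_bytes")).show()
```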

ubsingh
by New Contributor II
  • 8841 Views
  • 3 replies
  • 1 kudos
Latest Reply
ubsingh
New Contributor II
  • 1 kudos

Thanks for your help @leedabee. I will go through the second option; the first one is not applicable in my case.

2 More Replies
Anbazhagananbut
by New Contributor II
  • 7691 Views
  • 1 reply
  • 1 kudos

How to handle Blank values in Array of struct elements in pyspark

Hello All, we have data in a column of a PySpark dataframe with array of struct type having multiple nested fields present. If the value is not blank it will save the data in the same array of struct type in a Spark Delta table. Please advise on the bel...

Latest Reply
shyam_9
Valued Contributor
  • 1 kudos

Hi @Anbazhagan anbutech17, can you please try as in the below answer: https://stackoverflow.com/questions/56942683/how-to-add-null-columns-to-complex-array-struct-in-spark-with-a-udf

Juan_MiguelTrin
by New Contributor
  • 6172 Views
  • 1 reply
  • 0 kudos

How to resolve out of memory error?

I have a Databricks notebook hosted on Azure. I am having this problem when doing an INNER JOIN. I tried creating a much higher cluster configuration but it is still throwing an OutOfMemoryError. org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquir...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Juan Miguel Trinidad, can you please check the below suggestions: http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-OutOfMemoryError-Unable-to-acquire-bytes-of-memory-td16773.html
