Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

AnaDel_Campo_Me
by New Contributor
  • 10340 Views
  • 2 replies
  • 1 kudos

FileNotFoundError: [Errno 2] No such file or directory or IsADirectoryError: [Errno 21] Is a directory

I have been trying to open a file on DBFS using all different combinations. If I use the following code: with open("/dbfs/FileStore/df/Downloadedfile.csv", 'r', newline='') as f I get IsADirectoryError: [Errno 21] Is a directory; with open("dbfs:...

Latest Reply
paulmark
New Contributor II
  • 1 kudos

To get rid of this error, you can try Python's file-existence checks so that at least Python can confirm whether the file exists. In other words, you can make sure that the user has indeed typed a correct path to a real existing file (see the sketch after this post). If the user do...

1 More Reply
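A minimal sketch of the existence check suggested in the reply above, assuming the cluster exposes the DBFS FUSE mount at /dbfs; the path is the one from the question and will differ in your workspace.

import os

path = "/dbfs/FileStore/df/Downloadedfile.csv"  # path from the question; adjust to your workspace

if not os.path.exists(path):
    # open() would raise FileNotFoundError here: the path is wrong or the file was never uploaded
    print("Path does not exist:", path)
elif os.path.isdir(path):
    # open() would raise IsADirectoryError here: the "file" is really a folder,
    # e.g. when a distributed write produced a directory of part files
    print("Path is a directory, contents:", os.listdir(path))
else:
    with open(path, "r", newline="") as f:
        print(f.readline())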
Seenu45
by New Contributor II
  • 5011 Views
  • 3 replies
  • 1 kudos

Resolved! 'JavaPackage' object is not callable :: MLeap

Hi folks, we are working on a production Databricks project using MLeap. When we run the code below on Databricks, it throws an error like "'JavaPackage' object is not callable". Code: import mleap.pyspark from mleap.pyspark.spark_support import SimpleSparkSer...

Latest Reply
Seenu45
New Contributor II
  • 1 kudos

Thanks syamspr. It is working now.

2 More Replies
pepevo
by New Contributor III
  • 11177 Views
  • 10 replies
  • 0 kudos

Resolved! How to convert column type from decimal to date in sparksql

I need to convert a column type from decimal to date in Spark SQL when the format is not yyyy-mm-dd. A table contains a column declared as decimal(38,0) whose data is in yyyymmdd format, and I am unable to run SQL queries on it in a Databricks notebook. ...

Latest Reply
pepevo
New Contributor III
  • 0 kudos

Thank you Tom. I made it work already.

9 More Replies
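The accepted fix is not visible in this preview, so here is a hedged sketch of one common way to handle a decimal(38,0) column holding yyyymmdd values: cast it to a string and parse it with to_date. The column name and sample values below are made up; in SQL the equivalent is to_date(cast(col AS string), 'yyyyMMdd').

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: a decimal(38,0) column holding yyyymmdd values.
df = spark.createDataFrame([(20190603,), (20181231,)], ["dt_dec"])
df = df.withColumn("dt_dec", F.col("dt_dec").cast(DecimalType(38, 0)))

# Cast the decimal to string, then parse it with the yyyyMMdd pattern.
df = df.withColumn("dt", F.to_date(F.col("dt_dec").cast("string"), "yyyyMMdd"))
df.show()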
ArielHerrera
by New Contributor II
  • 14805 Views
  • 5 replies
  • 2 kudos

Resolved! How to display SHAP plots?

I am looking to display SHAP plots, here is the code: import xgboost import shap shap.initjs() # load JS visualization code to notebook X, y = shap.datasets.boston() # train XGBoost model model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatri...

Latest Reply
lrnzcig
New Contributor II
  • 2 kudos

As @Vinh dqvinh87 noted, the accepted solution only works for force_plot. For other plots, the following trick works for me: import matplotlib.pyplot as plt p = shap.summary_plot(shap_values, test_df, show=False) display(p)

4 More Replies
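A self-contained version of that trick, as a sketch only: shap.datasets.boston() has been removed from newer shap releases (substitute any regression dataset), and display() is the Databricks notebook helper, so the last line only works inside a notebook.

import matplotlib.pyplot as plt
import shap
import xgboost

# Train a small model so there is something to explain.
X, y = shap.datasets.boston()
model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatrix(X, label=y), 100)

# Draw the summary plot without showing it, then hand the current figure to display().
shap_values = shap.TreeExplainer(model).shap_values(X)
shap.summary_plot(shap_values, X, show=False)
display(plt.gcf())  # Databricks notebook helper; not defined in plain Python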
Muthu145
by New Contributor
  • 10519 Views
  • 3 replies
  • 0 kudos

KNN classifier on Spark

Hi Team, can you please help me implement a KNN classifier in PySpark using a distributed architecture and process the dataset? I also want to validate the KNN model against the testing dataset. I tried to use scikit-learn but the program is runn...

Latest Reply
SouravSaha
New Contributor II
  • 0 kudos

Hey, how about using the NEC Frovedis (https://github.com/frovedis/frovedis) framework for the same? Refer: https://github.com/frovedis/frovedis/blob/master/src/foreign_if/python/examples/unsupervised_knn_demo.py It works on a distributed framework (...

2 More Replies
zhaoxuan210
by New Contributor
  • 19125 Views
  • 1 reply
  • 0 kudos

How can I read all the files in a folder on S3 into several pandas dataframes?

import pandas as pd import glob path = "s3://somewhere/" # use your path all_files = glob.glob(path + "/*.csv") print(all_files) li = [] for filename in all_files: dfi = pd.read_csv(filename,names =['acct_id', 'SOR_ID'], dtype={'acct_id':str,...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @zhaoxuan210, please go through the answer below (a hedged sketch of that approach follows this post): https://stackoverflow.com/questions/52855221/reading-multiple-csv-files-from-s3-bucket-with-boto3

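A minimal sketch of the boto3 approach from the linked answer. The bucket and prefix are placeholders, the column names come from the question, and it assumes the cluster already has AWS credentials available (instance profile or environment variables).

import io
import boto3
import pandas as pd

bucket, prefix = "somewhere", "data/"  # hypothetical bucket and prefix; replace with yours

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

dataframes = []
for obj in response.get("Contents", []):
    key = obj["Key"]
    if not key.endswith(".csv"):
        continue
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    dataframes.append(pd.read_csv(io.BytesIO(body), names=["acct_id", "SOR_ID"], dtype={"acct_id": str}))

# One dataframe per file, as in the question; pd.concat(dataframes) combines them.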
Van-DuyetLe
by New Contributor
  • 25422 Views
  • 5 replies
  • 1 kudos

What's the difference between Interactive Clusters and Job Cluster?

I am new to Databricks. I would like to know the difference between interactive clusters and job clusters; there is no official documentation on this at the moment.

Latest Reply
Forum_Admin
Contributor
  • 1 kudos

Sports news Football news International football news Football news Thai football news, Thai football Follow news, know sports news at Siamsportnews

4 More Replies
User16301467532
by New Contributor II
  • 18168 Views
  • 9 replies
  • 1 kudos

How can I change the parquet compression algorithm from gzip to something else?

Spark, by default, uses gzip to store parquet files. I would like to change the compression algorithm from gzip to snappy or lz4.

Latest Reply
ZhenZeng
New Contributor II
  • 1 kudos

spark.sql("set spark.sql.parquet.compression.codec=gzip");

8 More Replies
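The reply above shows the relevant configuration key with gzip as the value; to move to snappy (or another codec) as the question asks, set the same key, or pass a per-write option. A short sketch with placeholder paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # tiny example dataframe

# Session-wide default codec for all parquet writes on this SparkSession.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.mode("overwrite").parquet("/tmp/parquet_snappy_demo")

# Or choose the codec for a single write without changing the session default
# (lz4/zstd availability depends on the Spark and Parquet versions on the cluster).
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/parquet_demo2")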
Venkata_Krishna
by New Contributor
  • 8119 Views
  • 1 reply
  • 0 kudos

convert string dataframe column MM/dd/yyyy hh:mm:ss AM/PM to timestamp MM-dd-yyyy hh:mm:ss

How do I convert the string 6/3/2019 5:06:00 AM to a timestamp in 24-hour format MM-dd-yyyy hh:mm:ss in Python Spark?

Latest Reply
lee
Contributor
  • 0 kudos

You would use a combination of the functions: pyspark.sql.functions.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss') (documentation) and pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss') (documentation) from pysp...

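A short sketch of the unix_timestamp / from_unixtime combination described in the reply, applied to the example value from the question; the column names are made up.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("6/3/2019 5:06:00 AM",)], ["event_time_str"])

# Parse the 12-hour string ('a' matches AM/PM), then format it back out in 24-hour form.
df = df.withColumn(
    "event_time_24h",
    F.from_unixtime(F.unix_timestamp("event_time_str", "M/d/yyyy h:mm:ss a"), "MM-dd-yyyy HH:mm:ss"),
)

# If an actual timestamp column is preferred over a formatted string, use to_timestamp instead.
df = df.withColumn("event_ts", F.to_timestamp("event_time_str", "M/d/yyyy h:mm:ss a"))
df.show(truncate=False)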
MithuWagh
by New Contributor
  • 5687 Views
  • 1 reply
  • 0 kudos

How to deal with a column name containing a .(dot) in a PySpark dataframe?

We are streaming data from a Kafka source as JSON, but in some columns we are getting a .(dot) in the column names. Streaming JSON data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`") (a short sketch follows this post).

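A small sketch of the backtick trick on a dataframe with dotted column names (the names below are made up to mirror the flattened Kafka JSON), plus a one-off rename that avoids backticks downstream:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", "India")], ["payload.TARGET", "payload.COUNTRY"])

# Backticks stop Spark from treating the dot as a struct-field accessor.
df.select(F.col("`payload.TARGET`")).show()

# Renaming the columns once avoids having to backtick every later reference.
renamed = df.toDF(*[c.replace(".", "_") for c in df.columns])
renamed.printSchema()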
KrisMusial
by New Contributor
  • 5545 Views
  • 2 replies
  • 0 kudos

Resolved! Saving to parquet with SaveMode.Overwrite throws exception

Hello, I'm trying to save DataFrame in parquet with SaveMode.Overwrite with no success. I minimized the code and reproduced the issue with the following two cells: > case class MyClass(val fld1: Integer, val fld2: Integer) > > val lst1 = sc.paralle...

Latest Reply
Guru421421
New Contributor II
  • 0 kudos

results.select("ValidationTable", "Results","Description","CreatedBy","ModifiedBy","CreatedDate","ModifiedDate").write.mode('overwrite').save("

1 More Reply
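The question is in Scala and the visible reply is truncated, so here is a hedged PySpark sketch of an overwrite write with made-up values mirroring the two integer fields of MyClass. One common cause of exceptions with SaveMode.Overwrite is overwriting the same path the job is still reading from; writing to a separate location (or caching the input first) avoids that.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2), (3, 4)], ["fld1", "fld2"])

path = "/tmp/myclass_parquet_demo"  # hypothetical output path
df.write.mode("overwrite").parquet(path)

spark.read.parquet(path).show()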
NandhaKumar
by New Contributor II
  • 3912 Views
  • 3 replies
  • 0 kudos

How to specify multiple files in --py-files in the spark-submit command for a Databricks job? All the files to be specified in --py-files are present in dbfs:.

I have created a Databricks workspace in Azure and a cluster for Python 3. I am creating a job using spark-submit parameters. How do I specify multiple files in --py-files in the spark-submit command for a Databricks job? All the files to be specified in ...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Nandha Kumar, please go through the docs below on passing Python files to a job (a hedged sketch follows this post): https://docs.databricks.com/dev-tools/api/latest/jobs.html#sparkpythontask

2 More Replies
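For the spark-submit route specifically, --py-files takes a single comma-separated list of paths rather than repeated flags. Below is a hedged sketch of a job definition carrying such parameters; the field names follow the Jobs API linked in the reply but should be verified against your API version, and every path, node type, and version string is a placeholder.

# Hypothetical job-settings payload for the Databricks Jobs API.
job_settings = {
    "name": "example-spark-submit-job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "spark_submit_task": {
        "parameters": [
            "--py-files",
            "dbfs:/FileStore/code/helpers.py,dbfs:/FileStore/code/utils.py",  # comma-separated dependencies
            "dbfs:/FileStore/code/main.py",  # the main script to run
        ]
    },
}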
cfregly
by Contributor
  • 3097 Views
  • 4 replies
  • 0 kudos
Latest Reply
GeethGovindSrin
New Contributor II
  • 0 kudos

@cfregly: For DataFrames, you can use the following code for using groupBy without aggregations: Df.groupBy(Df["column_name"]).agg({})

3 More Replies
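A runnable version of that pattern with made-up data, shown next to the equivalent distinct() form:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["column_name", "value"])

# groupBy with an empty aggregation dict yields just the distinct grouping keys...
df.groupBy(df["column_name"]).agg({}).show()

# ...which is equivalent to selecting the column and deduplicating.
df.select("column_name").distinct().show()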
tourist_on_road
by New Contributor
  • 4978 Views
  • 1 reply
  • 0 kudos

How to read binary data in pyspark

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark. from io import StringIO import array img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106) def mapper(featur...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @tourist_on_road, please go through the Spark docs below (a hedged sketch using binaryRecords follows this post): https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles

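The question already uses sc.binaryRecords, which splits the file into fixed-length records. A hedged sketch of parsing those records follows; the record layout (a 10-byte id followed by 1024 little-endian floats, 10 + 1024*4 = 4106 bytes) is only an assumption chosen to match the record length in the question, so adjust it to the real layout of the file.

import struct
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

RECORD_LENGTH = 4106  # record length used in the question

def parse_record(record):
    # Assumed layout: 10-byte ASCII id, then 1024 little-endian 4-byte floats.
    item_id = record[:10].decode("ascii", errors="replace")
    features = struct.unpack("<1024f", record[10:])
    return item_id, features

records = sc.binaryRecords("s3://bucket/image_features.b", RECORD_LENGTH)  # path from the question
print(records.map(parse_record).take(1))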