Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

AnaDel_Campo_Me
by New Contributor
  • 10340 Views
  • 2 replies
  • 1 kudos

FileNotFoundError: [Errno 2] No such file or directory or IsADirectoryError: [Errno 21] Is a directory

I have been trying to open a file on DBFS using all different combinations. If I use the following code: with open("/dbfs/FileStore/df/Downloadedfile.csv", 'r', newline='') as f I get IsADirectoryError: [Errno 21] Is a directory; with open("dbfs:...

Latest Reply
paulmark
New Contributor II
  • 1 kudos

To get rid of this error, you can try Python's file-existence checks so that at least Python can confirm whether the file exists. In other words, you can make sure that the user has indeed typed a correct path to a real existing file (see the sketch after this post). If the user do...

1 More Reply
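A minimal sketch of the existence check suggested in the reply above, assuming the cluster exposes the DBFS FUSE mount at /dbfs; the path is the one from the question and will differ in your workspace.

import os

path = "/dbfs/FileStore/df/Downloadedfile.csv"  # path from the question; adjust to your workspace

if not os.path.exists(path):
    # open() would raise FileNotFoundError here: the path is wrong or the file was never uploaded
    print("Path does not exist:", path)
elif os.path.isdir(path):
    # open() would raise IsADirectoryError here: the "file" is really a folder,
    # e.g. when a distributed write produced a directory of part files
    print("Path is a directory, contents:", os.listdir(path))
else:
    with open(path, "r", newline="") as f:
        print(f.readline())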
Seenu45
by New Contributor II
  • 5011 Views
  • 3 replies
  • 1 kudos

Resolved! 'JavaPackage' object is not callable :: MLeap

Hi folks, we are working on a production Databricks project using MLeap. When we run the code below on Databricks, it throws an error like "'JavaPackage' object is not callable". Code: import mleap.pyspark from mleap.pyspark.spark_support import SimpleSparkSer...

Latest Reply
Seenu45
New Contributor II
  • 1 kudos

Thanks syamspr. It is working now.

2 More Replies
pepevo
by New Contributor III
  • 11177 Views
  • 10 replies
  • 0 kudos

Resolved! How to convert column type from decimal to date in sparksql

I need to convert a column type from decimal to date in Spark SQL when the format is not yyyy-mm-dd. A table contains a column declared as decimal(38,0) whose data is in yyyymmdd format, and I am unable to run SQL queries on it in a Databricks notebook. ...

Latest Reply
pepevo
New Contributor III
  • 0 kudos

Thank you Tom. I made it work already.

9 More Replies
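The accepted fix is not visible in this preview, so here is a hedged sketch of one common way to handle a decimal(38,0) column holding yyyymmdd values: cast it to a string and parse it with to_date. The column name and sample values below are made up; in SQL the equivalent is to_date(cast(col AS string), 'yyyyMMdd').

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: a decimal(38,0) column holding yyyymmdd values.
df = spark.createDataFrame([(20190603,), (20181231,)], ["dt_dec"])
df = df.withColumn("dt_dec", F.col("dt_dec").cast(DecimalType(38, 0)))

# Cast the decimal to string, then parse it with the yyyyMMdd pattern.
df = df.withColumn("dt", F.to_date(F.col("dt_dec").cast("string"), "yyyyMMdd"))
df.show()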
ArielHerrera
by New Contributor II
  • 14805 Views
  • 5 replies
  • 2 kudos

Resolved! How to display SHAP plots?

I am looking to display SHAP plots, here is the code: import xgboost import shap shap.initjs() # load JS visualization code to notebook X, y = shap.datasets.boston() # train XGBoost model model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatri...

Latest Reply
lrnzcig
New Contributor II
  • 2 kudos

As @Vinh dqvinh87 noted, the accepted solution only works for force_plot. For other plots, the following trick works for me: import matplotlib.pyplot as plt p = shap.summary_plot(shap_values, test_df, show=False) display(p)

4 More Replies
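A self-contained version of that trick, as a sketch only: shap.datasets.boston() has been removed from newer shap releases (substitute any regression dataset), and display() is the Databricks notebook helper, so the last line only works inside a notebook.

import matplotlib.pyplot as plt
import shap
import xgboost

# Train a small model so there is something to explain.
X, y = shap.datasets.boston()
model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatrix(X, label=y), 100)

# Draw the summary plot without showing it, then hand the current figure to display().
shap_values = shap.TreeExplainer(model).shap_values(X)
shap.summary_plot(shap_values, X, show=False)
display(plt.gcf())  # Databricks notebook helper; not defined in plain Python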
Muthu145
by New Contributor
  • 10519 Views
  • 3 replies
  • 0 kudos

KNN classifier on Spark

Hi Team, can you please help me implement a KNN classifier in PySpark using a distributed architecture and process the dataset? I also want to validate the KNN model against the testing dataset. I tried to use scikit-learn but the program is runn...

Latest Reply
SouravSaha
New Contributor II
  • 0 kudos

Hey, how about using the NEC Frovedis (https://github.com/frovedis/frovedis) framework for the same? Refer: https://github.com/frovedis/frovedis/blob/master/src/foreign_if/python/examples/unsupervised_knn_demo.py It works on a distributed framework (...

2 More Replies
zhaoxuan210
by New Contributor
  • 19125 Views
  • 1 reply
  • 0 kudos

How can I read all the files in a folder on S3 into several pandas dataframes?

import pandas as pd import glob path = "s3://somewhere/" # use your path all_files = glob.glob(path + "/*.csv") print(all_files) li = [] for filename in all_files: dfi = pd.read_csv(filename,names =['acct_id', 'SOR_ID'], dtype={'acct_id':str,...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @zhaoxuan210, please go through the answer below (a hedged sketch of that approach follows this post): https://stackoverflow.com/questions/52855221/reading-multiple-csv-files-from-s3-bucket-with-boto3

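A minimal sketch of the boto3 approach from the linked answer. The bucket and prefix are placeholders, the column names come from the question, and it assumes the cluster already has AWS credentials available (instance profile or environment variables).

import io
import boto3
import pandas as pd

bucket, prefix = "somewhere", "data/"  # hypothetical bucket and prefix; replace with yours

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)

dataframes = []
for obj in response.get("Contents", []):
    key = obj["Key"]
    if not key.endswith(".csv"):
        continue
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    dataframes.append(pd.read_csv(io.BytesIO(body), names=["acct_id", "SOR_ID"], dtype={"acct_id": str}))

# One dataframe per file, as in the question; pd.concat(dataframes) combines them.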
Van-DuyetLe
by New Contributor
  • 25422 Views
  • 5 replies
  • 1 kudos

What's the difference between Interactive Clusters and Job Cluster?

I am new to Databricks. I would like to know the difference between interactive clusters and job clusters; there is no official documentation on this at the moment.

Latest Reply
Forum_Admin
Contributor
  • 1 kudos

Sports news Football news International football news Football news Thai football news, Thai football Follow news, know sports news at Siamsportnews

4 More Replies
User16301467532
by New Contributor II
  • 18168 Views
  • 9 replies
  • 1 kudos

How can I change the parquet compression algorithm from gzip to something else?

Spark, by default, uses gzip to store parquet files. I would like to change the compression algorithm from gzip to snappy or lz4.

Latest Reply
ZhenZeng
New Contributor II
  • 1 kudos

spark.sql("set spark.sql.parquet.compression.codec=gzip");

8 More Replies
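The reply above shows the relevant configuration key with gzip as the value; to move to snappy (or another codec) as the question asks, set the same key, or pass a per-write option. A short sketch with placeholder paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)  # tiny example dataframe

# Session-wide default codec for all parquet writes on this SparkSession.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.mode("overwrite").parquet("/tmp/parquet_snappy_demo")

# Or choose the codec for a single write without changing the session default
# (lz4/zstd availability depends on the Spark and Parquet versions on the cluster).
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/parquet_demo2")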
Venkata_Krishna
by New Contributor
  • 8119 Views
  • 1 reply
  • 0 kudos

convert string dataframe column MM/dd/yyyy hh:mm:ss AM/PM to timestamp MM-dd-yyyy hh:mm:ss

How do I convert the string 6/3/2019 5:06:00 AM to a timestamp in 24-hour format MM-dd-yyyy hh:mm:ss in Python Spark?

Latest Reply
lee
Contributor
  • 0 kudos

You would use a combination of the functions: pyspark.sql.functions.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss') (documentation) and pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss') (documentation) from pysp...

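A short sketch of the unix_timestamp / from_unixtime combination described in the reply, applied to the example value from the question; the column names are made up.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("6/3/2019 5:06:00 AM",)], ["event_time_str"])

# Parse the 12-hour string ('a' matches AM/PM), then format it back out in 24-hour form.
df = df.withColumn(
    "event_time_24h",
    F.from_unixtime(F.unix_timestamp("event_time_str", "M/d/yyyy h:mm:ss a"), "MM-dd-yyyy HH:mm:ss"),
)

# If an actual timestamp column is preferred over a formatted string, use to_timestamp instead.
df = df.withColumn("event_ts", F.to_timestamp("event_time_str", "M/d/yyyy h:mm:ss a"))
df.show(truncate=False)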
MithuWagh
by New Contributor
  • 5687 Views
  • 1 reply
  • 0 kudos

How to deal with a column name containing a .(dot) in a PySpark dataframe?

We are streaming data from a Kafka source as JSON, but in some columns we are getting a .(dot) in the column names. Streaming JSON data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`") (a short sketch follows this post).

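A small sketch of the backtick trick on a dataframe with dotted column names (the names below are made up to mirror the flattened Kafka JSON), plus a one-off rename that avoids backticks downstream:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", "India")], ["payload.TARGET", "payload.COUNTRY"])

# Backticks stop Spark from treating the dot as a struct-field accessor.
df.select(F.col("`payload.TARGET`")).show()

# Renaming the columns once avoids having to backtick every later reference.
renamed = df.toDF(*[c.replace(".", "_") for c in df.columns])
renamed.printSchema()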
KrisMusial
by New Contributor
  • 5545 Views
  • 2 replies
  • 0 kudos

Resolved! Saving to parquet with SaveMode.Overwrite throws exception

Hello, I'm trying to save DataFrame in parquet with SaveMode.Overwrite with no success. I minimized the code and reproduced the issue with the following two cells: > case class MyClass(val fld1: Integer, val fld2: Integer) > > val lst1 = sc.paralle...

Latest Reply
Guru421421
New Contributor II
  • 0 kudos

results.select("ValidationTable", "Results","Description","CreatedBy","ModifiedBy","CreatedDate","ModifiedDate").write.mode('overwrite').save("

1 More Reply
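The question is in Scala and the visible reply is truncated, so here is a hedged PySpark sketch of an overwrite write with made-up values mirroring the two integer fields of MyClass. One common cause of exceptions with SaveMode.Overwrite is overwriting the same path the job is still reading from; writing to a separate location (or caching the input first) avoids that.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2), (3, 4)], ["fld1", "fld2"])

path = "/tmp/myclass_parquet_demo"  # hypothetical output path
df.write.mode("overwrite").parquet(path)

spark.read.parquet(path).show()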
NandhaKumar
by New Contributor II
  • 3912 Views
  • 3 replies
  • 0 kudos

How to specify multiple files in --py-files in the spark-submit command for a Databricks job? All the files to be specified in --py-files are present in dbfs:.

I have created a Databricks workspace in Azure and a cluster for Python 3. I am creating a job using spark-submit parameters. How do I specify multiple files in --py-files in the spark-submit command for a Databricks job? All the files to be specified in ...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Nandha Kumar, please go through the docs below on passing Python files to a job (a hedged sketch follows this post): https://docs.databricks.com/dev-tools/api/latest/jobs.html#sparkpythontask

2 More Replies
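For the spark-submit route specifically, --py-files takes a single comma-separated list of paths rather than repeated flags. Below is a hedged sketch of a job definition carrying such parameters; the field names follow the Jobs API linked in the reply but should be verified against your API version, and every path, node type, and version string is a placeholder.

# Hypothetical job-settings payload for the Databricks Jobs API.
job_settings = {
    "name": "example-spark-submit-job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "spark_submit_task": {
        "parameters": [
            "--py-files",
            "dbfs:/FileStore/code/helpers.py,dbfs:/FileStore/code/utils.py",  # comma-separated dependencies
            "dbfs:/FileStore/code/main.py",  # the main script to run
        ]
    },
}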
cfregly
by Contributor
  • 3097 Views
  • 4 replies
  • 0 kudos
Latest Reply
GeethGovindSrin
New Contributor II
  • 0 kudos

@cfregly: For DataFrames, you can use the following code for using groupBy without aggregations: Df.groupBy(Df["column_name"]).agg({})

3 More Replies
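A runnable version of that pattern with made-up data, shown next to the equivalent distinct() form:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["column_name", "value"])

# groupBy with an empty aggregation dict yields just the distinct grouping keys...
df.groupBy(df["column_name"]).agg({}).show()

# ...which is equivalent to selecting the column and deduplicating.
df.select("column_name").distinct().show()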
tourist_on_road
by New Contributor
  • 4978 Views
  • 1 reply
  • 0 kudos

How to read binary data in pyspark

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark. from io import StringIO import array img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106) def mapper(featur...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @tourist_on_road, please go through the Spark docs below (a hedged sketch using binaryRecords follows this post): https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles

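The question already uses sc.binaryRecords, which splits the file into fixed-length records. A hedged sketch of parsing those records follows; the record layout (a 10-byte id followed by 1024 little-endian floats, 10 + 1024*4 = 4106 bytes) is only an assumption chosen to match the record length in the question, so adjust it to the real layout of the file.

import struct
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

RECORD_LENGTH = 4106  # record length used in the question

def parse_record(record):
    # Assumed layout: 10-byte ASCII id, then 1024 little-endian 4-byte floats.
    item_id = record[:10].decode("ascii", errors="replace")
    features = struct.unpack("<1024f", record[10:])
    return item_id, features

records = sc.binaryRecords("s3://bucket/image_features.b", RECORD_LENGTH)  # path from the question
print(records.map(parse_record).take(1))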