Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

cfregly
by Contributor
  • 5331 Views
  • 5 replies
  • 0 kudos
Latest Reply
srisre111
New Contributor II
  • 0 kudos

I am trying to store a dataframe as a table in Databricks and am encountering the following error; can someone help? "TypeError: field date: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>"
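
This error usually means Spark's schema inference found both strings and doubles in the same field, which commonly happens when converting a pandas dataframe whose column holds mixed Python types. A minimal sketch of one fix, assuming the dataframe starts in pandas and "date" is the offending column (the table name is hypothetical):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame({"date": ["20200101", 20200102.0]})  # a column mixing str and float triggers the merge error
pdf["date"] = pdf["date"].astype(str)                   # force one uniform type before conversion
df = spark.createDataFrame(pdf)                         # inference now sees a single type
df.write.mode("overwrite").saveAsTable("my_table")      # hypothetical table name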

4 More Replies
dhanunjaya
by New Contributor II
  • 7679 Views
  • 6 replies
  • 0 kudos

How to remove empty rows from a data frame.

Let's say a data frame has 10 columns, and all 10 columns are empty for 100 of its 200 rows. How can I skip those empty rows?

Latest Reply
GaryDiaz
New Contributor II
  • 0 kudos

You can try this: df.na.drop(how = "all"). This removes a row only if all of its columns are null or NaN.
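
A runnable sketch of that suggestion with a toy frame (the column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (None, None), (2, "b")],
    ["id", "name"],
)
df.na.drop(how="all").show()  # the all-null middle row is dropped; rows 1 and 3 survive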

5 More Replies
AlaQabaja
by New Contributor II
  • 5068 Views
  • 3 replies
  • 0 kudos

Get the last modified or created date for an Azure blob container

Hi Everyone, I am trying to implement a way in Python to only read files that weren't loaded since the last run of my notebook. The way I am thinking of implementing this is to keep track of the last time my notebook finished in a database table. Nex...
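
A minimal sketch of that idea with the azure-storage-blob v12 client; the connection string, container name, and where the watermark is stored are all assumptions:

from datetime import datetime, timezone
from azure.storage.blob import ContainerClient

last_run = datetime(2020, 1, 1, tzinfo=timezone.utc)   # read this from your tracking table
container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="input")     # placeholders
new_blobs = [b.name for b in container.list_blobs()
             if b.last_modified > last_run]            # only blobs modified since the last run
# process new_blobs, then write the current UTC time back to the tracking table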

2 More Replies
smanickam
by New Contributor II
  • 15996 Views
  • 5 replies
  • 3 kudos

com.databricks.sql.io.FileReadException: Error while reading file dbfs:

I ran the statement below and got this error:
%python
data = sqlContext.read.parquet("/FileStore/tables/ganesh.parquet")
display(data)
Error: SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 1 times, most recent failure:...

Latest Reply
MatthewSzafir
New Contributor III
  • 3 kudos

I'm having a similar issue reading a JSON file. It is ~550MB compressed and is on a single line: val cfilename = "c_datafeed_20200128.json.gz" val events = spark.read.json(s"/mnt/c/input1/$cfilename") display(events) The filename is correct and t...

4 More Replies
AnaDel_Campo_Me
by New Contributor
  • 11294 Views
  • 2 replies
  • 1 kudos

FileNotFoundError: [Errno 2] No such file or directory or IsADirectoryError: [Errno 21] Is a directory

I have been trying to open a file on DBFS using all different combinations: if I use the following code: with open("/dbfs/FileStore/df/Downloadedfile.csv", 'r', newline='') as f I get IsADirectoryError: [Errno 21] Is a directory. With open("dbfs:...

Latest Reply
paulmark
New Contributor II
  • 1 kudos

To get rid of this error you can try using Python's file-exists checks to confirm that Python can at least see the file. In other words, you can make sure that the user has indeed typed a correct path to a real existing file. If the user do...
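
A quick check along those lines, using the path from the question; on Databricks the usual culprit is that Spark wrote a directory of part files rather than a single CSV:

import os

path = "/dbfs/FileStore/df/Downloadedfile.csv"
print(os.path.exists(path), os.path.isfile(path), os.path.isdir(path))
# If isdir(path) is True, open one of the part-*.csv files inside the
# directory instead of the directory itself.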

1 More Replies
Seenu45
by New Contributor II
  • 5789 Views
  • 3 replies
  • 1 kudos

Resolved! 'JavaPackage' object is not callable :: Mleap

Hi Folks, We are working on a production Databricks project using Mleap. When we run the code below on Databricks, it throws an error like "'JavaPackage' object is not callable". Code: import mleap.pyspark from mleap.pyspark.spark_support import SimpleSparkSer...

Latest Reply
Seenu45
New Contributor II
  • 1 kudos

Thanks syamspr. It is working now.

2 More Replies
pepevo
by New Contributor III
  • 14173 Views
  • 10 replies
  • 0 kudos

Resolved! How to convert a column from decimal to date in Spark SQL

I need to convert a column from decimal to date in Spark SQL when the format is not yyyy-MM-dd. A table contains a column declared as decimal(38,0) whose data is in yyyyMMdd format, and I am unable to run SQL queries on it in a Databricks notebook. ...
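
One way to do the conversion from PySpark, sketched with an assumed table and column name (date_col); the decimal is cast to a string and parsed with the yyyyMMdd pattern:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_table")  # hypothetical table
df = df.withColumn(
    "date_col",
    F.to_date(F.col("date_col").cast("string"), "yyyyMMdd"),  # decimal(38,0) -> "20190603" -> date
)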

Latest Reply
pepevo
New Contributor III
  • 0 kudos

Thank you, Tom. I made it work already.

9 More Replies
ArielHerrera
by New Contributor II
  • 17388 Views
  • 5 replies
  • 2 kudos

Resolved! How to display SHAP plots?

I am looking to display SHAP plots, here is the code:
import xgboost
import shap
shap.initjs()  # load JS visualization code to notebook
X,y = shap.datasets.boston()
# train XGBoost model
model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatri...

Latest Reply
lrnzcig
New Contributor II
  • 2 kudos

As @Vinh dqvinh87 noted, the accepted solution only works for force_plot. For other plots, the following trick works for me:
import matplotlib.pyplot as plt
p = shap.summary_plot(shap_values, test_df, show=False)
display(p)

4 More Replies
Muthu145
by New Contributor
  • 13243 Views
  • 3 replies
  • 0 kudos

KNN classifier on Spark

Hi Team, Can you please help me implement a KNN classifier in PySpark using a distributed architecture to process the dataset? I also want to validate the KNN model with the testing dataset. I tried to use scikit-learn but the program is runn...

Latest Reply
SouravSaha
New Contributor II
  • 0 kudos

Hey, how about using the NEC Frovedis (https://github.com/frovedis/frovedis) framework for the same? Refer: https://github.com/frovedis/frovedis/blob/master/src/foreign_if/python/examples/unsupervised_knn_demo.py It works on a distributed framework (...

2 More Replies
zhaoxuan210
by New Contributor
  • 22017 Views
  • 1 reply
  • 0 kudos

How can I read all the files in a folder on S3 into several pandas dataframes?

import pandas as pd
import glob
path = "s3://somewhere/"  # use your path
all_files = glob.glob(path + "/*.csv")
print(all_files)
li = []
for filename in all_files:
    dfi = pd.read_csv(filename, names=['acct_id', 'SOR_ID'], dtype={'acct_id': str,...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @zhaoxuan210, please go through the answer below: https://stackoverflow.com/questions/52855221/reading-multiple-csv-files-from-s3-bucket-with-boto3
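
The underlying problem is that glob.glob only walks the local filesystem, so it returns an empty list for an s3:// path. A sketch with boto3 instead (the bucket name and column names are taken from the question; pandas needs s3fs installed to read s3:// URLs):

import boto3
import pandas as pd

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="somewhere")  # first 1000 keys; use a paginator for more
frames = [
    pd.read_csv(f"s3://somewhere/{obj['Key']}",
                names=["acct_id", "SOR_ID"], dtype={"acct_id": str})
    for obj in resp.get("Contents", [])
    if obj["Key"].endswith(".csv")
]
df = pd.concat(frames, ignore_index=True)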

Van-DuyetLe
by New Contributor
  • 35630 Views
  • 5 replies
  • 3 kudos

What's the difference between Interactive Clusters and Job Cluster?

I am new to Databricks. I would like to know the difference between Interactive Clusters and Job Clusters. There is no official documentation at the moment.

4 More Replies
User16301467532
by New Contributor II
  • 21543 Views
  • 9 replies
  • 1 kudos

How can I change the parquet compression algorithm from gzip to something else?

Spark, by default, uses gzip to store parquet files. I would like to change the compression algorithm from gzip to snappy or lz4.

Latest Reply
ZhenZeng
New Contributor II
  • 1 kudos

spark.sql("set spark.sql.parquet.compression.codec=snappy");
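
For reference, the same setting from PySpark plus a per-write alternative; "snappy" is shown because the question asks for it, "lz4" is also accepted, and df and the output path are placeholders:

spark.conf.set("spark.sql.parquet.compression.codec", "snappy")  # session-wide default
df.write.option("compression", "snappy").parquet("/tmp/out")     # overrides it for this one write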

8 More Replies
Venkata_Krishna
by New Contributor
  • 9222 Views
  • 1 reply
  • 0 kudos

Convert a string dataframe column MM/dd/yyyy hh:mm:ss AM/PM to a timestamp MM-dd-yyyy hh:mm:ss

How can I convert the string 6/3/2019 5:06:00 AM to a timestamp in 24-hour format MM-dd-yyyy hh:mm:ss in Python Spark?

Latest Reply
lee
Contributor
  • 0 kudos

You would use a combination of the functions pyspark.sql.functions.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss') and pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss'): from pysp...
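
A sketch of the conversion with to_timestamp and date_format instead (the column name ts_str is assumed; to_timestamp needs Spark 2.2+):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("6/3/2019 5:06:00 AM",)], ["ts_str"])
df = df.withColumn("ts", F.to_timestamp("ts_str", "M/d/yyyy h:mm:ss a"))   # parse 12-hour AM/PM
df = df.withColumn("ts_24h", F.date_format("ts", "MM-dd-yyyy HH:mm:ss"))   # render as 24-hour
df.show(truncate=False)  # ts_24h: 06-03-2019 05:06:00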

MithuWagh
by New Contributor
  • 7116 Views
  • 1 reply
  • 0 kudos

How to deal with a column name containing a .(dot) in a PySpark dataframe?

We are streaming data from a Kafka source as JSON, but in some columns we are getting a .(dot) in the column names. Streaming JSON data: df1 = df.selectExpr("CAST(value AS STRING)") {"pNum":"A14","from":"telecom","payload":{"TARGET":"1","COUNTRY":"India"...

Latest Reply
shyam_9
Databricks Employee
  • 0 kudos

Hi @Mithu Wagh, you can use backticks to enclose the column name: df.select("`col0.1`")
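
A runnable illustration; aliasing the dotted column once means the rest of the pipeline never has to quote it again:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A14",)], ["col0.1"])   # toy frame with a dotted column name
df.select(F.col("`col0.1`").alias("col0_1")).show()  # backticks select it literally; the alias drops the dot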

