Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by Nandini, New Contributor II
  • 10018 Views
  • 10 replies
  • 7 kudos

Pyspark: You cannot use dbutils within a spark job

I am trying to parallelise the execution of file copy in Databricks, and making use of multiple executors is one way. This is the piece of code that I wrote in PySpark: def parallel_copy_execution(src_path: str, target_path: str): files_in_path = db...

Latest Reply
Etyr
Contributor
  • 7 kudos

If you have a Spark session, you can use Spark's underlying Hadoop FileSystem: # Get FileSystem from SparkSession fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration()) # Get Path class to convert string path to FS path path = spark._...

9 More Replies
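
Expanding on the truncated reply above, here is a minimal sketch of the approach: drive the JVM Hadoop FileSystem from the Spark session instead of calling dbutils inside a Spark job. The paths and flat source layout are illustrative assumptions, not from the thread.

```python
# Sketch of the reply's approach: copy files through Spark's underlying
# Hadoop FileSystem instead of dbutils (which cannot be called from code
# running inside a Spark job). Paths are illustrative placeholders.
hadoop = spark._jvm.org.apache.hadoop.fs
conf = spark._jsc.hadoopConfiguration()
fs = hadoop.FileSystem.get(conf)

src = hadoop.Path("dbfs:/tmp/source")
dst = hadoop.Path("dbfs:/tmp/target")

for status in fs.listStatus(src):
    # FileUtil.copy(srcFS, srcPath, dstFS, dstPath, deleteSource, conf)
    hadoop.FileUtil.copy(fs, status.getPath(), fs, dst, False, conf)
```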
by GuMart, New Contributor III
  • 1722 Views
  • 2 replies
  • 1 kudos

Delta Live Tables - RETRY_ON_FAILURE

Hi, is it possible to set the RETRY_ON_FAILURE property for DLT pipelines through the API? I'm not finding it in the docs (although it seems to exist in a response payload): https://docs.databricks.com/delta-live-tables/api-guide.html

Latest Reply
GuMart
New Contributor III
  • 1 kudos

Hi @Suteja Kanuri, thank you so much for the quick and complete answer! Regards,

1 More Replies
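
The accepted answer isn't shown, but one way to check which settings the payload actually exposes is to fetch the pipeline and inspect its spec via the Pipelines API. A sketch; the host, token, and pipeline ID are placeholders.

```python
# Sketch: fetch a DLT pipeline's settings via the Pipelines API and inspect
# the returned spec for retry-related fields. All credentials and IDs below
# are placeholders.
import requests

host = "https://<workspace-host>"
token = "<personal-access-token>"
pipeline_id = "<pipeline-id>"

resp = requests.get(
    f"{host}/api/2.0/pipelines/{pipeline_id}",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
print(resp.json().get("spec"))  # the settings payload as the API returns it
```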
by alm, New Contributor III
  • 3759 Views
  • 2 replies
  • 2 kudos

Resolved! Vectorized reading of parquet file containing decimal type column(s)

I was trying to read a parquet file that contains decimal type columns and write it to a Delta table. I encountered a problem that is pretty neatly described by this kb.databricks article, and which I solved by disabling the vector...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Alberte Mørk: The behavior you observed is due to a known issue in Apache Spark when vectorized reading is used with Parquet files that contain decimal type columns. As you mentioned, the issue can be resolved by disabling vectorized reading for th...

1 More Replies
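
For reference, the fix described above comes down to a single Spark config. A minimal sketch; the paths are placeholders.

```python
# Sketch of the workaround described above: disable the vectorized Parquet
# reader (a standard Spark config key), then read and write as usual.
# Paths are placeholders.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

df = spark.read.parquet("dbfs:/tmp/source_parquet")
df.write.format("delta").mode("overwrite").save("dbfs:/tmp/target_delta")
```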
by Anonymous (Not applicable)
  • 1126 Views
  • 2 replies
  • 2 kudos

Hello Everyone, I'm interested to learn about the certifications you're pursuing to enhance your skills. Sharing your goals can inspire those ...

Hello everyone, I'm interested to learn about the certifications you're pursuing to enhance your skills. Sharing your goals can inspire those who may have started their certification journey but struggled with motivation. Personally, I recently comple...

Latest Reply
FJ
Contributor III
  • 2 kudos

I'm taking the Data Engineering Professional exam at the end of the month. It's like a shot in the dark because no practice exams are available, and from what I've seen online from people who already passed it, the Advanced Data Engineering with ...

1 More Replies
by Anonymous (Not applicable)
  • 6042 Views
  • 8 replies
  • 0 kudos

Not able to connect to On-Prem Oracle from Databricks cluster

Hi everyone, I was trying to connect to an Oracle instance from a Databricks cluster and it is giving the error below: java.sql.SQLTimeoutException: ORA-12170: Cannot connect. TCP connect timeout of 30000ms for host xx.x.x.*** port 1521. (CONNECTION_ID=CgM7V7UB...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Satya89: The error message you received indicates that the TCP connection to the Oracle database timed out. This could be caused by a number of factors, such as network issues, firewall restrictions, or the database being overloaded. Here are a few ste...

7 More Replies
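
Since the truncated reply points at networking first, a quick way to separate network problems from JDBC problems is a raw socket test from the cluster driver. A sketch; the host and port are placeholders.

```python
# Sketch: test raw TCP reachability of the Oracle listener from the cluster
# driver before debugging the JDBC layer. Host and port are placeholders.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(10)
try:
    sock.connect(("oracle-host.example.com", 1521))
    print("TCP connect succeeded; investigate JDBC/Oracle configuration next")
except OSError as exc:
    print(f"TCP connect failed: {exc} (check firewall, VPN/peering, NSG rules)")
finally:
    sock.close()
```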
by rusty9876543, New Contributor II
  • 5764 Views
  • 5 replies
  • 2 kudos

Split dataFrame into 1MB chunks and create a single json array with each row in chunk being an array element

Hi, I have a DataFrame that I've been able to convert into a struct, with each row being a JSON object. I want the ability to split the DataFrame into 1MB chunks. Once I have the chunks, I would like to add all rows in each respective chunk into a sin...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Tamoor Mirza: You can use the toJSON method of a DataFrame to convert each row to a JSON string, and then append those JSON strings to a list. Here is an example code snippet that splits a DataFrame into 1MB chunks and creates a list of JSON arr...

4 More Replies
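
The reply's snippet is cut off; here is a sketch of the greedy packing it describes, assuming the rows fit in driver memory after collect.

```python
# Sketch of the reply's approach: serialize each row to JSON, then greedily
# pack rows into ~1 MB chunks, each wrapped as a JSON array.
# Assumes the collected rows fit in driver memory.
rows = df.toJSON().collect()  # one JSON string per row

max_bytes = 1024 * 1024  # 1 MB target per chunk
chunks, current, current_size = [], [], 0

for row in rows:
    row_size = len(row.encode("utf-8"))
    if current and current_size + row_size > max_bytes:
        chunks.append("[" + ",".join(current) + "]")
        current, current_size = [], 0
    current.append(row)
    current_size += row_size

if current:
    chunks.append("[" + ",".join(current) + "]")
```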
by Hansjoerg, New Contributor
  • 1490 Views
  • 2 replies
  • 0 kudos

Resolved! Is Azure AD Conditional Access also possible for the Databricks Account Console?

I wonder whether conditional access in Azure AD for Databricks (https://learn.microsoft.com/en-us/azure/databricks/administration-guide/access-control/conditional-access?source=docs) can be configured separately for the account console (https://accou...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Hansjörg Wingeier, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ans...

1 More Replies
by tototox, New Contributor III
  • 7160 Views
  • 3 replies
  • 0 kudos

Using dbutils.fs.ls gives overlap error.

I created a schema with this path as its managed location (abfss://~~@~~.dfs.core.windows.net/dejeong), and an external table named 'first_table' was created under that path (abfss://~~@~~.dfs.core.windows.net/dejeong/first_table). The results ...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @jin park, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we c...

2 More Replies
by Pien, New Contributor II
  • 4077 Views
  • 2 replies
  • 0 kudos

Resolved! Change data format in an existing DB table

I got errors about incompatible file types while converting to a PySpark DataFrame, so I changed all columns to string types. Now I'm trying to add this DataFrame to an existing table (where not everything was a string type), and I'm getting an error of incompatible da...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Pien Derkx, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we...

1 More Replies
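
The accepted answer isn't shown, but the usual fix for this situation is to cast the all-string DataFrame back to the target table's schema before appending. A sketch; the table name is hypothetical.

```python
# Sketch: align the all-string DataFrame with the existing table's schema by
# casting each column to the target type before appending.
# "my_db.my_table" is a hypothetical name.
from pyspark.sql.functions import col

target_schema = spark.table("my_db.my_table").schema
df_aligned = df.select(
    [col(field.name).cast(field.dataType) for field in target_schema.fields]
)
df_aligned.write.mode("append").saveAsTable("my_db.my_table")
```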
by kaileena, New Contributor
  • 1991 Views
  • 2 replies
  • 0 kudos

Error in library(RMySQL): there is no package called ‘RMySQL’

I tried to install RMySQL on Databricks like this: install.packages("RMySQL"). I got this error: Installing package into ‘/local_disk0/.ephemeral_nfs/envs/rEnv-c677bc4c-e6a3-40df-a5ab-bfd5d277e0c0’ (as ‘lib’ is unspecified) Warning: unable to access inde...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @miru miro, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...

1 More Replies
by hvsk, New Contributor
  • 10708 Views
  • 2 replies
  • 0 kudos

Using a Virtual environment

Hi all, we are working on training NHits/TFT (a PyTorch Forecasting implementation) for time series forecasting. However, we are having some issues with package dependency conflicts. Is there a way to consistently use a virtual environment across cells ...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Harsh Kalra, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so w...

1 More Replies
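
The resolution isn't shown in the thread, but the usual answer on Databricks is notebook-scoped libraries, which apply to every cell of the notebook. A sketch; the package choices are illustrative.

```python
# Sketch: notebook-scoped libraries isolate Python dependencies per notebook
# across all of its cells, avoiding cluster-wide conflicts.
# Package choices below are illustrative.
%pip install pytorch-forecasting pytorch-lightning

# If the install changed preinstalled packages, restart the Python process so
# subsequent cells pick up the new versions.
dbutils.library.restartPython()
```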
by JLCDA, New Contributor
  • 2083 Views
  • 2 replies
  • 0 kudos

databricks-connect 9.1 : StreamCorruptedException: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe

Hello, I'm using databricks-connect 9.1 and I started having issues last week in all functions that have a collect(). Everything was working before: myList = df1.select("id").rdd.flatMap(lambda x: x).collect(). Here is the error: py4j.protocol.P...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Julien Larcher, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answer...

1 More Replies
by hfrid, New Contributor II
  • 4510 Views
  • 1 reply
  • 2 kudos

JDBC connector seems to be a bottleneck when trying to insert dataframe to Azure SQL Server

Hi! I am inserting a PySpark DataFrame into Azure SQL Server and it takes a very long time. The database is an S4, but my DataFrame of 17 million rows and 30 columns takes up to 50 minutes to insert. Is there a way to significantly speed this up? I a...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Hjalmar Friden: There are several ways to improve the performance of inserting data into Azure SQL Server using the JDBC connector. Increase the batch size: by default, the JDBC connector sends data in batches of 1000 rows at a time. You can increase th...

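
The reply is truncated at its first suggestion; here is a sketch of what a tuned JDBC write can look like. The connection details are placeholders; "batchsize" and the repartition count are the knobs being discussed.

```python
# Sketch of the tuning described above: raise the JDBC batch size (default
# 1000 rows) and write with several partitions so inserts run in parallel.
# Connection details are placeholders.
(
    df.repartition(8)  # one concurrent JDBC connection per partition
    .write.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("dbtable", "dbo.target_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("batchsize", 10000)  # larger batches mean fewer round trips
    .mode("append")
    .save()
)
```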
by Anonymous (Not applicable)
  • 6613 Views
  • 1 reply
  • 1 kudos

Testing framework using Databricks Notebook and Pytest.

Hi friends, I am designing a testing framework using Databricks and pytest. I am currently stuck with report generation, which is generating a blank report with only default parameters, for example: <testsuites><testsuite name="pytest" errors="0" failures="0" skippe...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Vijaya Palreddy: There are several testing frameworks available for data testing that you can consider using with Databricks and pytest. Great Expectations is an open-source framework that provides a simple way to create and main...

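
For context on the report question itself: a common pattern is to invoke pytest programmatically from a notebook and write a JUnit XML report, and a report with only default attributes usually means pytest collected no tests. A sketch; the paths are placeholders.

```python
# Sketch: run pytest from a Databricks notebook and emit a JUnit XML report.
# A report with only default attributes usually means no tests were collected
# (check test_*.py file names and test_* function names). Paths are placeholders.
import pytest

exit_code = pytest.main([
    "/dbfs/tmp/tests",                # directory containing test_*.py files
    "--junitxml=/dbfs/tmp/report.xml",
    "-v",
])
print(f"pytest exit code: {exit_code}")
```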
by gary7135, New Contributor II
  • 1274 Views
  • 1 reply
  • 0 kudos

Unable to use GridsearchCV from spark-sklearn due to 'fit_params' error

When using GridSearchCV from spark-sklearn, I got an "__init__() got an unexpected keyword argument 'fit_params'" error. I am using sklearn 1.2.2 and spark-sklearn 0.3.0. I think this is because spark-sklearn's GridSearchCV still has the f...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Gary Mu: Yes, you are correct. The error message you are seeing is likely due to the fact that the fit_params parameter is no longer accepted by GridSearchCV in sklearn 1.2.2. One possible solution is to use a different version of scikit-learn that is co...

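
Beyond pinning an older scikit-learn, an alternative worth noting (not from this thread) is to keep scikit-learn's own GridSearchCV and distribute it over the cluster with the joblib-spark backend, since spark-sklearn is no longer maintained. A sketch, assuming the joblibspark package is installed on the cluster.

```python
# Sketch: distribute scikit-learn's native GridSearchCV over Spark using the
# joblib-spark backend, instead of the unmaintained spark-sklearn wrapper.
# Assumes `pip install joblibspark` on the cluster.
from joblib import parallel_backend
from joblibspark import register_spark
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

register_spark()  # registers the "spark" joblib backend

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

with parallel_backend("spark", n_jobs=4):
    search = GridSearchCV(SVC(), param_grid, cv=3)
    search.fit(X, y)

print(search.best_params_)
```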
