Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

rusty9876543
by New Contributor II
  • 6202 Views
  • 5 replies
  • 2 kudos

Split DataFrame into 1MB chunks and create a single JSON array with each row in the chunk being an array element

Hi, I have a DataFrame that I've been able to convert into a struct with each row being a JSON object. I want the ability to split the data frame into 1MB chunks. Once I have the chunks, I would like to add all rows in each respective chunk into a sin...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Tamoor Mirza: You can use the toJSON method of a DataFrame to convert each chunk to a JSON string, and then append those JSON strings to a list. Here is an example code snippet that splits a DataFrame into 1MB chunks and creates a list of JSON arr...
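The snippet in the reply is cut off above; below is a minimal sketch of the approach it describes, assuming a PySpark DataFrame `df`. The function name and the row-size sampling heuristic are illustrative, not the original code.

```python
def dataframe_to_json_chunks(df, chunk_bytes=1024 * 1024, sample_size=100):
    # Estimate the average serialized row size from a small sample.
    sample = df.limit(sample_size).toJSON().collect()
    avg_row_bytes = max(1, sum(len(r.encode("utf-8")) for r in sample) // max(1, len(sample)))
    rows_per_chunk = max(1, chunk_bytes // avg_row_bytes)

    # toJSON() yields one JSON object string per row; collect() pulls them to
    # the driver, so this only suits data that fits in driver memory.
    rows = df.toJSON().collect()
    return ["[" + ",".join(rows[i:i + rows_per_chunk]) + "]"
            for i in range(0, len(rows), rows_per_chunk)]
```

Each returned string is one JSON array of roughly 1MB whose elements are the rows of that chunk.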

4 More Replies
Hansjoerg
by New Contributor
  • 1635 Views
  • 2 replies
  • 0 kudos

Resolved! Is Azure AD Conditional Access also possible for the Databricks Account Console?

I wonder whether conditional access in Azure AD for Databricks (https://learn.microsoft.com/en-us/azure/databricks/administration-guide/access-control/conditional-access?source=docs) can be configured separately for the account console (https://accou...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Hansjörg Wingeier, Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ans...

1 More Replies
tototox
by New Contributor III
  • 7303 Views
  • 3 replies
  • 0 kudos

Using dbutils.fs.ls gives an overlap error.

I created a schema with that path as a managed location (abfss://~~@~~.dfs.core.windows.net/dejeong), and an external table named 'first_table' was created in the corresponding path (abfss://~~@~~.dfs.core.windows.net/dejeong/first_table). The results ...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @jin park, Hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we c...

2 More Replies
Pien
by New Contributor II
  • 4405 Views
  • 2 replies
  • 0 kudos

Resolved! Change data format in an existing DB table

I got errors of incompatible file types while converting to a PySpark df, so I changed all columns to string types. Now I'm trying to add this df to an existing table (where not everything was a string type), and I'm getting an error of incompatible da...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Pien Derkx, Hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we...

1 More Replies
kaileena
by New Contributor
  • 2133 Views
  • 2 replies
  • 0 kudos

Error in library(RMySQL): there is no package called ‘RMySQL’

I tried to install RMySQL on Databricks like this: install.packages("RMySQL"). I got this error: Installing package into '/local_disk0/.ephemeral_nfs/envs/rEnv-c677bc4c-e6a3-40df-a5ab-bfd5d277e0c0' (as 'lib' is unspecified) Warning: unable to access inde...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @miru miro, Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...

1 More Replies
hvsk
by New Contributor
  • 11323 Views
  • 2 replies
  • 0 kudos

Using a Virtual environment

Hi All, We are working on training NHiTS/TFT (a PyTorch Forecasting implementation) for time-series forecasting. However, we are having some issues with package dependency conflicts. Is there a way to consistently use a virtual environment across cells ...
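The replies in this thread are moderator follow-ups, so as a hedge: one commonly used approach (not taken from the truncated thread) is notebook-scoped libraries via %pip, which apply across every cell of the same notebook and keep the environment isolated from other notebooks on the cluster. The package name below is just an example.

```python
# Notebook-scoped install: the resolved package set applies across all cells
# of this notebook for the current session, without affecting other notebooks.
%pip install pytorch-forecasting
```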

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Harsh Kalra, Hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so w...

1 More Replies
JLCDA
by New Contributor
  • 2191 Views
  • 2 replies
  • 0 kudos

databricks-connect 9.1 : StreamCorruptedException: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe

Hello, I'm using databricks-connect 9.1 and I started having issues last week in all functions that have a collect(). Everything was working before: myList = df1.select("id").rdd.flatMap(lambda x: x).collect(). Here is the error: py4j.protocol.P...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Julien Larcher, Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answer...

1 More Replies
hfrid
by New Contributor II
  • 5027 Views
  • 1 reply
  • 2 kudos

JDBC connector seems to be a bottleneck when trying to insert dataframe to Azure SQL Server

Hi! I am inserting a PySpark dataframe into Azure SQL Server and it takes a very long time. The database is an S4, but my dataframe of 17 million rows and 30 columns takes up to 50 minutes to insert. Is there a way to significantly speed this up? I a...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Hjalmar Friden: There are several ways to improve the performance of inserting data into Azure SQL Server using the JDBC connector: Increase the batch size: By default, the JDBC connector sends data in batches of 1000 rows at a time. You can increase th...
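The reply is truncated; here is a sketch of the knobs it refers to, assuming a plain JDBC write (server, table, and credentials are placeholders):

```python
# Larger batches and more partitions usually cut JDBC insert time substantially:
# each partition opens its own connection, and batchsize controls rows per round trip.
(df.repartition(16)
   .write.format("jdbc")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>")
   .option("dbtable", "dbo.target_table")
   .option("user", "<user>")
   .option("password", "<password>")
   .option("batchsize", 10000)   # default is 1000 rows per batch
   .mode("append")
   .save())
```

For loads of this size, the dedicated Apache Spark connector for SQL Server (format "com.microsoft.sqlserver.jdbc.spark", which supports bulk copy) is often considerably faster than generic JDBC.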

Anonymous
by Not applicable
  • 6747 Views
  • 1 reply
  • 1 kudos

Testing framework using Databricks Notebook and Pytest.

Hi Friends, I am designing a testing framework using Databricks and pytest. I am currently stuck with report generation, which produces a blank report with only default parameters. For example: <testsuites><testsuite name="pytest" errors="0" failures="0" skippe...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Vijaya Palreddy: There are several testing frameworks available for data testing that you can consider using with Databricks and pytest: Great Expectations: Great Expectations is an open-source framework that provides a simple way to create and main...
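On the report-generation side of the question, here is a minimal sketch of driving pytest from a notebook and writing a JUnit XML report (the paths are placeholders); a report containing only default attributes usually means pytest collected no tests at the given path.

```python
import sys
import pytest

# Avoid writing .pyc files, which can fail on read-only repo paths.
sys.dont_write_bytecode = True

# Run tests and emit a JUnit XML report; an "empty" report with only default
# attributes usually means no tests were collected from the given path.
exit_code = pytest.main(["/dbfs/tests", "-v", "--junitxml=/dbfs/reports/results.xml"])
print(f"pytest exited with code {exit_code}")
```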

gary7135
by New Contributor II
  • 1384 Views
  • 1 reply
  • 0 kudos

Unable to use GridSearchCV from spark-sklearn due to 'fit_params' error

When using GridSearchCV from spark-sklearn, I got an "__init__() got an unexpected keyword argument 'fit_params'" error. I am using sklearn 1.2.2 and spark-sklearn 0.3.0. I think this is because spark-sklearn GridSearchCV still has the f...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Gary Mu: Yes, you are correct. The error message you are seeing is likely because the fit_params parameter was deprecated in GridSearchCV in sklearn 1.2.2. One possible solution is to use a different version of scikit-learn that is co...
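A sketch of the direction the reply suggests: with a current scikit-learn, plain sklearn.model_selection.GridSearchCV avoids the removed constructor argument, because fit-time parameters are passed to fit() instead. The estimator and grid below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10]},
    cv=3,
)
# Fit-time keyword arguments (e.g. sample_weight) go to fit(), not to the
# constructor as the old spark-sklearn fit_params argument expected.
search.fit(X, y, sample_weight=np.ones(len(y)))
print(search.best_params_)
```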

709986
by New Contributor
  • 770 Views
  • 1 reply
  • 0 kudos

Not able to connect with Salesforce; we need to read data from Salesforce

Not able to connect with Salesforce; we need to read data from Salesforce. We are getting NoClassDefFoundError: scala/Product$class. Code: %scala val sfDF = spark.read.format("com.springml.spark.salesforce").option("username", "sf...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Amar.Kasar: The error you are getting, NoClassDefFoundError: scala/Product$class, suggests that the Scala classpath is not set up correctly. You can try the following steps to troubleshoot the issue: Check if the library com.springml:spark-salesforc...
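For context, NoClassDefFoundError: scala/Product$class typically means a library built for Scala 2.11 is running on a Scala 2.12 cluster, so the installed Maven coordinate must match the cluster's Scala version. Below is a sketch of the read being attempted, in Python for consistency with the other examples here; the coordinate, credentials, and SOQL are assumptions.

```python
# Assumes the cluster library com.springml:spark-salesforce_2.12:<version> is
# installed; a _2.11 artifact on a Scala 2.12 runtime raises exactly this error.
sf_df = (spark.read
         .format("com.springml.spark.salesforce")
         .option("username", "<sf_username>")
         .option("password", "<sf_password_plus_security_token>")
         .option("soql", "SELECT Id, Name FROM Account")
         .load())
sf_df.show()
```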

cmilligan
by Contributor II
  • 7792 Views
  • 1 reply
  • 0 kudos

Pull query that inserts into table

I'm trying to pull some data down for table history and need to view the query that inserted into a table. My team owns the process, so I'm able to view the current query by just viewing it, but I also want to capture changes over time witho...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Coleman Milligan: Yes, in Databricks, you can use the built-in Delta Lake feature to track the history of changes made to a table, including the queries that inserted data into it. Here's an example of how to retrieve the queries that inserted data ...
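The reply's example is cut off; here is a minimal sketch of reading the history Delta Lake keeps for a table (the table name is a placeholder):

```python
# DESCRIBE HISTORY returns one row per table version, including the operation
# type (WRITE, MERGE, ...) and its parameters.
history = spark.sql("DESCRIBE HISTORY my_schema.my_table")

(history
 .select("version", "timestamp", "operation", "operationParameters", "userName")
 .where("operation IN ('WRITE', 'MERGE', 'STREAMING UPDATE')")
 .show(truncate=False))
```

Note that the history records operations and their parameters rather than full SQL text; for the exact statements, the Databricks SQL query history or audit logs are the usual complement.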

Arty
by New Contributor II
  • 5493 Views
  • 5 replies
  • 6 kudos

Resolved! How to make Autoloader delete files after a successful load

Hi All, Can you please advise how I can arrange loaded file deletion from Azure Storage upon its successful load via Autoloader? As I understand, the Spark streaming "cleanSource" option is unavailable for Autoloader, so I'm trying to find the best way to ...
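The accepted answer is truncated below; one workaround pattern sometimes used for this (a sketch only, not necessarily the thread's accepted solution) is to track each row's source file in foreachBatch and delete the files only after the batch write succeeds. The table, container, and checkpoint path are placeholders.

```python
from pyspark.sql.functions import input_file_name

def write_then_delete(batch_df, batch_id):
    # Persist first; only remove sources once the write has succeeded. Note
    # that a retried batch may attempt to delete already-removed files.
    batch_df.drop("source_file").write.mode("append").saveAsTable("target_table")
    for (path,) in batch_df.select("source_file").distinct().collect():
        dbutils.fs.rm(path)

(spark.readStream
 .format("cloudFiles")
 .option("cloudFiles.format", "json")
 .load("abfss://<container>@<account>.dfs.core.windows.net/landing/")
 .withColumn("source_file", input_file_name())
 .writeStream
 .option("checkpointLocation", "/tmp/checkpoints/landing")
 .foreachBatch(write_then_delete)
 .start())
```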

Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hi @Artem Sachuk, Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...

4 More Replies
Julie1
by New Contributor II
  • 6778 Views
  • 2 replies
  • 1 kudos

Resolved! Query data not showing in custom alert notifications and QUERY_RESULT_ROWS

I've set up a custom alert notification for one of my Databricks SQL queries, and it triggers correctly, but I'm not able to get the actual results of the query to appear in the notification email. I've followed the example/template in the custom ale...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

The actual query results are unfortunately not displayed in the alert. You can pass the alert condition etc., but not the raw results of the underlying query. I hope this will be added in the future. A workaround is to add a link to the query, so the r...

1 More Replies
Mado
by Valued Contributor II
  • 3119 Views
  • 4 replies
  • 3 kudos

Resolved! Streaming Delta Live Table, if I re-run the pipeline, does it append the new data to the current table?

Hi, I have a question about DLT tables. Assume that I have a streaming DLT pipeline which reads data from a Bronze table and applies transformations to the data. The pipeline mode is triggered. If I re-run the pipeline, does it append new data to the current tabl...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@Mohammad Saber: In a Databricks Delta Live Tables (DLT) pipeline, when you re-run the pipeline in "append" mode, new data will be appended to the existing table. Delta Lake provides built-in support for handling duplicates through its "upsert" functionali...
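A minimal sketch of the pattern under discussion (table names are placeholders): in a triggered streaming DLT pipeline, each run reads only the Bronze rows that arrived since the last run and appends the results.

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(name="silver_events")
def silver_events():
    # Streaming read: a triggered run processes only new Bronze rows and
    # appends them to the target; a full refresh reprocesses everything.
    return (dlt.read_stream("bronze_events")
            .where(col("event_type").isNotNull()))
```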

3 More Replies
