Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

Brad
by Contributor
  • 990 Views
  • 2 replies
  • 2 kudos

Colon sign operator for JSON

Hi, I have a streaming source loading data to a raw table, which has a string-type column (whose value is JSON) to hold all data. I want to use the colon sign operator to get fields from the JSON string. Is this going to have some perf issues vs. I use a sch...
Latest Reply
Brad
Contributor
  • 2 kudos

Thanks Kaniz. Yes, I did some testing. With a schema, I read the same data source and wrote the parsing results to different tables. For 586K rows, the perf diff is 9 sec vs. 37 sec; for 2.3 million rows, 16 sec vs. 133 sec.
1 More Replies
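The timings above line up with how the two approaches parse: the colon sign operator re-parses the JSON string for every field it extracts, while `from_json` with a schema parses each row once into a struct. A minimal pure-Python analogy of that difference (not Spark itself; the field names are illustrative):

```python
import json

rows = ['{"id": 1, "name": "a", "score": 2.5}'] * 1000

# Colon-operator style: every field access re-parses the raw string.
def per_field_parse():
    return [(json.loads(r)["id"], json.loads(r)["name"]) for r in rows]

# from_json-with-schema style: parse once per row, then read the struct.
def parse_once():
    parsed = [json.loads(r) for r in rows]
    return [(d["id"], d["name"]) for d in parsed]

assert per_field_parse() == parse_once()  # same result, roughly half the parse work
```

In Spark the same trade-off shows up as `SELECT raw:id, raw:name FROM t` versus projecting `from_json(raw, schema)` once, which is consistent with the roughly 4x gap the poster measured.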
vemash
by New Contributor
  • 1062 Views
  • 1 reply
  • 0 kudos

How to create a Docker image to deploy and run in different environments in Databricks?

I am new to Databricks and trying to implement the task below. Task: once code merges to the main branch, the build succeeds in the CI pipeline, and all tests pass, a Docker build should start, create a Docker image, and push it to different environments (fro...
Latest Reply
MichTalebzadeh
Valued Contributor
  • 0 kudos

Hi, this is no different from building a Docker image for any environment. Let us try a simple high-level CI/CD pipeline for building Docker images and deploying them to different environments; it works in all environments, including Databricks ...
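A minimal sketch of the build-once, promote-everywhere flow the reply describes. The registry host, environment names, and tag scheme are assumptions; the function only assembles the commands a CI job would run, it does not execute them:

```python
def docker_pipeline_commands(image, version, environments):
    """Build once, then tag and push the same image to each environment's registry path."""
    cmds = [["docker", "build", "-t", f"{image}:{version}", "."]]
    for env in environments:
        target = f"registry.example.com/{env}/{image}:{version}"
        cmds.append(["docker", "tag", f"{image}:{version}", target])
        cmds.append(["docker", "push", target])
    return cmds

cmds = docker_pipeline_commands("myapp", "1.0.0", ["dev", "staging", "prod"])
```

A CI step would run each command list with `subprocess.run` after the test stage passes; promoting the same image (rather than rebuilding per environment) keeps dev, staging, and prod bit-identical.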
ae20cg
by New Contributor III
  • 11372 Views
  • 17 replies
  • 12 kudos

How to instantiate Databricks spark context in a python script?

I want to run a block of code in a script, not in a notebook, on Databricks; however, I cannot properly instantiate the Spark context without some error. I have tried `SparkContext.getOrCreate()`, but this does not work. Is there a simple way to do t...
Latest Reply
Kaizen
Valued Contributor
  • 12 kudos

I came across a similar issue. Please detail how you are executing the Python script. Are you calling it from the web terminal, or from a notebook? Note: if you are calling it from the web terminal, your Spark session won't be passed. You could create...
16 More Replies
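For a script submitted as a job (rather than run in a notebook), the usual pattern is to ask for the active session instead of constructing a `SparkContext` directly; on a Databricks cluster the builder attaches to the existing session. A guarded sketch, assuming `pyspark` is importable (outside a cluster it simply falls through):

```python
try:
    from pyspark.sql import SparkSession
    # On Databricks this attaches to the cluster's existing session
    # instead of trying to create a second context.
    spark = SparkSession.builder.appName("standalone-script").getOrCreate()
    sc = spark.sparkContext
except Exception:
    spark = sc = None  # pyspark or a local JVM is not available here
```

This avoids the "cannot instantiate SparkContext" class of errors, since the driver-owned session is reused rather than recreated.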
vishwanath_1
by New Contributor III
  • 1742 Views
  • 3 replies
  • 0 kudos

Resolved! Loading spark dataframe to Mongo collection isn't allowing nulls

I am using the below command to push a DataFrame to a Mongo collection. There are a few null values in String and Double datatype columns; we see these are getting missed when pushed to Mongo even after using the option("ignoreNullValues", false). inputproddata...
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @vishwanath_1, Let’s address the issue with null values in your DataFrame when pushing it to a MongoDB collection. The ignoreNullValues option in Spark is designed to control whether null values should be ignored during write operations. Howeve...

2 More Replies
deltax_07
by New Contributor
  • 733 Views
  • 1 reply
  • 0 kudos

Parse_Syntax_Error Help

I'm getting this error: Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near ','.(line 1, pos 18) == SQL == sum(mp4) AS Videos, sum(csv+xlsx) AS Sheets, sum(docx+txt+pdf) AS Docu...
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @deltax_07, The error you’re encountering seems to be related to the syntax in your Spark SQL query. Let’s break it down and address the issue. The problematic part of your query is this section: sum(mp4) AS Videos, sum(csv+xlsx) AS Sheets, sum...

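Position 18 on line 1 lands exactly on the first comma after `sum(mp4) AS Videos`, which typically means the fragment was submitted to the parser on its own, without a surrounding SELECT ... FROM. A hedged reconstruction of the intended statement, held in a Python string for illustration (the table name `file_counts` and the completed third alias are assumptions, since the excerpt is truncated):

```python
# Wrapping the aggregate list in a complete statement parses cleanly;
# on its own, the fragment fails at the first comma.
query = """
SELECT sum(mp4)              AS Videos,
       sum(csv + xlsx)       AS Sheets,
       sum(docx + txt + pdf) AS Documents
FROM file_counts
"""
```

If the fragment is being concatenated into a larger query string, the same error appears when the `SELECT` prefix or `FROM` clause is accidentally dropped during string assembly.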
alxsbn
by New Contributor III
  • 1050 Views
  • 1 reply
  • 0 kudos

Resolved! SELECT issue after an OPTIMIZE operation

I have a strange issue after an OPTIMIZE: no results are returned anymore. I can time travel over the versions easily, but past this data nothing comes back when I'm doing a simple SELECT *. But I still get a result when I'm doing a SELECT count(*). How is this po...
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @alxsbn, After performing an OPTIMIZE operation, you’ve encountered an interesting situation where no results are returned when you execute a simple SELECT * query, but you still get results when you run SELECT COUNT(*). Let’s explore this furthe...

rbauer
by New Contributor
  • 870 Views
  • 1 reply
  • 0 kudos

Dask-Databricks init script not working

Hello everybody! I am trying to use the Dask-Databricks distribution (https://github.com/dask-contrib/dask-databricks?tab=readme-ov-file). I set up the required init script according to the instructions on the GitHub page and had no problems there, h...
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @rbauer, It seems you’re encountering issues with starting your Dask-Databricks cluster due to a non-zero exit status from your init script. Let’s troubleshoot this together. Here are some steps you can take to address the problem: Check the I...

Soh_m
by New Contributor
  • 1289 Views
  • 1 reply
  • 0 kudos

Error accessing Managed Table with Row Level Security using Databricks Cluster

Hi everyone, we are trying to implement row-level security in a Delta table and have done testing (i.e. SQL Execution API, SQL editor, SQL notebook) using SQL Serverless in Unity Catalog. But when we tried to access the table having RLS in a notebook using PySpark, w...
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Soh_m, When dealing with JSON data in your streaming source, you have a couple of options for extracting fields. Let’s explore the trade-offs between using the colon sign operator and the schema+from_json function: Colon Sign Operator: The c...

CBL
by New Contributor
  • 872 Views
  • 1 reply
  • 0 kudos

Schema Evolution in Azure databricks

Hi all, in my scenario I am loading data from 100s of JSON files. The problem is, fields/columns are missing when a JSON file contains new fields. Full load: while writing JSON to Delta, use the option ("mergeSchema", "true") so that we do not miss new columns. Inc...
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @CBL, Handling schema evolution during incremental data loads is crucial to ensure data consistency and prevent issues when new fields are introduced. Let’s explore some strategies for schema comparison in incremental loads: Checksum-based In...

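Alongside `mergeSchema` on the write, the incremental-load question boils down to "which columns are new in this batch?". A minimal pure-Python sketch of that comparison (column names are illustrative; in Spark you would take the names from the batch's `df.schema` and the target Delta table's schema):

```python
def new_columns(target_cols, incoming_cols):
    """Columns present in the incoming batch but missing from the target table."""
    target = set(target_cols)
    return [c for c in incoming_cols if c not in target]

added = new_columns(["id", "name"], ["id", "name", "email"])
# If `added` is non-empty, write with .option("mergeSchema", "true"),
# or alert first if new fields should be reviewed before landing.
```

Doing the comparison explicitly (rather than always enabling `mergeSchema`) lets unexpected fields be flagged instead of silently widening the table.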
jabori
by New Contributor
  • 1528 Views
  • 1 reply
  • 0 kudos

How can I pass job parameters to a dbt task?

I have a dbt task that will use dynamic parameters from the job: {"start_time": "{{job.start_time.[timestamp_ms]}}"}. My SQL is edited like this: select 1 as id union all select null as id union all select {start_time} as id. This causes the task to fail. How...
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @jabori , To correctly pass the start_time parameter in your dbt task, you can utilize dynamic value references provided by Databricks. These templated variables are replaced with appropriate values during task execution. Here’s how you can mod...

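`{start_time}` is Python-style formatting, which dbt never substitutes; dbt models take runtime values through the `var()` Jinja macro, with the value supplied on the command line. A hedged sketch of both halves, held in Python strings for illustration (the exact wiring of the dbt command into the job task is an assumption):

```python
# dbt command the job task would run; the Databricks dynamic value
# reference is substituted before dbt starts.
dbt_command = "dbt run --vars '{start_time: {{job.start_time.[timestamp_ms]}}}'"

# The model reads the value with dbt's var() macro instead of {start_time}.
model_sql = """
select 1 as id
union all
select null as id
union all
select {{ var('start_time') }} as id
"""
```

With this shape, dbt renders `var('start_time')` at compile time, so the task no longer fails on the unresolved placeholder.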
Phani1
by Valued Contributor II
  • 6070 Views
  • 1 reply
  • 0 kudos

What are optimized solutions for moving on-premise Hadoop data

Hi team, what are optimized solutions for moving on-premise Hadoop / Hadoop Distributed File System (HDFS) Parquet data to Databricks as Delta files? Regards, Phanindra

Data Engineering
delta
hadoop
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Phani1, Migrating data from on-premises Hadoop to Databricks as Delta files involves several key steps. Let’s break it down: Administration: In Hadoop, you’re dealing with a monolithic distributed storage and computing platform. It consists ...

chakradhar545
by New Contributor
  • 657 Views
  • 1 reply
  • 0 kudos

DatabricksThrottledException Error

Hi, our scheduled job runs into the below error once in a while and fails. Any leads or thoughts on why we run into this occasionally, and how to fix it? shaded.databricks.org.apache.hadoop.fs.s3a.DatabricksThrottledException: Instantiate s...
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @chakradhar545, The error message you’re encountering indicates a throttling issue when interacting with Amazon S3 using Databricks. Let’s break down the error and explore potential solutions: Error Details: The error message mentions two key...

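Throttling errors like this are usually transient, so the standard mitigation (besides reducing concurrent S3 load) is to retry with exponential backoff and jitter. A generic sketch; the wrapped call and the `RuntimeError` exception type are placeholders for whatever your job actually does:

```python
import random
import time

def with_backoff(fn, retries=5, base=1.0, cap=30.0):
    """Retry fn with exponential backoff and full jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError:  # placeholder for the throttling error
            if attempt == retries - 1:
                raise
            delay = min(cap, base * (2 ** attempt))
            time.sleep(delay * random.random())

# Illustration: a call that is throttled twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

result = with_backoff(flaky, base=0.001)
```

The jitter spreads retries from concurrent tasks apart, which matters when many executors hit the same S3 prefix at once.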
Surya0
by New Contributor III
  • 3509 Views
  • 4 replies
  • 0 kudos

Resolved! Unit hive-metastore.service not found

Hi everyone, I've encountered an issue while trying to make use of the hive-metastore capability in Databricks to create a new database and table for our latest use case. The specific command I used was "create database if not exists newDB". However, ...
Latest Reply
rakeshprasad1
New Contributor III
  • 0 kudos

@Surya0: I am facing the same issue. The stack trace is: Could not connect to address=(host=consolidated-northeuropec2-prod-metastore-2.mysql.database.azure.com)(port=3306)(type=master) : Socket fail to connect to host:consolidated-northeuropec2-prod-metast...
3 More Replies
alexgv12
by New Contributor III
  • 1124 Views
  • 1 reply
  • 0 kudos

how to deploy sql functions in pool

We have some function definitions which we need to have available for our BI tools, e.g. CREATE FUNCTION CREATEDATE(year INT, month INT, day INT) RETURNS DATE RETURN make_date(year, month, day); how can we always have this function definition in our ...
Latest Reply
alexgv12
New Contributor III
  • 0 kudos

Looking at some alternatives with other Databricks components, I think a CI/CD process should be created where the function can be created through the Databricks API: https://docs.databricks.com/api/workspace/functions/create https://community.databr...
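The API route in the reply works; in Unity Catalog an often simpler option is to create the function once in a catalog and schema, since UC SQL functions are persistent and visible to any warehouse or cluster, so BI tools see them without per-pool re-creation. The catalog and schema names below are placeholders; the DDL is held in a Python string for illustration:

```python
# Idempotent DDL a deploy step (CI/CD job or init task) would run once.
create_function_ddl = """
CREATE FUNCTION IF NOT EXISTS main.shared.CREATEDATE(year INT, month INT, day INT)
RETURNS DATE
RETURN make_date(year, month, day)
"""
# On a cluster or warehouse: spark.sql(create_function_ddl)
```

`IF NOT EXISTS` makes the deploy step safe to re-run, which fits the "always available" requirement without tracking whether the function was already defined.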
dbal
by New Contributor III
  • 2515 Views
  • 2 replies
  • 0 kudos

Resolved! Spark job task fails with "java.lang.NoClassDefFoundError: org/apache/spark/SparkContext$"

Hi. I am trying to run a Spark job in Databricks (Azure) using the JAR type. I can't figure out why the job fails to run by not finding the SparkContext. Databricks Runtime: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12). Error message: java.lang.NoCl...
Latest Reply
dbal
New Contributor III
  • 0 kudos

Update 2: I found the reason in the documentation. This is documented under "Access Mode", and it is a limitation of the Shared access mode. Link: https://learn.microsoft.com/en-us/azure/databricks/compute/access-mode-limitations#spark-api-limitations...
1 More Replies

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group