Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

chanansh
by Contributor
  • 1072 Views
  • 1 reply
  • 0 kudos

Stream from Azure with credentials

I am trying to read a stream from Azure: (spark.readStream .format("cloudFiles") .option('cloudFiles.clientId', CLIENT_ID) .option('cloudFiles.clientSecret', CLIENT_SECRET) .option('cloudFiles.tenantId', TENANT_ID) .option("header", "true") .opti...

Latest Reply
Anonymous
Not applicable

@Hanan Shteingart: It looks like you're using the Azure Blob Storage connector for Spark to read data from Azure. The error message suggests that the credentials you provided are not being used by the connector. To specify the credentials, you can se...
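
A minimal sketch of the kind of configuration the reply is pointing at, assuming a service principal with access to the storage account; the account, container, and path names below are placeholders, and the ABFS OAuth settings are session Spark configs, separate from the cloudFiles.* options:

```python
# Sketch only: assumes a service principal that has access to the storage
# account; the account, container, and path names are placeholders, and
# CLIENT_ID / CLIENT_SECRET / TENANT_ID come from your secret scope.
storage_account = "mystorageaccount"

# ABFS OAuth settings used for the file reads themselves -- the
# cloudFiles.clientId/clientSecret/tenantId options only cover the
# file-notification setup, not data access.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", CLIENT_ID)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", CLIENT_SECRET)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/token")

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", "true")
      .load(f"abfss://mycontainer@{storage_account}.dfs.core.windows.net/input/"))
```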

fhmessas
by New Contributor II
  • 2506 Views
  • 1 reply
  • 0 kudos

Resolved! Autoloader stream with EventBridge message

Hi All, I have a few streaming jobs running, but we have been facing an issue related to messaging. We have multiple feeds within the same root folder, i.e. logs/{accountId}/CloudWatch|CloudTrail|vpcflow/yyyy-mm-dd/logs. Hence, SQS allows to set up o...

Latest Reply
Anonymous
Not applicable

@Fernando Messas: Yes, you can configure Autoloader to consume messages from an SQS queue using EventBridge. Here are the steps you can follow: Create an EventBridge rule to filter messages from the SQS queue based on specific criteria (such as the...
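
For reference, a sketch of what the Auto Loader side of this might look like, with the stream pointed at an existing queue fed by your EventBridge rule; the queue URL and S3 path are placeholders:

```python
# Sketch: Auto Loader in file-notification mode reading from an existing
# SQS queue (fed by an EventBridge rule); queue URL and path are placeholders.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")
      # Reuse a queue you manage instead of letting Auto Loader create one:
      .option("cloudFiles.queueUrl",
              "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue")
      .load("s3://my-bucket/logs/"))
```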

bchaubey
by Contributor II
  • 3444 Views
  • 1 reply
  • 0 kudos

Unable to connect to Azure Storage with Scala

Hi Team, I am unable to connect to a Storage account with Scala in Databricks, getting the below error. AbfsRestOperationException: Status code: -1 error code: null error message: Cannot resolve hostname: ptazsg5gfcivcrstrlrs.dfs.core.windows.net Caused by: Un...

Latest Reply
Anonymous
Not applicable

@Bhagwan Chaubey: The error message suggests that the hostname for your Azure Storage account could not be resolved. This could happen if there is a network issue, or if the hostname is incorrect. Here are some steps you can try to resolve the issue:...
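
As a quick first check (my suggestion, not from the reply), you can test DNS resolution from a notebook on the cluster, using the hostname from the error:

```python
# Quick DNS sanity check from the driver; raises socket.gaierror if the
# hostname cannot be resolved from the cluster's network.
import socket

print(socket.gethostbyname("ptazsg5gfcivcrstrlrs.dfs.core.windows.net"))
```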

Data_Sam
by New Contributor II
  • 823 Views
  • 1 reply
  • 1 kudo

Streaming data: apply_changes error with incoming files

Hi all, When I designed a streaming data pipeline with incoming files and used the apply_changes function on the silver table, comparing changes between bronze and silver to remove duplicates based on key columns, do you know why I got ignore change to tr...

Latest Reply
Anonymous
Not applicable

@Raymond Huang: The error message "ignore changes to true" typically occurs when you are trying to apply changes to a table using Delta Lake's change data capture (CDC) feature, but you have set the option ignoreChanges to true. This option tells De...
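
For context, ignoreChanges is set when streaming out of a Delta table that is itself rewritten by updates or deletes (as APPLY CHANGES does); a sketch, with the table name as a placeholder:

```python
# Sketch: streaming out of a Delta table that is rewritten by updates or
# deletes. ignoreChanges lets the stream proceed past rewritten files, at
# the cost of possibly re-emitting unchanged rows (dedupe downstream).
# The table name is a placeholder.
bronze_stream = (spark.readStream
                 .format("delta")
                 .option("ignoreChanges", "true")
                 .table("bronze"))
```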

bobbysidhartha
by New Contributor
  • 13736 Views
  • 1 reply
  • 0 kudos

How to merge data in parallel into partitions of a Databricks Delta table using PySpark/Spark Streaming?

I have a PySpark streaming pipeline which reads data from a Kafka topic; the data goes through various transformations and finally gets merged into a Databricks Delta table. In the beginning we were loading data into the Delta table by using the merge ...

Latest Reply
Anonymous
Not applicable

@bobbysidhartha: When merging data into a partitioned Delta table in parallel, it is important to ensure that each job only accesses and modifies the files in its own partition to avoid concurrency issues. One way to achieve this is to use partition...
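
One common shape for this (a sketch under assumed names, not the poster's pipeline): include a literal partition predicate in the merge condition so each concurrent job prunes to its own partition:

```python
# Sketch: a partition-pruned MERGE so that parallel jobs touch disjoint
# files. Table, column, and DataFrame names are placeholders.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "events")
partition_value = "2023-05-01"  # each parallel job handles one partition

(target.alias("t")
 .merge(updates_df.alias("s"),  # updates_df: this job's transformed batch
        # The literal partition predicate lets Delta prune to a single
        # partition, which keeps concurrent merges from conflicting.
        f"t.event_date = '{partition_value}' "
        "AND t.event_date = s.event_date AND t.id = s.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```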

NakedSnake
by New Contributor III
  • 755 Views
  • 1 reply
  • 0 kudos

Connect to resource in another AWS account using transit gateway, not working

I'm trying to reach a service hosted in another AWS account through a transit gateway. The Databricks environment was created using Terraform, from the template available in the official documentation. Placing a VM in Databricks' private subnets makes us ab...

Latest Reply
Anonymous
Not applicable

@Thomaz Moreira: It sounds like there might be an issue with the network configuration of your Databricks cluster. Here are a few things you can check: Make sure that your Databricks cluster is in the same VPC as your service in the other AWS account...

anonturtle
by New Contributor
  • 1176 Views
  • 1 reply
  • 0 kudos

How does AutoML classify whether a feature is numeric or categorical?

When running AutoML from its UI, it classifies the feature "local_convenience_store" as both a numeric and a categorical column. This affects the result, as numeric columns are scaled while categorical columns are one-hot encoded. For contex...

Latest Reply
Anonymous
Not applicable

@hr then: The approach taken by AutoML to classify features as numeric or categorical depends on the specific AutoML framework or library being used, as different implementations may use different methods or heuristics to make this determination. In ...
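
One pragmatic workaround (my suggestion, not from the reply) is to cast the ambiguous column to a string before handing the table to AutoML, so it can only be treated as categorical; table and column names below are placeholders:

```python
# Sketch of the workaround: cast the ambiguous column to string so AutoML
# can only treat it as categorical. Table and column names are placeholders.
from pyspark.sql.functions import col

df = spark.table("house_sales")
df = df.withColumn("local_convenience_store",
                   col("local_convenience_store").cast("string"))
```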

Llop
by New Contributor II
  • 1284 Views
  • 1 reply
  • 0 kudos

Delta Live Tables CDC questions

We are trying to migrate an Azure Data Factory pipeline, which loads CSV files and outputs Delta tables in Databricks, to Delta Live Tables. The pipeline is triggered on demand via an external application which places the files in a Storage folder and t...

Latest Reply
Anonymous
Not applicable

@Enric Llop: When using Delta Live Tables to perform a "rip and replace" operation, where you want to replace the existing data in a table with new data, there are a few things to keep in mind. First, the apply_changes function is used to apply chang...
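
For reference, the usual shape of an apply_changes flow; the table names, keys, and sequencing column below are placeholders:

```python
# Sketch of a DLT CDC flow (runs inside a DLT pipeline, not a plain
# notebook); table names, keys, and the sequencing column are placeholders.
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze",     # streaming source of change rows
    keys=["customer_id"],          # identifies the row to upsert
    sequence_by=col("load_ts"),    # orders changes for the same key
    stored_as_scd_type=1,          # overwrite in place
)
```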

190809
by Contributor
  • 918 Views
  • 1 reply
  • 0 kudos

Trying to figure out what is causing non-null values in my bronze tables to be returned as NULL in silver tables.

I have a process which loads data from JSON to a bronze table. It then adds a couple of columns and creates a silver table. But the silver table has NULL values where there were values in the bronze tables. The process is as follows: def load_to_silver(sourc...

Latest Reply
Anonymous
Not applicable

@Rachel Cunningham: One possible reason for this issue could be a data type mismatch between the bronze and silver tables. It is possible that the column in the bronze table has a non-null value, but the data type of that column is different from th...
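
A quick way to confirm this theory is to diff the two schemas and then cast explicitly when building silver; the table and column names below are placeholders:

```python
# Sketch: diff the two schemas, then cast explicitly rather than relying on
# an implicit conversion (which silently yields NULLs). Names are placeholders.
from pyspark.sql.functions import col

bronze = spark.table("bronze")
silver = spark.table("silver")

# Columns whose (name, type) pairs differ between the two tables:
print(set(bronze.dtypes) ^ set(silver.dtypes))

fixed = bronze.withColumn("amount", col("amount").cast("decimal(18,2)"))
```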

Harsh_Paliwal
by New Contributor
  • 2114 Views
  • 1 reply
  • 0 kudos

java.lang.Exception: Unable to start python kernel for ReplId-79217-e05fc-0a4ce-2, kernel exited with exit code 1.

I am running a parameterized Autoloader notebook in a workflow. This notebook is being called 29 times in parallel, and FYI UC is also enabled. I am facing this error: java.lang.Exception: Unable to start python kernel for ReplId-79217-e05fc-0a4ce-2, ke...

Latest Reply
Anonymous
Not applicable

@Harsh Paliwal: The error message suggests that there might be a conflict with the xtables lock. One thing you could try is to add the -w option as suggested by the error message. You can add the following command to the beginning of your notebook t...

Chris_Konsur
by New Contributor III
  • 2196 Views
  • 1 reply
  • 0 kudos

Unit test with Nutter

When I run the simple test in a notebook, it works fine, but when I run it from the Azure ADO pipeline, it fails with the error. Code: def __init__(self): NutterFixture.__init__(self); from runtime.nutterfixture import NutterFixture, tag; class uTestsDa...

Latest Reply
Anonymous
Not applicable

@Chris Konsur: The error message suggests that there is an issue with the standard output buffer when the Python interpreter is shutting down, which could be related to daemon threads. This error is not specific to Databricks or Azure ADO pipelines, ...
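
For comparison, a minimal fixture in the shape Nutter expects (class and assertion names are hypothetical); note the import sits above the class definition, and execute_tests() drives the run:

```python
# Minimal Nutter fixture sketch; class and assertion names are hypothetical.
from runtime.nutterfixture import NutterFixture

class uTestsDataPipeline(NutterFixture):
    def __init__(self):
        NutterFixture.__init__(self)

    def assertion_smoke(self):
        # Replace with a real check against your notebook's output.
        assert spark.sql("SELECT 1 AS ok").first().ok == 1

result = uTestsDataPipeline().execute_tests()
print(result.to_string())
```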

danniely
by New Contributor II
  • 11958 Views
  • 1 reply
  • 2 kudos

PySpark RDD fails with pytest

When I call RDD APIs during pytest, it seems like the module "serializer.py" cannot find any other modules under pyspark. I've already looked this up on the internet, and it seems like pyspark modules are not properly importing other referenced modules. I see ot...

Latest Reply
Anonymous
Not applicable

@hyunho lee: It sounds like you are encountering an issue with PySpark's serializer not being able to find the necessary modules during testing with pytest. One solution you could try is to set the PYTHONPATH environment variable to include the pat...
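
A sketch of that idea in a conftest.py, assuming SPARK_HOME is set on the test machine; the py4j zip file name varies by Spark release:

```python
# conftest.py sketch: make pyspark's bundled modules importable before the
# tests run. Assumes SPARK_HOME points at a local Spark install; the py4j
# zip file name varies with the Spark version.
import glob
import os
import sys

spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, glob.glob(
    os.path.join(spark_home, "python", "lib", "py4j-*.zip"))[0])
```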

quakenbush
by Contributor
  • 6610 Views
  • 1 reply
  • 0 kudos

Is there something like Oracle's VPD feature in Databricks?

Since I am porting some code from Oracle to Databricks, I have another specific question. In Oracle there's something called Virtual Private Database (VPD). It's a simple security feature used to generate a WHERE clause which the system will add to a u...

Latest Reply
Anonymous
Not applicable

@Roger Bieri: In Databricks, you can use the UserDefinedFunction (UDF) feature to create a custom function that will be applied to a DataFrame. You can use this feature to add a WHERE clause to a DataFrame based on the user context. Here's an exampl...
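
Beyond the UDF approach in the reply, a closer Databricks analogue to VPD is a dynamic view that filters rows on the caller's identity; a sketch using the built-in current_user() and is_member() SQL functions, with table, column, and group names as placeholders:

```python
# Sketch: a dynamic view as a VPD-style row filter. current_user() and
# is_member() are built-in Databricks SQL functions; the table, column,
# and group names are placeholders.
spark.sql("""
    CREATE OR REPLACE VIEW orders_filtered AS
    SELECT *
    FROM orders
    WHERE is_member('admins')          -- admins see every row
       OR sales_rep = current_user()   -- others see only their own rows
""")
```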

Fed
by New Contributor III
  • 6151 Views
  • 1 reply
  • 0 kudos

Setting checkpoint directory for checkpointInterval argument of estimators in pyspark.ml

Tree-based estimators in pyspark.ml have an argument called checkpointInterval: checkpointInterval = Param(parent='undefined', name='checkpointInterval', doc='set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will ...

Latest Reply
Anonymous
Not applicable

@Federico Trifoglio: If sc.getCheckpointDir() returns None, it means that no checkpoint directory is set in the SparkContext. In this case, the checkpointInterval argument will indeed be ignored. To set a checkpoint directory, you can use the SparkC...
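
A short sketch of that, with the DBFS path as a placeholder:

```python
# Sketch: set a checkpoint directory before fitting so that
# checkpointInterval takes effect; the DBFS path is a placeholder.
spark.sparkContext.setCheckpointDir("dbfs:/tmp/ml_checkpoints")
print(spark.sparkContext.getCheckpointDir())  # no longer None
```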

Phani1
by Valued Contributor II
  • 3247 Views
  • 1 reply
  • 0 kudos

Best practices/steps for Hive metastore backup and restore

Hi Team, Could you share with us the best practices/steps for Hive metastore backup and restore? Regards, Phanindra

Latest Reply
Anonymous
Not applicable

@Janga Reddy: Certainly! Here are the steps for Hive metastore backup and restore on Databricks. Backup: Stop all running Hive services and jobs on the Databricks cluster. Create a backup directory in DBFS (Databricks File System) where the metadata fi...
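
As one coarse, hedged illustration of the "copy the metadata out" step (my sketch, not an official procedure), you could export the DDL for every table into the backup directory; all paths below are placeholders:

```python
# Rough sketch only (not an official procedure): dump CREATE TABLE
# statements for every database to a DBFS backup folder. This captures
# table DDL, not grants, statistics, or the metastore database itself.
import os

backup_dir = "/dbfs/tmp/hive_metastore_backup"  # placeholder path
os.makedirs(backup_dir, exist_ok=True)

for db in [row[0] for row in spark.sql("SHOW DATABASES").collect()]:
    with open(os.path.join(backup_dir, f"{db}.sql"), "w") as f:
        for t in spark.sql(f"SHOW TABLES IN {db}").collect():
            ddl = spark.sql(f"SHOW CREATE TABLE {db}.{t.tableName}").first()[0]
            f.write(ddl + ";\n\n")
```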

