Data Engineering

Forum Posts

User16790091296
by Contributor II
  • 871 Views
  • 1 reply
  • 0 kudos

How to prevent duplicate entries from entering a Delta lake in Azure Storage?

I have a DataFrame stored in Delta format in ADLS. When I try to append new updated rows to that Delta lake, is there any way I can delete the old existing record in Delta and add the new updated record? There is a uni...

Latest Reply
Ryan_Chynoweth
Honored Contributor III
  • 0 kudos

To achieve this, you should use a MERGE command keyed on the unique ID: it will update the rows that already exist and insert the rows that do not. If you want to do it manually, you could delete rows using the DE...
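
A minimal PySpark sketch of that MERGE approach, assuming a hypothetical table path and a unique key column named id (neither is from the thread):

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# updates_df is the incoming batch of new/updated rows (assumed to exist).
target = DeltaTable.forPath(spark, "abfss://container@account.dfs.core.windows.net/delta/events")
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")  # match on the unique ID
    .whenMatchedUpdateAll()      # replace rows that already exist
    .whenNotMatchedInsertAll()   # insert rows that are genuinely new
    .execute())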

User16790091296
by Contributor II
  • 569 Views
  • 0 replies
  • 1 kudos

How to get access to Databricks SQL analytics?

I am trying to follow this tutorial about Databricks SQL Analytics (https://docs.microsoft.com/en-us/azure/databricks/sql/get-started/admin-quickstart), but when I create my Databricks workspace I do not have the icon at the bottom of the sidebar to access ...

User16869510359
by Esteemed Contributor
  • 4659 Views
  • 1 reply
  • 0 kudos

Resolved! Does Ganglia report incorrect memory stats?

I am looking at the memory utilization of the executors, and I see that the heap utilization of the executor is far less than what is reported in Ganglia. Why does Ganglia report incorrect memory details?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Ganglia reports memory utilization at the system level. For example, say the JVM has an Xmx value of 100 GB. At some point it will occupy the full 100 GB, and then a garbage collection will clear off the heap. Once the GC frees up the memory, th...

User16790091296
by Contributor II
  • 1172 Views
  • 0 replies
  • 1 kudos

What is the most efficient way to read in a partitioned parquet file with pyspark?

I work with Parquet files stored in AWS S3 buckets. They are multiple TB in size and partitioned by a numeric column containing integer values between 1 and 200; call it my_partition. I read in and perform compute actions on this data in Databricks w...
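
Since this thread has no reply, here is a common pattern as a sketch (the S3 path is a placeholder; my_partition and the 1-200 value range come from the question): filter on the partition column before triggering any action, so Spark prunes partition directories instead of scanning the full multi-TB dataset.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Lazy read of the whole partitioned dataset; nothing is scanned yet.
df = spark.read.parquet("s3://my-bucket/my-dataset/")

# The filter on my_partition is pushed down to partition pruning,
# so only the my_partition=42/ directory is listed and read.
subset = df.where(F.col("my_partition") == 42)
subset.count()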

User16869510359
by Esteemed Contributor
  • 1632 Views
  • 1 reply
  • 0 kudos

Resolved! Is it mandatory to checkpoint my streaming query?

I have ad-hoc, one-time streaming queries where I believe checkpointing won't add any value. Should I still use checkpointing?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

It's not mandatory, but the strong recommendation is to use checkpointing for streaming irrespective of your use case, because the default checkpoint location can accumulate a lot of files over time, as there is no graceful, guaranteed cleaning in pla...
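
A minimal sketch of setting an explicit, per-query checkpoint location instead of relying on the default (both paths are placeholders):

(spark.readStream
    .format("delta")
    .load("/data/source")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/my_query")  # a location you control and can clean up
    .start("/data/sink"))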

User16783855534
by New Contributor III
  • 819 Views
  • 2 replies
  • 0 kudos

Should/can I use Spark Streaming for batch workloads?

It's preferable to use Spark Streaming (with Delta) for batch workloads rather than regular batch. With the trigger.once trigger, whenever the streaming job is started it will process whatever is available in the source (Kafka/Kinesis/file system) and ...

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

The streaming checkpoint mechanism is independent of the trigger type. The way checkpointing works is: it creates an offset file when processing the batch, and once the batch is completed it creates a commit file for that batch in the checkpoint director...
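
A minimal sketch of the trigger-once pattern discussed above (paths are placeholders): each run processes whatever accumulated in the source since the last commit recorded in the checkpoint, then stops.

(spark.readStream
    .format("delta")
    .load("/data/bronze")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/bronze_to_silver")  # offset and commit files live here
    .trigger(once=True)  # process all available data, then shut down
    .start("/data/silver"))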

User16869510359
by Esteemed Contributor
  • 565 Views
  • 1 reply
  • 0 kudos

How to migrate to Auto Loader without downtime?

I have an S3-SQS workload. Is it possible to migrate the workload to Auto Loader without downtime? What are the migration guidelines?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

The SQS queue used by the existing application can be reused by Auto Loader, thereby ensuring minimal downtime.
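
A sketch of what that could look like using Auto Loader's file-notification mode pointed at an existing queue via the cloudFiles.queueUrl option; the queue URL, source path, and file format here are placeholders:

df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")
        # Reuse the SQS queue the existing S3-SQS source already consumes from.
        .option("cloudFiles.queueUrl", "https://sqs.us-east-1.amazonaws.com/123456789012/my-existing-queue")
        .load("s3://my-bucket/landing/"))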

User16869510359
by Esteemed Contributor
  • 781 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

The issue can happen if the Hive syntax for table creation is used instead of the Spark syntax. Read more here: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table-hiveformat.html. The issue mentioned in t...
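
To illustrate the syntactic difference the reply points to, a hedged sketch (table names and columns are made up): the first statement uses Spark's DataSource syntax, the second uses Hive's STORED AS syntax and therefore goes through the Hive code path.

# Spark (DataSource) syntax: creates a Delta table.
spark.sql("CREATE TABLE events (id INT, ts TIMESTAMP) USING DELTA")

# Hive syntax: note STORED AS; this is the form that can trigger the issue.
spark.sql("CREATE TABLE events_hive (id INT, ts TIMESTAMP) STORED AS PARQUET")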

User16869510359
by Esteemed Contributor
  • 2659 Views
  • 1 reply
  • 0 kudos

Resolved! How to track the history of schema changes for a Delta table

I have a Delta table that had schema changes across multiple commits, and I want to track all the schema changes that happened on the table. DESCRIBE HISTORY is not sufficient, as it only logs the schema changes made by explicit ALTER TABLE operations.

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

When a write operation is performed with columns added, that is not explicitly shown in the DESCRIBE HISTORY output. Only an entry is made for the write, and the operationParameters do not show anything about schema evolution. Whereas if we d...
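
A small sketch of the behavior being described, assuming a hypothetical table path and a DataFrame new_df that carries an extra column:

# Append with schema evolution enabled; the table schema gains a column.
(new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/data/events"))

# The history shows a WRITE entry, but its operationParameters
# do not flag the schema change explicitly.
spark.sql("DESCRIBE HISTORY delta.`/data/events`").show(truncate=False)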

User16869510359
by Esteemed Contributor
  • 1886 Views
  • 1 reply
  • 0 kudos
Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Yes, it's possible to use the Kafka API to connect to Event Hubs; Event Hubs supports using the Kafka API to stream data from the event hub. Reference: https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview. Sample pr...
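
A minimal PySpark sketch of reading from Event Hubs through its Kafka endpoint; the namespace, event hub name, and connection string are placeholders, and the kafkashaded prefix on the login module reflects how Kafka classes are shaded on Databricks:

EH_CONN = "Endpoint=sb://<NAMESPACE>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."
JAAS = ('kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
        'required username="$ConnectionString" password="' + EH_CONN + '";')

df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "<NAMESPACE>.servicebus.windows.net:9093")  # Kafka endpoint on port 9093
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config", JAAS)
        .option("subscribe", "<EVENT_HUB_NAME>")  # the event hub acts as the Kafka topic
        .load())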

User16869510359
by Esteemed Contributor
  • 12847 Views
  • 1 reply
  • 0 kudos

Resolved! How do I change the log level in Databricks?

How can I change the log level of the Spark Driver and executor process?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Change the log level of the driver:
%scala
spark.sparkContext.setLogLevel("DEBUG")
spark.sparkContext.setLogLevel("INFO")
Change the log level of a particular package in the driver logs:
%scala
org.apache.log4j.Logger.getLogger("shaded.databricks.v201809...
