Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.
Here's your Data + AI Summit 2024 - Warehousing & Analytics recap as you use intelligent data warehousing to improve performance and increase your organization’s productivity with analytics, dashboards and insights.
Keynote: Data Warehouse presente...
Is there any business use case where profile_metrics and drift_metrics are used by Databricks customers? If so, kindly describe a scenario where this feature can be leveraged, e.g. data lineage or table metadata updates.
Hey @pankaj2264. Both the profile metrics and drift metrics tables are created and used by Lakehouse Monitoring to assess the performance of your model and data over time or relative to a baseline table. You can find all the relevant information here Intro...
I wrote simple code:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, max
import pyspark.sql.functions as F
streaming_data = spark.read.table("x")
window = Window.partitionBy("BK...
Hi, In my opinion the result is correct. What needs to be noted in the result is that it is sorted by the "Onboarding_External_LakehouseId" column, so if there is a "BK_AccountApplicationId" with the same code, it will be partitioned into 2 row_numbers. Just...
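To make the reply above concrete, here is a minimal pure-Python sketch of what row_number over a window does: rows are grouped by the partition column, sorted by the ordering column, and numbered from 1 within each group. It assumes the window partitions by "BK_AccountApplicationId" and orders by "Onboarding_External_LakehouseId" (the original code is truncated, so this is a guess), and the sample rows are hypothetical.

```python
from collections import defaultdict


def row_number(rows, partition_key, order_key):
    """Assign 1-based row numbers within each partition, sorted by order_key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    numbered = []
    for key in partitions:  # partitions keep first-seen order
        ordered = sorted(partitions[key], key=lambda r: r[order_key])
        for n, row in enumerate(ordered, start=1):
            numbered.append({**row, "row_number": n})
    return numbered


# Hypothetical sample data; column names are taken from the reply above.
rows = [
    {"BK_AccountApplicationId": "A", "Onboarding_External_LakehouseId": 2},
    {"BK_AccountApplicationId": "A", "Onboarding_External_LakehouseId": 1},
    {"BK_AccountApplicationId": "B", "Onboarding_External_LakehouseId": 5},
]
result = row_number(rows, "BK_AccountApplicationId", "Onboarding_External_LakehouseId")
for r in result:
    print(r)
```

Two rows sharing the same "BK_AccountApplicationId" receive row numbers 1 and 2, and which one gets 1 depends only on the ordering column.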
Hi! I receive three streams from a Postgres CDC. These 3 tables (invoices, users and products) need to be joined. I want to use a left join with respect to the invoices stream. In order to compute correct results and release old states, I use watermarks a...
Hi! I am exploring the read state functionality in Spark streaming: https://docs.databricks.com/en/structured-streaming/read-state.html
When I start a streaming query like this:
(
...
.writeStream
.option("checkpointLocation", f"{CHECKPOIN...
Hi, I am trying to make a stream-static join with aggregation, with no luck. I have a streaming table where I am getting events with two nested arrays:
ID  Array1  Array2
1   [1,2]   [3,4]
I need to make two joins to static dictionary tables (without an...
SQL warehouse can auto-terminate after 1 minute, not 5, as in UI. Just run a simple CLI command. Of course, with such a low auto termination, you lose the benefit of CACHE, but for some ad-hoc queries, it is the perfect setup when combined with serve...
Hi @Hubert-Dudek , Hope you are doing well!
Could you please clarify more on your ask here?
However, from the above details, the SQL warehouse mentioned is auto-terminating after 1 minute of inactivity because the Auto stop is set to 1 minute. Howe...
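For readers who prefer the REST route over the CLI, the following is a rough sketch of how the auto-stop value could be set programmatically. It assumes the SQL Warehouses edit endpoint (`/api/2.0/sql/warehouses/{id}/edit`) and its `auto_stop_mins` field; the host, warehouse ID and token below are placeholders, and the endpoint may expect additional fields (such as the warehouse name) in practice. The function only builds the request and does not send it.

```python
import json
import urllib.request


def build_auto_stop_request(host, warehouse_id, token, minutes):
    """Build (but do not send) a request that sets a warehouse's auto-stop time."""
    url = f"{host.rstrip('/')}/api/2.0/sql/warehouses/{warehouse_id}/edit"
    # Assumption: auto_stop_mins is the only field being changed.
    body = json.dumps({"auto_stop_mins": minutes}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values, not real credentials or IDs.
req = build_auto_stop_request(
    "https://example.cloud.databricks.com", "abc123", "TOKEN", 1
)
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`, after checking the current API documentation for required fields.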
Hi all! The Databricks Looker Studio connector has now been available for a few weeks. Tested the connector but running into several issues: I am used to working with dynamic queries, so I am able to use date parameters (similar to BigQuery Looker St...
Hi @Retired_mod Hope you're doing well! I am very curious about the following thing: However, there might be workarounds or alternative approaches to achieve similar functionality. You could explore using Looker’s native features for dynamic filterin...
I'm trying to set up a connection to Iceberg on S3 via Snowflake as described in https://medium.com/snowflake/how-to-integrate-databricks-with-snowflake-managed-iceberg-tables-7a8895c2c724 and https://docs.snowflake.com/en/user-guide/tables-iceberg-catal...
Hi @Retired_mod, We've been working on setting up Glue as the catalog, which is working fine so far. However, Glue takes the place of the hive_metastore, which appears to be a legacy way of setting this up. Is the way proposed here the recommended way to set...
Hi, I want to remove duplicate rows from my managed Delta table in Unity Catalog. I use a query on a SQL warehouse similar to this:
WITH cte AS (
SELECT
id, ROW_NUMBER() OVER (PARTITION BY id,##,##,## ORDER BY ts) AS row_num
FROM
catalog.sch...
I first tried to use _metadata.row_index to delete the correct rows, but this also resulted in an error. My solution for now was to use Spark and overwrite the table.
table_name = "catalog.schema.table"
df = spark.read.table(table_name)
count_df = df....
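The deduplication logic discussed in this thread (keep one row per key, chosen by a timestamp ordering, then overwrite the table) can be illustrated without Spark. The sketch below is not the poster's code, which is truncated; it is a plain-Python stand-in that keeps the earliest row per key, mirroring ROW_NUMBER() ... ORDER BY ts with row_num = 1. The column names `id` and `ts` come from the query above, and the sample rows are hypothetical.

```python
def keep_first_per_key(rows, key_cols, order_col):
    """Keep, for each key, the row with the smallest order_col value."""
    best = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        # Replace the stored row only when a strictly earlier one appears.
        if key not in best or row[order_col] < best[key][order_col]:
            best[key] = row
    return list(best.values())


# Hypothetical sample rows with a duplicated id.
rows = [
    {"id": 1, "ts": 3},
    {"id": 1, "ts": 1},
    {"id": 2, "ts": 2},
]
deduped = keep_first_per_key(rows, key_cols=["id"], order_col="ts")
print(deduped)
```

In Spark the same selection would typically be followed by writing the deduplicated result back over the original table, as the poster describes.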
Recently, it seems that there has been an intermittent issue where the output of a notebook cell doesn't display, even though the code within the cell executes successfully. For instance, there are times when simply printing a dataframe yields no out...
select {{user_defined_variable}} as my_var, count(*) as cnt
from my_table
where {{user_defined_variable}} = {{value}}
For user_defined_variable, I use a query-based dropdown list to get a column_name I'd like ...
Hey, I've managed to add my SQL Warehouse as a data source in PyCharm using the JDBC driver and can query the warehouse from an SQL console within PyCharm. This is great; however, what I'm struggling with is getting the catalogs and schemas to show in...
You need to explicitly tell your JetBrains tool to introspect the database using JDBC metadata. I think the reason it (sometimes) works in DataGrip but not PyCharm, IntelliJ, etc. is because the default settings can be different across tools and even v...
I am currently trying to write a dataframe to S3 like
df.write.partitionBy("col1","col2").mode("overwrite").format("json").save("s3a://my_bucket/")
The path becomes `s3a://my_bucket/col1=abc/col2=opq/`
But I want the path to be `s3a://my_bucket/abc/opq/`...
Hi @Jennifer ,
The default behavior of the .partitionBy() function in Spark is to create a directory structure with partition column names. This is similar to Hive's partitioning scheme and is done for optimization purposes. Hence, you cannot directl...
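One possible workaround, sketched here under the assumption that the data is first written with .partitionBy() and the objects are then copied or renamed in a separate step, is to compute the target key by dropping the column-name prefix from each Hive-style segment. The helper below is hypothetical and only does the path mapping; the actual copy within S3 is left out, and the file name in the example is made up.

```python
def strip_partition_names(relative_path):
    """Turn 'col1=abc/col2=opq/file' into 'abc/opq/file'."""
    parts = []
    for segment in relative_path.split("/"):
        # Split on the first '=' only, so values containing '=' survive.
        name, sep, value = segment.partition("=")
        parts.append(value if sep else segment)
    return "/".join(parts)


# Hypothetical object key under s3a://my_bucket/
target = strip_partition_names("col1=abc/col2=opq/part-0000.json")
print(target)
```

Note that readers relying on partition discovery will no longer infer col1 and col2 from the renamed layout.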
I am in the process of connecting Looker to one of my Databricks databases. To reduce startup time on my SQL warehouse cluster I would like to change the type from "Pro" to "Serverless". I cannot find a way to do that and "Serverless" is not an optio...