Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I am attempting to use Databricks Connect with a cluster in Azure Government with a port of 443, but I get the following error when running databricks-connect test: The port you specified is either being used already or invalid. Port: The port that Data...
Hi everybody.
It looks like the EXISTS statement works incorrectly.
If I execute the following statement in SQL Server, it returns one row, as it should:
WITH a AS (
SELECT '1' AS id, 'Super Company' AS name
UNION
SELECT '2' AS id, 'SUPER COMPANY...
In newer versions of Spark it's possible to use ANTI JOIN and SEMI JOIN.
It looks this way:
WITH a AS (
SELECT '1' AS id, 'Super Company' AS name
UNION
SELECT '2' AS id, 'SUPER COMPANY' AS name
), b AS (
SELECT
'a@b.com' AS user_username, 'Super Co...
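For anyone comparing the two approaches, the semantics can be sketched in plain Python (a toy model with hypothetical data, not the thread's actual tables or Spark itself): a LEFT SEMI JOIN keeps the left rows that have at least one match on the right, which mirrors EXISTS, while a LEFT ANTI JOIN keeps the left rows with no match, mirroring NOT EXISTS.

```python
# Toy illustration of SEMI/ANTI JOIN semantics (hypothetical rows, not the
# thread's data): semi join behaves like EXISTS, anti join like NOT EXISTS.

a = [{"id": "1", "name": "Super Company"}, {"id": "2", "name": "Other Co"}]
b = [{"user_username": "a@b.com", "company": "super company"}]

def semi_join(left, right, pred):
    # Keep left rows that have at least one matching right row (EXISTS).
    return [l for l in left if any(pred(l, r) for r in right)]

def anti_join(left, right, pred):
    # Keep left rows with no matching right row (NOT EXISTS).
    return [l for l in left if not any(pred(l, r) for r in right)]

# Case-insensitive name match, in the spirit of the thread's example.
match = lambda l, r: l["name"].lower() == r["company"].lower()

print(semi_join(a, b, match))  # left rows with a match
print(anti_join(a, b, match))  # left rows without a match
```

Note that, unlike an INNER JOIN, neither operation can duplicate left rows when the right side has several matches, which is exactly why they are a faithful replacement for EXISTS / NOT EXISTS.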
Hi team,
New to Databricks and trying to understand if there is a "true" auto-start capability with Databricks. We are evaluating Databricks Delta Lake as an alternative cloud-based data warehouse solution, but the biggest problem I see is the inabili...
Just adding on to this.
Using DBeaver as a client with a single-node cluster and a pool of idle VMs, it was possible to get the auto-start time of the cluster down to 35 seconds, plus 17 seconds of query time on top to show the first 200 rows ...
Hi there,
Trying to decide if I am going to get started with ML, and I have really enjoyed it so far.
When going through the documentation, there was a blocker moment for me, as I feel the documentation doesn't mention much about the dataset used to train t...
I am working with pandas and Python. After processing a particular dataframe in my program, I append it below an existing Excel file. The problem is that my Excel file has a font size of 11 pt, but the dataframe is written at 12 pt. I want to set f...
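One common way to control the font of appended rows is to set it explicitly with openpyxl after writing. A minimal sketch, assuming openpyxl is installed; the file name, sheet contents, and appended rows are hypothetical stand-ins for the existing workbook and dataframe values:

```python
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font

path = "report.xlsx"  # hypothetical file name

# Stand-in for the existing Excel file with 11 pt content.
wb = Workbook()
ws = wb.active
ws.append(["existing", "row"])

# Append new rows (e.g. a DataFrame's values via df.itertuples()).
new_rows = [["col_a", "col_b"], [1, 2]]
for row in new_rows:
    ws.append(row)

# Force the appended cells to 11 pt so they match the rest of the sheet
# instead of picking up a 12 pt style.
for row in ws.iter_rows(min_row=2):
    for cell in row:
        cell.font = Font(size=11)

wb.save(path)
```

The same loop works when the workbook is opened with load_workbook and rows are appended below existing data; the key point is that the font is a per-cell property, so it must be set on each appended cell.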
Hi, I'm loading a df from Redis using this code:
df = (spark.read.format("org.apache.spark.sql.redis")
    .option("table", f"state_store_ready_to_sell")
    .option("key.column", "msid")
    .option("infer.schema", "true")
    .load())
and then I'm running f...
Hi guys,
I am running a production pipeline (Databricks Runtime 7.3 LTS) that keeps failing for some delta file reads with the error:
21/07/19 09:56:02 ERROR Executor: Exception in task 36.1 in stage 2.0 (TID 58)
com.databricks.sql.io.FileReadExcept...
Question: sparkR.session() gives an error when run in the web terminal, while it runs fine in a notebook. What parameters should be provided to create a Spark session from the web terminal?
PS: I am trying to run a .R file using Rscript call on terminal instead ...
What's the best way to add an external table so another cluster/workspace can access an existing external table on S3? I need to redeploy my workspace into a new VPC, so I am not expecting any collisions of the warehouses. Is it as simple as adding ...
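For Delta tables whose files on S3 are intact, registering the existing location as an external table in the new workspace is typically enough; a sketch, with a hypothetical database name, table name, and S3 path:

```sql
-- Hypothetical names and path: register existing Delta data on S3
-- as an external table in the new workspace's metastore.
CREATE TABLE my_db.events
USING DELTA
LOCATION 's3://my-bucket/warehouse/events/';
```

Because Delta stores the schema in its transaction log, no column list is needed; the new workspace's cluster just needs IAM access to the bucket.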
I have a scenario with a series of jobs triggered in ADF; the jobs are not linked as such, but the resulting temporary tables from each job take up memory on the Databricks cluster. If I could clear the notebook state, that would fre...
In my environment, there are 3 groups of notebooks that run on their own schedules, however they all use the same underlying transaction logs (auditlogs, as we call them) in S3. From time to time, various notebooks from each of the 3 groups fail wit...
I am confused about the difference between running code with the command python3 CODENAME.py and launching it with the command pyspark and then working on the code.
When I run the code: spark = SparkSession.builder.config("spark.driver.memory", "16").appName(...
I am seeing a super weird behaviour in databricks. We initially configured the following:
1. Account X in Account Console -> AWS Account arn:aws:iam::X:role/databricks-s3
2. We set up databricks-s3 as an S3 bucket in Account Console -> AWS Storage
3. W...
My two dataframes look like new_df2_record1 and new_df2_record2 and the expected output dataframe I want is like new_df2:
The code I have tried is the following:
If I print the top 5 rows of new_df2, it gives the output as expected, but I cannot pri...