Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi there,
I'm trying to decide whether to get started with ML, and I've really enjoyed it so far.
While going through the documentation I hit a blocker, as I feel the documentation doesn't say much about the dataset used to train t...
I am working with pandas and Python. After processing a particular DataFrame in my program, I append that DataFrame below the existing contents of an Excel file. The problem is that my Excel file has a font size of 11 pt but the DataFrame has a font size of 12 pt. I want to set f...
Hi, I'm loading a DataFrame from Redis using this code:
df = (spark.read.format("org.apache.spark.sql.redis")
    .option("table", f"state_store_ready_to_sell")
    .option("key.column", "msid")
    .option("infer.schema", "true")
    .load())
and then I'm running f...
Hi guys,
I am running a production pipeline (Databricks Runtime 7.3 LTS) that keeps failing for some delta file reads with the error:
21/07/19 09:56:02 ERROR Executor: Exception in task 36.1 in stage 2.0 (TID 58)
com.databricks.sql.io.FileReadExcept...
Question: sparkR.session() gives an error when run in the web terminal, although it runs fine in a notebook. What parameters should be provided to create a Spark session in the web terminal?
PS: I am trying to run a .R file using an Rscript call in the terminal instead ...
What's the best way to add an external table so another cluster/workspace can access an existing external table on S3? I need to redeploy my workspace into a new VPC, so I am not expecting any collisions of the warehouses. Is it as simple as adding ...
I have a scenario where a series of jobs is triggered in ADF. The jobs are not linked as such, but the temporary tables produced by each job take up memory on the Databricks cluster. If I can clear the notebook state, that would fre...
In my environment, there are 3 groups of notebooks that run on their own schedules; however, they all use the same underlying transaction logs (auditlogs, as we call them) in S3. From time to time, various notebooks from each of the 3 groups fail wit...
I am confused about the difference between running code with the command python3 CODENAME.py and launching it with the command pyspark and then working on the code there.
When I run the code: spark = SparkSession.builder.config("spark.driver.memory", "16").appName(...
I am seeing some very strange behaviour in Databricks. We initially configured the following:
1. Account X in Account Console -> AWS Account arn:aws:iam::X:role/databricks-s3
2. We set up databricks-s3 as the S3 bucket in Account Console -> AWS Storage
3. W...
My two dataframes look like new_df2_record1 and new_df2_record2 and the expected output dataframe I want is like new_df2:
The code I have tried is the following:
If I print the top 5 rows of new_df2, it gives the output as expected but I cannot pri...
Hi,
I have a metadata CSV file which contains column names and data types, such as
Colm1: INT
Colm2: String.
I can also get the same in a json format as shown:
I can store this on ADLS. How can I convert this into a schema like: "Myschema" that I can ...
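One possible direction for the question above, offered only as a minimal sketch: turn the "name: type" metadata lines into a DDL-style schema string. The function name metadata_to_ddl and the sample text are hypothetical, the line format is assumed from the two example rows in the post, and the type names are passed through unvalidated. Spark's DataFrameReader.schema accepts a DDL-formatted string as well as a StructType, so a string like this could be supplied there.

```python
def metadata_to_ddl(metadata_text):
    """Convert 'name: type' lines into a DDL schema string.

    Assumes one column per line in the form 'Colm1: INT', as in the post;
    a trailing period on a line is dropped. Type names are not validated.
    """
    fields = []
    for raw_line in metadata_text.splitlines():
        line = raw_line.strip().rstrip(".")
        if not line:
            continue  # skip blank lines
        name, _, dtype = line.partition(":")
        fields.append(f"{name.strip()} {dtype.strip().upper()}")
    return ", ".join(fields)


# Hypothetical sample mirroring the two rows shown in the post.
sample = "Colm1: INT\nColm2: String."
ddl = metadata_to_ddl(sample)
print(ddl)  # Colm1 INT, Colm2 STRING
```

Keeping the result as a plain string avoids hard-coding a type mapping; whether every type name in the real metadata file is valid for Spark is something the sketch does not check.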
We get the error below when we try to set a date in a PreparedStatement using the Simba Spark JDBC driver.
Exception:
Query execution failed: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.h...
I used @pandas_udf to write a function to speed up a process (parsing XML files) and then compared its speed with single-threaded code. Surprisingly, using @pandas_udf is two times slower than the single-threaded code. And the number of XML files I need to p...