Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi, I'm doing a MERGE into my Delta table, which has an IDENTITY column: Id BIGINT GENERATED ALWAYS AS IDENTITY. The id column of the inserted data contains only every other number. Is this expected behavior? Is there any workaround to make the numbers increase by 1?
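Gaps are expected: Delta identity columns guarantee unique, increasing values, not consecutive ones, since tasks reserve blocks of ids. A hedged sketch of a common workaround (table and column names are made up) is to derive a dense number over the final table:

    # A sketch, assuming a hypothetical table `demo` with an identity column Id.
    # The view assigns gap-free numbers on read instead of fighting the identity column.
    spark.sql("""
    CREATE OR REPLACE VIEW demo_dense AS
    SELECT row_number() OVER (ORDER BY Id) AS dense_id, payload
    FROM demo
    """)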
I have already improved our ETL performance a lot (20x!), but I still want to know where I can improve further. It seems that table stats and column indexing slow down writes a bit, so I want to decrease dataSkippingNumIndexedCols to match t...
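For reference, a minimal sketch of lowering that setting on a single table (the table name is hypothetical); this limits how many leading columns Delta collects data-skipping statistics for on write:

    # Collect stats only for the first 5 columns of this (hypothetical) table
    spark.sql("""
    ALTER TABLE my_etl_table
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5')
    """)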
I have a table called MetaData, and the columns needed in the select are stored in MetaData.columns. I would like to read the columns dynamically from MetaData.columns and create a view based on that.
csv_values = "col1, col2, col3, col4"
df = spark.crea...
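A hedged sketch of one way to do this, assuming MetaData stores a comma-separated string per target table (the `table_name` filter and `sales` table are invented for illustration):

    # Fetch the comma-separated column list from the metadata table
    cols_row = spark.sql(
        "SELECT columns FROM MetaData WHERE table_name = 'sales'"
    ).first()
    col_list = [c.strip() for c in cols_row["columns"].split(",")]

    # Select only those columns and expose the result as a view
    spark.table("sales").select(*col_list).createOrReplaceTempView("sales_view")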
We want to use the INSERT INTO command with specific columns, as specified in the official documentation. The only requirements for this are Databricks SQL warehouse version 2022.35 or higher, or Databricks Runtime 11.2 and above, and the behaviour shou...
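For reference, a minimal sketch of the column-list INSERT syntax in question (table and column names are hypothetical); on the supported runtimes, omitted columns should receive NULL or their declared DEFAULT:

    spark.sql("""
    INSERT INTO target_table (col1, col3)
    VALUES ('a', 42)
    """)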
Hi @Fusselmanwog, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...
I am trying to lowercase one of the columns (A_description) of a dataframe (df) and am getting the error "'Column' object is not callable".
Code:
    def new_desc():
        for line in df:
            line = df['A_description'].map(str.lower)
        return line
    new_desc()
Have used...
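The error comes from treating a PySpark Column like a callable/iterable Python object; pandas-style .map(str.lower) does not apply to Columns. A minimal sketch of the usual fix with the built-in lower function:

    from pyspark.sql import functions as F

    # Replace the column with its lowercased version
    df = df.withColumn("A_description", F.lower(F.col("A_description")))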
Hi @sandeep tummala, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking "Select As Best" if it does. Your fe...
I have been exploring Autoloader to ingest gzipped JSON files from an S3 source. The notebook fails on the first run due to a schema mismatch; after re-running the notebook, the schema evolves and the ingestion runs successfully. On analysing the schema ...
Hi @Debayan Mukherjee, @Kaniz Fatma, thank you for replying to my question. I was able to figure out the issue. I was creating the schema and checkpoint folders in the same path as the source location for the autoloader. This caused the schema to ch...
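A minimal Auto Loader sketch of the fix described above; the bucket paths and table name are hypothetical, and the point is that schemaLocation and checkpointLocation live outside the ingested source prefix:

    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # schema tracking lives outside the ingested prefix
        .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/events")
        .load("s3://bucket/raw/events/")
    )

    (stream.writeStream
        # checkpoint also lives outside the ingested prefix
        .option("checkpointLocation", "s3://bucket/_checkpoints/events")
        .toTable("bronze.events"))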
Hi, I have a dataframe that has name and company:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
    columns = ["company","name"]
    data = [("company1", "Jon"), ("company2", "Steve"), ("company1", "...
I just ran `cursor.columns()` via the Python client and got back an `org.apache.hive.service.cli.HiveSQLException` as the response. There is also a long stack trace; I'll just paste the last bit because it might be illuminating: org.apache.spark.sql....
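For comparison, a hedged sketch of the same metadata call with the databricks-sql-connector, using placeholder connection details; as I understand the connector's API, explicit catalog/schema/table filters narrow what the server has to describe, which can help isolate the failing object:

    from databricks import sql  # pip install databricks-sql-connector

    # Placeholder connection details
    with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
        with conn.cursor() as cursor:
            # Scope the metadata request instead of listing every column in the workspace
            cursor.columns(catalog_name="main", schema_name="default", table_name="my_table")
            for row in cursor.fetchall():
                print(row)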
I have a notebook with the code below, where I try to do an upsert into a dimension table and only include one column from the source table. I get an error even though I think the syntax matches what I see in the docs. How can I write this in the cor...
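A hedged sketch of a MERGE that inserts only one column from the source (table and column names are made up); in Delta, columns omitted from the INSERT list are set to NULL:

    spark.sql("""
    MERGE INTO dim_customer AS t
    USING staging_customer AS s
    ON t.customer_id = s.customer_id
    WHEN NOT MATCHED THEN
      INSERT (customer_id) VALUES (s.customer_id)
    """)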
Hello everyone, every day I extract data into xls files, but the column positions change every day. Is there any way to find these columns within the file? Here's a snippet of my code:
df = spark.read.format("com.crealytics.spark.excel")\
    .option("hea...
Hi @welder martins, hope all is well! Just wanted to check in to see whether you were able to resolve your issue; if so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Tha...
I have set numBuckets and numBucketsArray for a group of columns to bin them into 5 buckets. Unfortunately, the number of buckets does not seem to be respected across all columns, even though there is variation within them. I have tried setting the relat...
Thank you. What I did was apply the QuantileDiscretizer to the non-zeros and specify a very small value (bottom 1%) to capture the lower bucket, including zeroes. That fixed the issue! You can define your own splits, which would work as well, but the splits thems...
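For context, a minimal multi-column QuantileDiscretizer sketch (column names are made up). With skewed data such as many zeros, duplicate quantile boundaries are collapsed, which is why fewer buckets than requested can appear:

    from pyspark.ml.feature import QuantileDiscretizer

    qd = QuantileDiscretizer(
        inputCols=["a", "b"],
        outputCols=["a_bin", "b_bin"],
        numBucketsArray=[5, 5],
        relativeError=0.0001,  # tighter quantile approximation
    )
    binned = qd.fit(df).transform(df)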
I have a column that is an array of objects; let's call it ARRAY. I would like to query/manipulate the element objects without using the explode function. As an example, for each element in that column I would like to create a path: .wit...
Hello, I am working with Scala, and I used something similar:
    def play(col: Column): Column = {
      concat_ws("", lit(imagePath), lit("/"), col("field1"), lit("/"), col("field2"), lit(".ext"))
    }
    val variable = spark.lot_of_stuff.
      .withColumn("...
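A PySpark equivalent sketch using the built-in transform function to build a path per array element without explode (the base path and struct field names are assumptions):

    from pyspark.sql import functions as F

    # Build "/images/<field1>/<field2>.ext" for each element of the array column
    df = df.withColumn(
        "paths",
        F.transform(
            F.col("ARRAY"),
            lambda e: F.concat_ws(
                "/", F.lit("/images"), e["field1"], F.concat(e["field2"], F.lit(".ext"))
            ),
        ),
    )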
I have a task to transform a dataframe: collect all the columns in a row and embed them into a JSON string as a column.
Source DF: (example not shown)
Target DF: (example not shown)
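The usual approach, sketched here, is to_json over a struct of all columns; the output column name is arbitrary:

    from pyspark.sql import functions as F

    # Pack every column of the row into one JSON string column
    df_json = df.withColumn(
        "json_payload",
        F.to_json(F.struct(*[F.col(c) for c in df.columns])),
    )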
I loaded a csv file with five columns into a dataframe and then added around 15+ columns using the dataframe.withColumn method. After adding that many columns, when I run df.rdd.isEmpty(), it throws the error below: org.apache.spark.SparkExc...
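Without the full stack trace it is hard to be definitive, but a common mitigation sketch is to avoid the DataFrame-to-RDD conversion, which forces extra plan execution:

    # head(1) checks emptiness without converting to an RDD
    is_empty = len(df.head(1)) == 0

    # On Spark 3.3+, DataFrame.isEmpty() is also available:
    # is_empty = df.isEmpty()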