Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

Leszek
by Contributor
  • 7270 Views
  • 1 reply
  • 2 kudos

IDENTITY columns generating every other number when merging

Hi, I'm doing a merge into my Delta table, which has an IDENTITY column: Id BIGINT GENERATED ALWAYS AS IDENTITY. The inserted data has every other number in the id column, like this: Is this expected behavior? Is there any workaround to make the numbers increase by 1?

[image attachment]
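Gaps like this are generally expected: Delta IDENTITY columns guarantee unique, increasing values, not consecutive ones, because each writer reserves its own block of values (and a MERGE can consume reservations for rows that end up matched rather than inserted). If a strictly consecutive sequence is required, it has to be derived after the fact, e.g. with row_number() over an ordering. A minimal plain-Python sketch of that dense renumbering idea (the rows are a stand-in for the merged table):

```python
# Merged rows came back with unique but gappy identity values.
rows = [
    {"Id": 2, "name": "a"},
    {"Id": 4, "name": "b"},
    {"Id": 6, "name": "c"},
]

# Derive a dense, consecutive key by ordering on the existing
# identity column and numbering from 1 -- what row_number() over
# an ORDER BY Id window would compute in Spark SQL.
for dense_id, row in enumerate(sorted(rows, key=lambda r: r["Id"]), start=1):
    row["DenseId"] = dense_id

print([r["DenseId"] for r in rows])  # → [1, 2, 3]
```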
Latest Reply
Dataspeaksss
New Contributor II
  • 2 kudos

Were you able to resolve it? I'm facing the same issue.

Enthusiastic_Da
by New Contributor II
  • 6688 Views
  • 0 replies
  • 0 kudos

How to read columns dynamically using PySpark

I have a table called MetaData, and the columns needed in the select are stored in MetaData.columns. I would like to read the columns dynamically from MetaData.columns and create a view based on that. csv_values = "col1, col2, col3, col4" df = spark.crea...
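The pattern the question is reaching for is splitting the stored column string into a list and passing that to the select. A hedged sketch of the idea in plain Python — the MetaData table, its csv_values, and the column names come from the post; the generated SQL text is illustrative:

```python
csv_values = "col1, col2, col3, col4"  # value read from MetaData.columns

# Turn the comma-separated metadata string into a clean column list.
columns = [c.strip() for c in csv_values.split(",")]

# In PySpark this list can be splatted into select, then registered
# as a view: df.select(*columns).createOrReplaceTempView("my_view").
# Here we just build the equivalent SQL text.
select_sql = f"SELECT {', '.join(columns)} FROM MetaData"
print(select_sql)  # → SELECT col1, col2, col3, col4 FROM MetaData
```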

fuselessmatt
by Contributor
  • 7924 Views
  • 3 replies
  • 0 kudos

Omitting columns in an INSERT statement does not seem to work despite meeting the requirements

We want to use the INSERT INTO command with specific columns, as specified in the official documentation. The only requirements for this are Databricks SQL warehouse version 2022.35 or higher, or Databricks Runtime 11.2 and above, and the behaviour shou...
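Per the documentation the post cites, once the runtime requirement is met an INSERT with an explicit column list lets omitted columns fall back to their DEFAULT (or NULL). A hedged sketch of the statement shape being discussed — the table and column names here are made up for illustration:

```python
def build_insert(table, cols, rows):
    """Build an INSERT INTO ... (column list) VALUES ... statement.

    Columns omitted from `cols` are left to the engine, which fills
    them with their DEFAULT (or NULL) on runtimes that support
    column-list inserts.
    """
    col_list = ", ".join(cols)
    values = ", ".join(
        "(" + ", ".join(repr(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} ({col_list}) VALUES {values}"

sql = build_insert("events", ["id", "name"], [(1, "a"), (2, "b")])
print(sql)
# → INSERT INTO events (id, name) VALUES (1, 'a'), (2, 'b')
```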

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Fusselmanwog, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...

2 More Replies
hv
by New Contributor
  • 3902 Views
  • 1 reply
  • 0 kudos

Error: "'Column' object is not callable"

I am trying to lowercase one of the columns (A_description) of a dataframe (df) and am getting the error "'Column' object is not callable". Code: def new_desc(): for line in df: line = df['A_description'].map(str.lower) return line new_desc() Have used...

Latest Reply
Chaitanya_Raju
Honored Contributor
  • 0 kudos

Hi @Himadri Verma, hope the suggestion below helps you in PySpark. Please let me know if you are looking for something else. Happy learning!
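The suggested code did not survive the page capture. The error in the question comes from mapping str.lower over a Column object; the PySpark-native fix is presumably the built-in lower function, e.g. df.withColumn("A_description", lower(col("A_description"))). The same element-wise idea, sketched in plain Python on stand-in rows:

```python
rows = [{"A_description": "Hello World"}, {"A_description": "FOO"}]

# Lowercase one field of every row, leaving the rest untouched --
# what lower(col("A_description")) does column-wise in PySpark.
lowered = [{**r, "A_description": r["A_description"].lower()} for r in rows]
print([r["A_description"] for r in lowered])  # → ['hello world', 'foo']
```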

STummala
by New Contributor
  • 1876 Views
  • 2 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @sandeep tummala, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your fe...

1 More Replies
ks1248
by New Contributor III
  • 2657 Views
  • 2 replies
  • 5 kudos

Resolved! Autoloader creates columns not present in the source

I have been exploring Autoloader to ingest gzipped JSON files from an S3 source. The notebook fails in the first run due to a schema mismatch; after re-running the notebook, the schema evolves and the ingestion runs successfully. On analysing the schema ...

Latest Reply
ks1248
New Contributor III
  • 5 kudos

Hi @Debayan Mukherjee, @Kaniz Fatma, thank you for replying to my question. I was able to figure out the issue. I was creating the schema and checkpoint folders in the same path as the source location for the autoloader. This caused the schema to ch...
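The resolution above suggests a simple guard worth automating: keep the Auto Loader schema and checkpoint locations outside the source path, otherwise the stream can pick up its own metadata files and the inferred schema drifts. A hedged sketch of such a check — the bucket layout and helper name are hypothetical:

```python
def outside_source(source_path, *metadata_paths):
    """Return True only if no metadata path sits inside the source path."""
    src = source_path.rstrip("/") + "/"
    return all(not (p.rstrip("/") + "/").startswith(src) for p in metadata_paths)

# Bad layout: schema folder inside the ingested prefix.
assert not outside_source("s3://bucket/raw", "s3://bucket/raw/_schema")
# Good layout: metadata lives in a sibling prefix.
assert outside_source("s3://bucket/raw",
                      "s3://bucket/meta/_schema",
                      "s3://bucket/meta/_checkpoint")
```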

1 More Replies
lmcglone
by New Contributor II
  • 4733 Views
  • 2 replies
  • 3 kudos

Comparing 2 dataframes and creating columns from values within a dataframe

Hi, I have a dataframe that has name and company. from pyspark.sql import SparkSession spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate() columns = ["company","name"] data = [("company1", "Jon"), ("company2", "Steve"), ("company1", "...

[image attachment]
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

You need to join and pivot:

df.join(df2, on=[df.company == df2.job_company]) \
    .groupBy("company", "name") \
    .pivot("job_company") \
    .count()

1 More Replies
yzaehringer
by New Contributor
  • 1710 Views
  • 1 reply
  • 0 kudos

GET_COLUMNS fails with "Unexpected character ('t' (code 116)): was expecting comma to separate Object entries" - how to fix?

I just ran `cursor.columns()` via the Python client and got back an `org.apache.hive.service.cli.HiveSQLException` as the response. There is also a long stack trace; I'll just paste the last bit because it might be illuminating: org.apache.spark.sql....

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 0 kudos

This can be a package issue or a runtime issue; try changing both.

weldermartins
by Honored Contributor
  • 2378 Views
  • 4 replies
  • 11 kudos

Resolved! Databricks pyspark - Find columns in xls file.

Hello everyone, every day I extract data into xls files, but the column position changes every day. Is there any way to find these columns within the file? Here's a snippet of my code: df = spark.read.format("com.crealytics.spark.excel")\ .option("hea...
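When the spreadsheet's column order shifts day to day, one approach is to read the header row and resolve the wanted columns by name rather than by position. A plain-Python sketch of that lookup — the header values and wanted names here are illustrative, not from the post:

```python
header = ["Date", "amount", "Customer Name", "Region"]  # row 0 of today's sheet
wanted = ["customer name", "amount"]

# Case-insensitive name -> position index over today's header,
# so reordered columns still resolve correctly.
positions = {name.strip().lower(): i for i, name in enumerate(header)}
indices = [positions[w] for w in wanted]
print(indices)  # → [2, 1]
```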

Latest Reply
Vidula
Honored Contributor
  • 11 kudos

Hi @welder martins, hope all is well! Just wanted to check in if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...

3 More Replies
Sam
by New Contributor III
  • 4197 Views
  • 3 replies
  • 6 kudos

Resolved! QuantileDiscretizer not respecting NumBuckets

I have set numBuckets and numBucketsArray for a group of columns to bin them into 5 buckets. Unfortunately, the number of buckets does not seem to be respected across all columns, even though there is variation within them. I have tried setting the relat...

Latest Reply
Sam
New Contributor III
  • 6 kudos

Thank you. What I did was: apply the QuantileDiscretizer to the non-zeros and specify a very small value (bottom 1%) to capture the lower bucket, including zeroes. That fixed the issue! You can define your own splits, which would work as well, but the splits thems...
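The underlying behaviour is easy to reproduce: with a zero-inflated column, several quantile cut points coincide at zero, and duplicate splits collapse into fewer buckets than numBuckets asks for — which is why bucketizing the non-zeros separately fixes it. A small illustration with the Python standard library (the data is synthetic):

```python
from statistics import quantiles

# 90% zeros plus a small tail -- a typical zero-inflated feature.
data = [0] * 90 + list(range(1, 11))

# Four interior cut points for five requested buckets.
cuts = quantiles(data, n=5)
print(cuts)  # the cut points pile up on 0

# Duplicate cut points mean fewer than 5 usable buckets.
assert len(set(cuts)) < 4
```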

2 More Replies
Raymond_Garcia
by Contributor II
  • 3929 Views
  • 3 replies
  • 5 kudos

Resolved! Manipulate Column that is an array of objects

I have a column that is an array of objects; let's call it ARRAY. Now I would like to query/manipulate the element objects without using the explode function. This is an example: for each element in that column I would like to create a path. .wit...

Latest Reply
Raymond_Garcia
Contributor II
  • 5 kudos

Hello, I am working with Scala, and I used something similar:

def play(c: Column): Column = {
  concat_ws("", lit(imagePath), lit("/"), c("field1"), lit("/"), c("field2"), lit(".ext"))
}

val variable = spark.lot_of_stuff
  .withColumn("...

2 More Replies
AmanSehgal
by Honored Contributor III
  • 6572 Views
  • 1 reply
  • 10 kudos

Resolved! How to merge all the columns into one column as JSON?

I have a task to transform a dataframe. The task is to collect all the columns in a row and embed them into a JSON string as a column. Source DF: Target DF:

[image attachments: source and target DataFrames]
Latest Reply
AmanSehgal
Honored Contributor III
  • 10 kudos

I was able to do this by converting the df to an rdd and then applying a map function to it: rdd_1 = df.rdd.map(lambda row: (row['ID'], row.asDict())) ...
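An alternative to the rdd route is PySpark's built-in to_json(struct(*df.columns)), which serializes each row straight into a JSON string column. The row-wise idea behind both approaches, sketched in plain Python on stand-in rows:

```python
import json

rows = [{"ID": 1, "name": "Jon"}, {"ID": 2, "name": "Steve"}]

# Embed each full row as a JSON string keyed by its ID --
# the same shape row.asDict() feeds into map() in the reply above.
result = [(r["ID"], json.dumps(r, sort_keys=True)) for r in rows]
print(result[0])  # → (1, '{"ID": 1, "name": "Jon"}')
```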

thushar
by Contributor
  • 4177 Views
  • 5 replies
  • 3 kudos

Resolved! dataframe.rdd.isEmpty() is throwing error in 9.1 LTS

Loaded a CSV file with five columns into a dataframe, and then added around 15+ columns using the dataframe.withColumn method. After adding that many columns, when I run df.rdd.isEmpty() it throws the below error: org.apache.spark.SparkExc...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@Thushar R - Thank you for your patience. We are looking for the best person to help you.

4 More Replies