Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

Leszek
by Contributor
  • 7270 Views
  • 1 reply
  • 2 kudos

IDENTITY columns generating every other number when merging

Hi, I'm doing a merge into my Delta table, which has an IDENTITY column: Id BIGINT GENERATED ALWAYS AS IDENTITY. The inserted data has every other number in the id column, like this: Is this expected behavior? Is there any workaround to make the numbers increase by 1?

[image attachment]
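Gaps like this are generally expected: Delta IDENTITY columns guarantee unique, increasing values, not consecutive ones, because each writer reserves its own block of values (and a MERGE can consume reservations for rows that end up matched rather than inserted). If a strictly consecutive sequence is required, it has to be derived after the fact, e.g. with row_number() over an ordering. A minimal plain-Python sketch of that dense renumbering idea (the rows are a stand-in for the merged table):

```python
# Merged rows came back with unique but gappy identity values.
rows = [
    {"Id": 2, "name": "a"},
    {"Id": 4, "name": "b"},
    {"Id": 6, "name": "c"},
]

# Derive a dense, consecutive key by ordering on the existing
# identity column and numbering from 1 -- what row_number() over
# an ORDER BY Id window would compute in Spark SQL.
for dense_id, row in enumerate(sorted(rows, key=lambda r: r["Id"]), start=1):
    row["DenseId"] = dense_id

print([r["DenseId"] for r in rows])  # → [1, 2, 3]
```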
Latest Reply
Dataspeaksss
New Contributor II
  • 2 kudos

Were you able to resolve it? I'm facing the same issue.

Enthusiastic_Da
by New Contributor II
  • 6688 Views
  • 0 replies
  • 0 kudos

How to read columns dynamically using PySpark

I have a table called MetaData, and the columns needed in the select are stored in MetaData.columns. I would like to read the columns dynamically from MetaData.columns and create a view based on that. csv_values = "col1, col2, col3, col4" df = spark.crea...
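The pattern the question is reaching for is splitting the stored column string into a list and passing that to the select. A hedged sketch of the idea in plain Python — the MetaData table, its csv_values, and the column names come from the post; the generated SQL text is illustrative:

```python
csv_values = "col1, col2, col3, col4"  # value read from MetaData.columns

# Turn the comma-separated metadata string into a clean column list.
columns = [c.strip() for c in csv_values.split(",")]

# In PySpark this list can be splatted into select, then registered
# as a view: df.select(*columns).createOrReplaceTempView("my_view").
# Here we just build the equivalent SQL text.
select_sql = f"SELECT {', '.join(columns)} FROM MetaData"
print(select_sql)  # → SELECT col1, col2, col3, col4 FROM MetaData
```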

fuselessmatt
by Contributor
  • 7924 Views
  • 3 replies
  • 0 kudos

Omitting columns in an INSERT statement does not seem to work despite meeting the requirements

We want to use the INSERT INTO command with specific columns, as specified in the official documentation. The only requirements for this are Databricks SQL warehouse version 2022.35 or higher, or Databricks Runtime 11.2 and above, and the behaviour shou...
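Per the documentation the post cites, once the runtime requirement is met an INSERT with an explicit column list lets omitted columns fall back to their DEFAULT (or NULL). A hedged sketch of the statement shape being discussed — the table and column names here are made up for illustration:

```python
def build_insert(table, cols, rows):
    """Build an INSERT INTO ... (column list) VALUES ... statement.

    Columns omitted from `cols` are left to the engine, which fills
    them with their DEFAULT (or NULL) on runtimes that support
    column-list inserts.
    """
    col_list = ", ".join(cols)
    values = ", ".join(
        "(" + ", ".join(repr(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} ({col_list}) VALUES {values}"

sql = build_insert("events", ["id", "name"], [(1, "a"), (2, "b")])
print(sql)
# → INSERT INTO events (id, name) VALUES (1, 'a'), (2, 'b')
```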

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Fusselmanwog, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers ...

2 More Replies
hv
by New Contributor
  • 3902 Views
  • 1 reply
  • 0 kudos

Error: "'Column' object is not callable"

I am trying to lowercase one of the columns (A_description) of a dataframe (df) and am getting the error "'Column' object is not callable". Code: def new_desc(): for line in df: line = df['A_description'].map(str.lower) return line new_desc() Have used...

Latest Reply
Chaitanya_Raju
Honored Contributor
  • 0 kudos

Hi @Himadri Verma, hope the suggestion below helps you in PySpark. Please let me know if you are looking for something else. Happy learning!
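The suggested code did not survive the page capture. The error in the question comes from mapping str.lower over a Column object; the PySpark-native fix is presumably the built-in lower function, e.g. df.withColumn("A_description", lower(col("A_description"))). The same element-wise idea, sketched in plain Python on stand-in rows:

```python
rows = [{"A_description": "Hello World"}, {"A_description": "FOO"}]

# Lowercase one field of every row, leaving the rest untouched --
# what lower(col("A_description")) does column-wise in PySpark.
lowered = [{**r, "A_description": r["A_description"].lower()} for r in rows]
print([r["A_description"] for r in lowered])  # → ['hello world', 'foo']
```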

STummala
by New Contributor
  • 1876 Views
  • 2 replies
  • 0 kudos
Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @sandeep tummala, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking on "Select As Best" if it does. Your fe...

1 More Replies
ks1248
by New Contributor III
  • 2657 Views
  • 2 replies
  • 5 kudos

Resolved! Autoloader creates columns not present in the source

I have been exploring Autoloader to ingest gzipped JSON files from an S3 source. The notebook fails in the first run due to a schema mismatch; after re-running the notebook, the schema evolves and the ingestion runs successfully. On analysing the schema ...

Latest Reply
ks1248
New Contributor III
  • 5 kudos

Hi @Debayan Mukherjee, @Kaniz Fatma, thank you for replying to my question. I was able to figure out the issue. I was creating the schema and checkpoint folders in the same path as the source location for the autoloader. This caused the schema to ch...
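The resolution above suggests a simple guard worth automating: keep the Auto Loader schema and checkpoint locations outside the source path, otherwise the stream can pick up its own metadata files and the inferred schema drifts. A hedged sketch of such a check — the bucket layout and helper name are hypothetical:

```python
def outside_source(source_path, *metadata_paths):
    """Return True only if no metadata path sits inside the source path."""
    src = source_path.rstrip("/") + "/"
    return all(not (p.rstrip("/") + "/").startswith(src) for p in metadata_paths)

# Bad layout: schema folder inside the ingested prefix.
assert not outside_source("s3://bucket/raw", "s3://bucket/raw/_schema")
# Good layout: metadata lives in a sibling prefix.
assert outside_source("s3://bucket/raw",
                      "s3://bucket/meta/_schema",
                      "s3://bucket/meta/_checkpoint")
```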

1 More Replies
lmcglone
by New Contributor II
  • 4733 Views
  • 2 replies
  • 3 kudos

Comparing 2 dataframes and creating columns from values within a dataframe

Hi, I have a dataframe that has name and company. from pyspark.sql import SparkSession spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate() columns = ["company","name"] data = [("company1", "Jon"), ("company2", "Steve"), ("company1", "...

[image attachment]
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

You need to join and pivot:

df.join(df2, on=[df.company == df2.job_company]) \
    .groupBy("company", "name") \
    .pivot("job_company") \
    .count()

1 More Replies
yzaehringer
by New Contributor
  • 1710 Views
  • 1 reply
  • 0 kudos

GET_COLUMNS fails with "Unexpected character ('t' (code 116)): was expecting comma to separate Object entries" - how to fix?

I just ran `cursor.columns()` via the Python client and got back an `org.apache.hive.service.cli.HiveSQLException` as the response. There is also a long stack trace; I'll just paste the last bit because it might be illuminating: org.apache.spark.sql....

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 0 kudos

This can be a package issue or a runtime issue; try changing both.

weldermartins
by Honored Contributor
  • 2378 Views
  • 4 replies
  • 11 kudos

Resolved! Databricks pyspark - Find columns in xls file.

Hello everyone, every day I extract data into xls files, but the column position changes every day. Is there any way to find these columns within the file? Here's a snippet of my code: df = spark.read.format("com.crealytics.spark.excel")\ .option("hea...
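When the spreadsheet's column order shifts day to day, one approach is to read the header row and resolve the wanted columns by name rather than by position. A plain-Python sketch of that lookup — the header values and wanted names here are illustrative, not from the post:

```python
header = ["Date", "amount", "Customer Name", "Region"]  # row 0 of today's sheet
wanted = ["customer name", "amount"]

# Case-insensitive name -> position index over today's header,
# so reordered columns still resolve correctly.
positions = {name.strip().lower(): i for i, name in enumerate(header)}
indices = [positions[w] for w in wanted]
print(indices)  # → [2, 1]
```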

Latest Reply
Vidula
Honored Contributor
  • 11 kudos

Hi @welder martins, hope all is well! Just wanted to check in if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Tha...

3 More Replies
Sam
by New Contributor III
  • 4197 Views
  • 3 replies
  • 6 kudos

Resolved! QuantileDiscretizer not respecting NumBuckets

I have set numBuckets and numBucketsArray for a group of columns to bin them into 5 buckets. Unfortunately, the number of buckets does not seem to be respected across all columns, even though there is variation within them. I have tried setting the relat...

Latest Reply
Sam
New Contributor III
  • 6 kudos

Thank you. What I did was: apply the QuantileDiscretizer to the non-zeros and specify a very small value (bottom 1%) to capture the lower bucket, including zeroes. That fixed the issue! You can define your own splits, which would work as well, but the splits thems...
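The underlying behaviour is easy to reproduce: with a zero-inflated column, several quantile cut points coincide at zero, and duplicate splits collapse into fewer buckets than numBuckets asks for — which is why bucketizing the non-zeros separately fixes it. A small illustration with the Python standard library (the data is synthetic):

```python
from statistics import quantiles

# 90% zeros plus a small tail -- a typical zero-inflated feature.
data = [0] * 90 + list(range(1, 11))

# Four interior cut points for five requested buckets.
cuts = quantiles(data, n=5)
print(cuts)  # the cut points pile up on 0

# Duplicate cut points mean fewer than 5 usable buckets.
assert len(set(cuts)) < 4
```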

2 More Replies
Raymond_Garcia
by Contributor II
  • 3929 Views
  • 3 replies
  • 5 kudos

Resolved! Manipulate Column that is an array of objects

I have a column that is an array of objects; let's call it ARRAY. Now I would like to query/manipulate the element objects without using the explode function. This is an example: for each element in that column I would like to create a path. .wit...

Latest Reply
Raymond_Garcia
Contributor II
  • 5 kudos

Hello, I am working with Scala, and I used something similar:

def play(c: Column): Column = {
  concat_ws("", lit(imagePath), lit("/"), c("field1"), lit("/"), c("field2"), lit(".ext"))
}

val variable = spark.lot_of_stuff
  .withColumn("...

2 More Replies
AmanSehgal
by Honored Contributor III
  • 6572 Views
  • 1 reply
  • 10 kudos

Resolved! How to merge all the columns into one column as JSON?

I have a task to transform a dataframe. The task is to collect all the columns in a row and embed them into a JSON string as a column. Source DF: Target DF:

[image attachments: source and target DataFrames]
Latest Reply
AmanSehgal
Honored Contributor III
  • 10 kudos

I was able to do this by converting the df to an rdd and then applying a map function to it: rdd_1 = df.rdd.map(lambda row: (row['ID'], row.asDict())) ...
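An alternative to the rdd route is PySpark's built-in to_json(struct(*df.columns)), which serializes each row straight into a JSON string column. The row-wise idea behind both approaches, sketched in plain Python on stand-in rows:

```python
import json

rows = [{"ID": 1, "name": "Jon"}, {"ID": 2, "name": "Steve"}]

# Embed each full row as a JSON string keyed by its ID --
# the same shape row.asDict() feeds into map() in the reply above.
result = [(r["ID"], json.dumps(r, sort_keys=True)) for r in rows]
print(result[0])  # → (1, '{"ID": 1, "name": "Jon"}')
```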

thushar
by Contributor
  • 4177 Views
  • 5 replies
  • 3 kudos

Resolved! dataframe.rdd.isEmpty() is throwing error in 9.1 LTS

Loaded a CSV file with five columns into a dataframe, and then added around 15+ columns using the dataframe.withColumn method. After adding that many columns, when I run df.rdd.isEmpty() it throws the below error: org.apache.spark.SparkExc...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@Thushar R - Thank you for your patience. We are looking for the best person to help you.

4 More Replies