Data Engineering

Forum Posts

Mado
by Valued Contributor II
  • 8272 Views
  • 4 replies
  • 2 kudos

Resolved! Using "Select Expr" and "Stack" to Unpivot PySpark DataFrame doesn't produce expected results

I am trying to unpivot a PySpark DataFrame, but I don't get the correct results. Sample dataset:
# Prepare Data
data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304...

Latest Reply
lukeoz
New Contributor II
  • 2 kudos

You can also use backticks around the column names that would otherwise be recognised as numbers.
from pyspark.sql import functions as F
unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
unPivotDF = df.select("C...
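
A runnable sketch of that fix, reconstructing the thread's example from the two excerpts (the column names Country/2018/2019/2020 are an assumption based on the sample data shown):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Reconstructed sample data; column names are assumed from the excerpt.
data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304)]
df = spark.createDataFrame(data, ["Country", "2018", "2019", "2020"])

# Backticks make stack() treat `2018` as a column reference instead of
# the numeric literal 2018; the quoted '2018' is the output label.
unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
unPivotDF = df.select("Country", F.expr(unpivotExpr))
unPivotDF.show()
```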

3 More Replies
Braxx
by Contributor II
  • 5054 Views
  • 6 replies
  • 2 kudos

Resolved! issue with group by

I am trying to group a DataFrame by "PRODUCT" and "MARKET" and aggregate the remaining columns specified in col_list. There are many more columns in the list, but for simplicity let's take the example below. Unfortunately I am getting the error: "TypeError:...

Latest Reply
Ralphma
New Contributor II
  • 2 kudos

The error you're encountering, "TypeError: unhashable type: 'Column'," is likely due to the way you're defining exprs. In Python, sets use curly braces {}, but they require their items to be hashable. Since the result of sum(x).alias(x) is not hashab...
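
A minimal sketch of the fix that reply describes: build exprs as a list (hashability doesn't matter for lists) and unpack it into agg(). The contents of col_list here are hypothetical stand-ins for the real column list:

```python
from pyspark.sql import functions as F

col_list = ["SALES", "UNITS"]  # hypothetical aggregation columns

# A set literal {F.sum(x).alias(x) for x in col_list} fails because Column
# objects are unhashable; a list comprehension works fine.
exprs = [F.sum(x).alias(x) for x in col_list]
result = df.groupBy("PRODUCT", "MARKET").agg(*exprs)
```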

5 More Replies
weldermartins
by Honored Contributor
  • 3634 Views
  • 5 replies
  • 10 kudos

Resolved! Spark - API Jira

Hello guys. I use pyspark in my daily life. A demand has arisen to collect information in Jira. I was able to do this via Talend ESB, but I wouldn't want to use different tools to get the job done. Do you have any example of how to extract data from ...

Latest Reply
Marty73
New Contributor II
  • 10 kudos

Hi, there is also a new Databricks for Jira add-on on the Atlassian Marketplace. It is easy to set up, and exports are created directly within Jira. They can be one-time, scheduled, or real-time. It can also export additional Jira data such as Assets, C...
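
Besides the add-on, a common do-it-yourself pattern is to call Jira's REST API from the driver and load the JSON into a Spark DataFrame. A sketch, assuming a Jira Cloud instance; the URL, credentials, JQL query, and selected fields are all placeholders:

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

JIRA_URL = "https://your-domain.atlassian.net"  # placeholder
AUTH = ("user@example.com", "api-token")        # placeholder credentials

# Jira's search endpoint returns matching issues as JSON (paged via startAt).
resp = requests.get(
    f"{JIRA_URL}/rest/api/2/search",
    params={"jql": "project = DEMO", "maxResults": 100},
    auth=AUTH,
)
resp.raise_for_status()
issues = resp.json()["issues"]

# Flatten the fields of interest into rows, then build a DataFrame.
rows = [(i["key"], i["fields"]["summary"], i["fields"]["status"]["name"])
        for i in issues]
df = spark.createDataFrame(rows, ["key", "summary", "status"])
df.show()
```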

4 More Replies
avnish26
by New Contributor III
  • 7367 Views
  • 4 replies
  • 8 kudos

Spark 3.3.0 connect kafka problem

I am trying to connect to my Kafka from Spark but am getting an error. Kafka version: 2.4.1. Spark version: 3.3.0. I am using a Jupyter notebook to execute the PySpark code below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# import libr...

Latest Reply
jose_gonzalez
Moderator
  • 8 kudos

Hi @avnish26, did you add the JAR files to the cluster? Do you still have issues? Please let us know.
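
For reference, the connector JAR this usually refers to is the spark-sql-kafka package matching the Spark and Scala versions. A sketch for the versions named in the question (Spark 3.3.0; Scala 2.12 is an assumption), with placeholder broker and topic:

```python
from pyspark.sql import SparkSession

# spark.jars.packages pulls the Kafka connector at session start; without
# it, readStream.format("kafka") fails with a missing-data-source error.
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
         .getOrCreate())

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
      .option("subscribe", "my-topic")                      # placeholder
      .load())
```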

3 More Replies
Nastasia
by New Contributor II
  • 2510 Views
  • 3 replies
  • 1 kudos

Why is Spark creating multiple jobs for one action?

I noticed that when launching this piece of code with only one action, three jobs are launched:
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import avg
data:...

Latest Reply
RKNutalapati
Valued Contributor
  • 1 kudos

The above code will create two jobs.
JOB-1: dataframe: DataFrame = spark.createDataFrame(data=data, schema=schema). The createDataFrame function is responsible for inferring the schema from the provided data or using the specified schema. Depending on the...
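
One way to verify this attribution yourself is to tag each step with a job description and match the tags against the Jobs tab in the Spark UI; a sketch (the data/schema variables are from the question's snippet, the column names are hypothetical):

```python
sc = spark.sparkContext

# Each job launched while a description is active shows up under that
# description in the Spark UI, so you can see which statement caused it.
sc.setJobDescription("createDataFrame step")
dataframe = spark.createDataFrame(data=data, schema=schema)

sc.setJobDescription("aggregate + show action")
dataframe.groupBy("group_col").agg(avg("value_col")).show()  # hypothetical columns
```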

2 More Replies
yutaro_ono1_558
by New Contributor II
  • 7709 Views
  • 5 replies
  • 1 kudos

Resolved! How to read data from an S3 Access Point with pyspark?

I want to read data from an S3 Access Point. I successfully accessed the data through the access point using the boto3 client:
s3 = boto3.resource('s3')
ap = s3.Bucket('arn:aws:s3:[region]:[aws account id]:accesspoint/[S3 Access Point name]')
for obj in ap.object...

Latest Reply
shrestha-rj
New Contributor II
  • 1 kudos

I'm reaching out to seek assistance as I navigate an issue. Currently, I'm trying to read JSON files from an S3 Multi-Region Access Point using a Databricks notebook. While reading directly from the S3 bucket presents no challenges, I encounter an "j...
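
For the Spark-native path, one approach is to map a bucket alias onto the access point ARN through the s3a connector. This is a sketch under stated assumptions: access point support landed in hadoop-aws around Hadoop 3.3.2, and the per-bucket property name below should be verified against your Databricks runtime's documentation before relying on it.

```python
# Assumption: the cluster's hadoop-aws build supports S3 Access Points and
# the fs.s3a.bucket.<alias>.accesspoint.arn property; verify before use.
ap_arn = "arn:aws:s3:us-east-1:123456789012:accesspoint/my-ap"  # placeholder
spark.conf.set("fs.s3a.bucket.my-ap.accesspoint.arn", ap_arn)

# Reads then address the access point through the alias bucket name.
df = spark.read.json("s3a://my-ap/path/to/json/")
```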

4 More Replies
Braxx
by Contributor II
  • 8759 Views
  • 3 replies
  • 1 kudos

Resolved! How to kill the execution of a notebook at a specific cell?

Let's say I want to check a condition and, if it is false, stop the execution of the rest of the script. I tried two approaches: 1) raising an exception:
if not data_input_cols.issubset(data.columns):
    raise Exception("Missing column or column's name mis...

Latest Reply
Invasioned
New Contributor II
  • 1 kudos

In Jupyter notebooks or similar environments, you can stop the execution of a notebook at a specific cell by raising an exception. However, you need to handle the exception properly to ensure the execution stops. The issue you're encountering could b...
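
On Databricks specifically, the idiomatic way to stop a run at a given cell is dbutils.notebook.exit(), which ends the notebook immediately and returns a value to any caller (a job or dbutils.notebook.run). A sketch built on the condition from the question:

```python
# dbutils.notebook.exit() stops this notebook run; nothing after it
# executes. Unlike a raised exception, it ends the run without failing it.
if not data_input_cols.issubset(data.columns):
    dbutils.notebook.exit("Missing column or column name mismatch")
```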

2 More Replies
DJey
by New Contributor III
  • 5247 Views
  • 5 replies
  • 2 kudos

Resolved! MergeSchema Not Working

Hi all, I have a scenario where my existing Delta table looks like below. Now I have incremental data with an additional column, i.e. owner (DataFrame name: scdDF). Below is the code snippet to merge the incremental DataFrame into targetTable, but the new...

Latest Reply
DJey
New Contributor III
  • 2 kudos

@Vidula Khanna Enabling the below property resolved my issue:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", True)
Thanks very much!
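
For context, a sketch of how that setting interacts with MERGE; the table name and join key below are assumptions, since the thread's actual snippet is truncated:

```python
from delta.tables import DeltaTable

# With autoMerge on, source-only columns (like the new "owner" column)
# are added to the target table's schema during the MERGE.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", True)

target = DeltaTable.forName(spark, "targetTable")  # assumed table name
(target.alias("t")
 .merge(scdDF.alias("s"), "t.id = s.id")           # assumed join key
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```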

4 More Replies
RantoB
by Valued Contributor
  • 16692 Views
  • 8 replies
  • 4 kudos

Resolved! read csv directly from url with pyspark

I would like to load a CSV file directly into a Spark DataFrame in Databricks. I tried the following code:
url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_fo...

Latest Reply
MartinIsti
New Contributor III
  • 4 kudos

I know it's a 2-year-old thread, but I needed to find a solution to this very thing today. I had one notebook using SparkContext:
from pyspark import SparkFiles
from pyspark.sql.functions import *
sc.addFile(url)
But according to the runtime 14 release n...
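
Where SparkContext/SparkFiles is no longer available (for example, shared access mode on newer runtimes), one workaround is to fetch the CSV on the driver with pandas and convert it; a sketch with a placeholder URL:

```python
import pandas as pd

# pandas downloads the file on the driver; createDataFrame distributes it.
# Fine for small-to-medium files; very large files should instead be
# staged to cloud storage and read from there.
url = "https://example.com/data.csv"  # placeholder
pdf = pd.read_csv(url)
df = spark.createDataFrame(pdf)
```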

7 More Replies
pvm26042000
by New Contributor III
  • 2026 Views
  • 4 replies
  • 2 kudos

Benefit of using vectorized pandas UDFs instead of the standard PySpark UDFs?

What is the benefit of using vectorized pandas UDFs instead of the standard PySpark UDFs?

Latest Reply
Sai1098
New Contributor II
  • 2 kudos

Vectorized pandas UDFs offer improved performance compared to standard PySpark UDFs by leveraging the power of pandas and operating on entire columns of data at once, rather than row by row. They provide a more intuitive and familiar programming inter...
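
A minimal sketch of the difference; the temperature conversion is a hypothetical example:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, udf

# Standard UDF: invoked once per row, with Python/JVM overhead each time.
@udf("double")
def to_celsius_row(f):
    return (f - 32) * 5.0 / 9.0 if f is not None else None

# Vectorized pandas UDF: invoked once per Arrow batch with a whole
# pd.Series, so the arithmetic runs as a vectorized pandas operation.
@pandas_udf("double")
def to_celsius_vec(f: pd.Series) -> pd.Series:
    return (f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(to_celsius_vec("temp_f").alias("temp_c")).show()
```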

3 More Replies
lnights
by New Contributor II
  • 2376 Views
  • 5 replies
  • 2 kudos

High cost of storage when using structured streaming

Hi there, I read data from Azure Event Hub and, after manipulating the data, I write the DataFrame back to Event Hub (I use this connector for that):
# read data
df = (spark.readStream
  .format("eventhubs")
  .options(**ehConf)
  ...

[Image: transactions in Azure storage]
Latest Reply
PetePP
New Contributor II
  • 2 kudos

I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in a number of files equal to the number of partitions. Thus, you are writing a low data volume but get taxed on the amount of write (a...
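
A sketch of the two usual mitigations that reply points at; the values and paths are placeholders, not recommendations:

```python
# Fewer shuffle partitions means fewer files (and storage transactions)
# per micro-batch when data volume is low.
spark.conf.set("spark.sql.shuffle.partitions", "8")  # workload-dependent

# Alternatively, coalesce just before the write inside foreachBatch, e.g.
# df.writeStream.foreachBatch(write_batch).start():
def write_batch(batch_df, batch_id):
    # Collapse to one output file per micro-batch before writing.
    batch_df.coalesce(1).write.format("delta").mode("append").save("/tmp/out")  # placeholder sink
```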

4 More Replies
AryaMa
by New Contributor III
  • 19305 Views
  • 13 replies
  • 8 kudos

Resolved! reading data from url using spark

Reading data from a URL using Spark (Community Edition), I got a path-related error; any suggestions please?
url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
from pyspark import SparkFiles
spark.sparkContext.addFil...

Latest Reply
padang
New Contributor II
  • 8 kudos

Sorry, bringing this back up...
from pyspark import SparkFiles
url = "http://raw.githubusercontent.com/ltregan/ds-data/main/authors.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv("file://" + SparkFiles.get("authors.csv"), header=True, inferSc...

12 More Replies
giriraj01234567
by New Contributor II
  • 2644 Views
  • 1 replies
  • 2 kudos

Getting error while running the show function

I was using StringIndexer; while fitting and transforming I didn't get any error, but while running the show function I am getting an error. I mention the error below:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 45.0 failed...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Bojja Giri, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer. Thanks.

Ajay-Pandey
by Esteemed Contributor III
  • 7010 Views
  • 7 replies
  • 11 kudos

Resolved! Unzip Files

Hi all, I am trying to unzip a file in Databricks but am facing an issue. Please help me if you have any docs or code to share.

Latest Reply
vivek_rawat
New Contributor III
  • 11 kudos

Hey Ajay, you can follow this module to unzip your zip file. To give you a brief idea: it will unzip your file directly into your driver node's storage. So if your compressed data is inside DBFS, you first have to move it to the driver node and...
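
A sketch of that flow using Python's zipfile module; the paths are placeholders. The archive is copied from DBFS to the driver's local disk, extracted there, and the results copied back:

```python
import zipfile

# 1) Bring the archive to the driver's local filesystem.
dbutils.fs.cp("dbfs:/tmp/archive.zip", "file:/tmp/archive.zip")

# 2) Extract locally on the driver.
with zipfile.ZipFile("/tmp/archive.zip", "r") as z:
    z.extractall("/tmp/unzipped")

# 3) Copy the extracted files back into DBFS.
dbutils.fs.cp("file:/tmp/unzipped", "dbfs:/tmp/unzipped", recurse=True)
```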

6 More Replies
carlosst01
by New Contributor II
  • 1207 Views
  • 2 replies
  • 2 kudos

Resolved! Running Libraries and/or modules in Databricks' lifecycle?

Hi, I have had this question for some weeks and didn't find any information about the topic. Specifically, my question is: what is the 'lifecycle', or the steps, for being able to use a new Python library in Databricks in terms of compatibility? For exam...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Carlos Caravantes, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ans...

1 More Replies