Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by vanshikagupta (New Contributor II)
  • 7622 Views
  • 3 replies
  • 0 kudos

Conversion of code from Scala to Python

Does Databricks Community Edition provide the Databricks ML visualization for PySpark, the same as provided in this link for Scala? https://docs.azuredatabricks.net/_static/notebooks/decision-trees.html Also, please help me to convert this lin...

Latest Reply
thelogicplus
New Contributor III
  • 0 kudos

You may explore the tools and services from Travinto Technologies. They have very good tools. We explored their tool for our code conversion from Informatica, DataStage, and Ab Initio to Databricks, PySpark. We also used it for SQL queries, stored ...
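For the PySpark side of the original question, here is a minimal sketch of a decision-tree pipeline with pyspark.ml; the DataFrame df, its feature columns, and the label column are placeholder assumptions, not the notebook from the link:

```python
# Minimal sketch of a decision-tree pipeline in PySpark (pyspark.ml).
# `df`, the feature columns f1-f3, and the label column are assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)
pipeline = Pipeline(stages=[assembler, dt])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
model.transform(test).select("label", "prediction").show(5)
```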

2 More Replies
by RantoB (Valued Contributor)
  • 22514 Views
  • 8 replies
  • 7 kudos

Resolved! Read CSV directly from URL with PySpark

I would like to load a CSV file directly into a Spark DataFrame in Databricks. I tried the following code:
url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_fo...

Latest Reply
anwangari
New Contributor II
  • 7 kudos

Hello, it's the end of 2024 and I still have this issue with Python. As mentioned, the sc method no longer works. Also, working with volumes within "/databricks/driver/" is not supported in Apache Spark. ALTERNATIVE SOLUTION: Use requests to download the file fr...
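A minimal sketch of that alternative solution, assuming a hypothetical Unity Catalog volume path; the ';' separator is also an assumption about this dataset:

```python
# Download the CSV with requests, then read it with Spark. The volume path
# and the ';' separator are assumptions; use any path Spark can read.
import requests

url = ("https://opendata.reseaux-energies.fr/explore/dataset/"
       "eco2mix-national-tr/download/?format=csv")
local_path = "/Volumes/my_catalog/my_schema/my_volume/eco2mix.csv"

with open(local_path, "wb") as f:
    f.write(requests.get(url, timeout=120).content)

df = spark.read.csv(local_path, header=True, sep=";", inferSchema=True)
df.show(5)
```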

7 More Replies
by RateVan (New Contributor II)
  • 2520 Views
  • 4 replies
  • 0 kudos

Spark last window doesn't flush in append mode

The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (plus watermark logic). In the current implementation, if you stop incoming streaming data, the last window will NEVER...
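For context, a minimal sketch of the setup being described, using a rate source as a stand-in (source and column names are assumptions):

```python
# Tumbling-window aggregation with a watermark in append mode. In append
# mode a window is only emitted once the watermark passes its end, which
# needs newer events to arrive -- the behaviour described above.
from pyspark.sql import functions as F

events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "ts"))

counts = (events
          .withWatermark("ts", "1 minute")
          .groupBy(F.window("ts", "5 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("append")
         .format("memory")
         .queryName("windowed_counts")
         .start())
```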

Latest Reply
Dtank
New Contributor II
  • 0 kudos

Do you have any solution for this?

3 More Replies
by avnish26 (New Contributor III)
  • 11113 Views
  • 5 replies
  • 8 kudos

Spark 3.3.0 Kafka connection problem

I am trying to connect to my Kafka from Spark but am getting an error. Kafka version: 2.4.1. Spark version: 3.3.0. I am using a Jupyter notebook to execute the PySpark code below:
```
from pyspark.sql.functions import *
from pyspark.sql.types import *
#import libr...
```

Latest Reply
jose_gonzalez
Databricks Employee
  • 8 kudos

Hi @avnish26, did you add the JAR files to the cluster? Do you still have issues? Please let us know.
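A common cause of this error is a missing Kafka connector. A minimal sketch of attaching it at session startup; the package coordinate assumes Spark 3.3.0 built against Scala 2.12, and the broker and topic are placeholders:

```python
# Attach the Spark-Kafka connector and read a stream. The coordinate below
# assumes Spark 3.3.0 on Scala 2.12; broker and topic are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-read")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
         .getOrCreate())

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .load())
```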

4 More Replies
by Kamal2 (New Contributor II)
  • 16619 Views
  • 3 replies
  • 4 kudos

Resolved! PDF Parsing in Notebook

I have PDF files stored in Azure ADLS. I want to parse the PDF files into PySpark DataFrames. How can I do that?

Latest Reply
Mykola_Melnyk
New Contributor II
  • 4 kudos

Please look at the PDF DataSource for Apache Spark. This project provides a custom data source for Apache Spark that allows you to read PDF files into a Spark DataFrame. And here is a notebook with an example of usage:
df = spark.read.format("pdf") \ ...
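A minimal hedged sketch, assuming the spark-pdf data source library is installed on the cluster; the path is a placeholder, and the output schema and options should be checked against the project's documentation:

```python
# Assumes the PDF DataSource (spark-pdf) library is attached to the cluster.
# The ADLS path is a placeholder; inspect the schema for the exact columns
# this data source produces.
df = (spark.read.format("pdf")
      .load("abfss://container@account.dfs.core.windows.net/pdfs/*.pdf"))

df.printSchema()
df.show(5)
```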

2 More Replies
by William_Scardua (Valued Contributor)
  • 2439 Views
  • 1 reply
  • 3 kudos

How to use Pylint to check your PySpark code quality?

Hi guys, I would like to use Pylint to check my PySpark scripts. Do you do that? Thank you!

Latest Reply
developer_lumo
New Contributor II
  • 3 kudos

Currently I am working in Databricks notebooks and have the same issue: I am unable to find a linter that is well integrated with Python, PySpark, and Databricks notebooks.
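One workable baseline is plain Pylint run over .py sources, loosening the checks that notebook-style PySpark code tends to trip. A sketch; the script path and the disabled checks are starting-point assumptions to tune:

```python
# Run Pylint programmatically over a PySpark script. The script path and
# the disabled checks are assumptions, not a standard.
from pylint.lint import Run

Run([
    "my_pyspark_job.py",
    "--disable=C0114,C0115,C0116",   # missing-docstring checks
    "--good-names=df,F,sc",          # short names common in PySpark code
], exit=False)
```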

by Arpi (New Contributor II)
  • 3548 Views
  • 4 replies
  • 4 kudos

Resolved! Database creation error

I am trying to create a database with an external abfss location but am facing the below error:
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs....

Latest Reply
source2sea
Contributor
  • 4 kudos

Changing the OAuth authentication configuration to CLUSTER level helped me solve the problem. I wish the notebook AI bot could tell me the solution. Before the change, my configuration was at the notebook level, and it had the below error:
AnalysisException: org.apac...
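For reference, these are the ABFS OAuth keys such a setup typically involves. Per the reply, they belong in the cluster's Spark config rather than the notebook; they are shown as spark.conf.set calls only for readability, and all values are placeholders:

```python
# ADLS OAuth (service principal) configuration. Put these keys in the
# CLUSTER Spark config per the reply above; all values are placeholders.
storage = "mystorageaccount"
spark.conf.set(f"fs.azure.account.auth.type.{storage}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage}.dfs.core.windows.net",
               "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage}.dfs.core.windows.net",
               "<client-secret>")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```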

3 More Replies
by DJey (New Contributor III)
  • 13897 Views
  • 6 replies
  • 2 kudos

Resolved! MergeSchema Not Working

Hi All, I have a scenario where my existing Delta table looks like below. Now I have incremental data with an additional column, i.e. owner (DataFrame name: scdDF). Below is the code snippet to merge the incremental DataFrame into targetTable, but the new...

Latest Reply
Amin112
New Contributor II
  • 2 kudos

In Databricks Runtime 15.2 and above, you can specify schema evolution in a merge statement using SQL or Delta table APIs:
MERGE WITH SCHEMA EVOLUTION INTO target
USING source
ON source.key = target.key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN I...
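The complete statement the reply is quoting, runnable from a notebook; target and source are placeholder table names:

```python
# MERGE with automatic schema evolution (DBR 15.2+); table names are
# placeholders.
spark.sql("""
    MERGE WITH SCHEMA EVOLUTION INTO target
    USING source
    ON source.key = target.key
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```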

5 More Replies
by weldermartins (Honored Contributor)
  • 7017 Views
  • 6 replies
  • 10 kudos

Resolved! Spark - API Jira

Hello guys. I use pyspark in my daily life. A demand has arisen to collect information in Jira. I was able to do this via Talend ESB, but I wouldn't want to use different tools to get the job done. Do you have any example of how to extract data from ...

Latest Reply
Marty73
New Contributor II
  • 10 kudos

Hi, there is also a new Databricks for Jira add-on on the Atlassian Marketplace. It is easy to set up, and exports are created directly within Jira. They can be one-time, scheduled, or real-time. It can also export additional Jira data such as Assets, C...
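For anyone who would rather stay in PySpark, a minimal sketch that pulls issues from the standard Jira REST search endpoint with requests; the site URL, credentials, JQL, and selected fields are all assumptions:

```python
# Pull issues from the Jira REST API and load them into a DataFrame.
# URL, auth, JQL, and field names are placeholder assumptions.
import requests

resp = requests.get(
    "https://your-domain.atlassian.net/rest/api/2/search",
    params={"jql": "project = MYPROJ", "maxResults": 100},
    auth=("user@example.com", "<api-token>"),
    timeout=60,
)
issues = [
    (i["key"], i["fields"]["status"]["name"], i["fields"]["summary"])
    for i in resp.json()["issues"]
]
df = spark.createDataFrame(issues, ["key", "status", "summary"])
df.show()
```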

5 More Replies
by greyfine (New Contributor II)
  • 11070 Views
  • 5 replies
  • 5 kudos

Hi everyone, I was wondering if it is possible to set up query-level alerts for PySpark notebooks that run on a schedule in Databricks, so that if we get some expected result from them we can receive a mail alert?

Above you can see we have 3 workspaces. We have the alert option available in the SQL workspace, but not in our Data Science and Engineering workspace. Is there any way we can incorporate this in our DS and Engineering workspace?

Latest Reply
JKR
Contributor
  • 5 kudos

How can I receive a call on Teams/phone/Slack if any job fails?
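For the email part, job-level failure notifications can be set through the Jobs 2.1 API. A hedged sketch where the workspace host, token, and job_id are placeholders; Teams or Slack alerts are usually wired through workspace notification destinations instead:

```python
# Add an on-failure email notification to an existing job via the Jobs 2.1
# API. Workspace host, token, and job_id are placeholders.
import requests

requests.post(
    "https://<workspace-host>/api/2.1/jobs/update",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "job_id": 123,
        "new_settings": {
            "email_notifications": {"on_failure": ["me@example.com"]}
        },
    },
    timeout=60,
)
```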

4 More Replies
by Mado (Valued Contributor II)
  • 13858 Views
  • 4 replies
  • 3 kudos

Resolved! Using "Select Expr" and "Stack" to Unpivot PySpark DataFrame doesn't produce expected results

I am trying to unpivot a PySpark DataFrame, but I don't get the correct results. Sample dataset:
# Prepare Data
data = [("Spain", 101, 201, 301), \
    ("Taiwan", 102, 202, 302), \
    ("Italy", 103, 203, 303), \
    ("China", 104, 204, 304...

Latest Reply
lukeoz
New Contributor II
  • 3 kudos

You can also use backticks around the column names that would otherwise be recognised as numbers:
from pyspark.sql import functions as F
unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
unPivotDF = df.select("C...
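Putting that together as a self-contained example, with the sample data from the question (the column names Country/2018/2019/2020 are assumed from the thread):

```python
# Self-contained unpivot with stack(); backticks keep the numeric column
# names from being parsed as literals. Column names assumed from the thread.
from pyspark.sql import functions as F

data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304)]
df = spark.createDataFrame(data, ["Country", "2018", "2019", "2020"])

unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
df.select("Country", F.expr(unpivotExpr)).show()
```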

3 More Replies
by Braxx (Contributor II)
  • 9521 Views
  • 6 replies
  • 2 kudos

Resolved! Issue with group by

I am trying to group a DataFrame by "PRODUCT" and "MARKET" and aggregate the remaining columns specified in col_list. There are many more columns in the list, but for simplicity let's take the example below. Unfortunately I am getting the error: "TypeError:...

Latest Reply
Ralphma
New Contributor II
  • 2 kudos

The error you're encountering, "TypeError: unhashable type: 'Column'," is likely due to the way you're defining exprs. In Python, sets use curly braces {}, but they require their items to be hashable. Since the result of sum(x).alias(x) is not hashab...
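Concretely, the fix the reply points to is building exprs as a list instead of a set; a short sketch (the col_list contents are assumed from the question):

```python
# Build the aggregation expressions as a list, not a set: Column objects
# are unhashable, so {...} fails. col_list values are assumptions.
from pyspark.sql import functions as F

col_list = ["SALES", "UNITS"]
exprs = [F.sum(c).alias(c) for c in col_list]
result = df.groupBy("PRODUCT", "MARKET").agg(*exprs)
```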

5 More Replies
by Nastasia (New Contributor II)
  • 4432 Views
  • 2 replies
  • 1 kudos

Why is Spark creating multiple jobs for one action?

I noticed that when launching this bunch of code with only one action, I have three jobs that are launched.
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import avg
data:...

Latest Reply
RKNutalapati
Valued Contributor
  • 1 kudos

The above code will create two jobs.
JOB-1: dataframe: DataFrame = spark.createDataFrame(data=data, schema=schema)
The createDataFrame function is responsible for inferring the schema from the provided data or using the specified schema. Depending on the...

1 More Replies
by yutaro_ono1_558 (New Contributor II)
  • 10385 Views
  • 2 replies
  • 1 kudos

How to read data from an S3 Access Point with PySpark?

I want to read data from an S3 Access Point. I successfully accessed the data through the S3 Access Point using the boto3 client:
s3 = boto3.resource('s3')
ap = s3.Bucket('arn:aws:s3:[region]:[aws account id]:accesspoint/[S3 Access Point name]')
for obj in ap.object...

Latest Reply
shrestha-rj
New Contributor II
  • 1 kudos

I'm reaching out to seek assistance as I navigate an issue. Currently, I'm trying to read JSON files from an S3 Multi-Region Access Point using a Databricks notebook. While reading directly from the S3 bucket presents no challenges, I encounter an "j...
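One avenue worth trying is the s3a access-point support in recent Hadoop clients (3.3.1+), which maps a bucket-style name to an access point ARN. A hedged sketch where every name is a placeholder; the setting may need to go in the cluster Spark config with a spark.hadoop. prefix instead:

```python
# Map the bucket-style name "my-ap" to an S3 Access Point ARN (Hadoop
# 3.3.1+ s3a feature). All names are placeholders; if the session-level
# setting is ignored, set it in the cluster config as
# spark.hadoop.fs.s3a.bucket.my-ap.accesspoint.arn.
spark.conf.set(
    "fs.s3a.bucket.my-ap.accesspoint.arn",
    "arn:aws:s3:us-east-1:123456789012:accesspoint/my-ap",
)
df = spark.read.json("s3a://my-ap/path/to/data/")
df.show(5)
```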

1 More Replies
by Braxx (Contributor II)
  • 11647 Views
  • 3 replies
  • 1 kudos

Resolved! How to kill the execution of a notebook at a specific cell?

Let's say I want to check whether a condition is false and then stop the execution of the rest of the script. I tried two approaches:
1) raising an exception:
if not data_input_cols.issubset(data.columns):
    raise Exception("Missing column or column's name mis...

Latest Reply
Invasioned
New Contributor II
  • 1 kudos

In Jupyter notebooks or similar environments, you can stop the execution of a notebook at a specific cell by raising an exception. However, you need to handle the exception properly to ensure the execution stops. The issue you're encountering could b...
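In Databricks specifically, a common pattern is dbutils.notebook.exit, which ends the run cleanly at that cell; a short sketch reusing the condition from the question:

```python
# dbutils.notebook.exit stops the notebook run at this point (Databricks
# only); the condition mirrors the one in the question.
if not data_input_cols.issubset(data.columns):
    dbutils.notebook.exit("Missing column or column name mismatch")
```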

2 More Replies