Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

PrasadGaikwad
by New Contributor
  • 11907 Views
  • 1 replies
  • 0 kudos

TypeError: Column is not iterable when using more than one columns in withColumn()

I am trying to find the quarter start date from a date column. I get the expected result when I write it using selectExpr(), but when I add the same logic in .withColumn() I get TypeError: Column is not iterable. I am using a workaround as follows workarou...

Latest Reply
emma_s
Databricks Employee
  • 0 kudos

Hi, this is a super old question, but I'm answering in case anyone else comes across it. This isn't working because add_months expects an integer rather than a column. You can get around this by using expr() inside withColumn: from pyspark.sql.func...
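
A minimal sketch of the expr() approach the reply describes. The reply's own code is cut off, so the expression below is an assumption rather than the original answer; the column names `order_date` and `quarter_start` and both helper functions are hypothetical.

```python
def quarter_start_expr(date_col: str) -> str:
    """Build a Spark SQL expression for the first day of the quarter of `date_col`."""
    # Step back 0, 1 or 2 months to reach the first month of the quarter,
    # then truncate to the first day of that month.
    return f"trunc(add_months({date_col}, -((month({date_col}) - 1) % 3)), 'MM')"


def add_quarter_start(df, date_col: str = "order_date"):
    """Return `df` with an extra `quarter_start` column (hypothetical names)."""
    # Imported here so the expression builder can be inspected without Spark installed.
    from pyspark.sql import functions as F

    # expr() parses the whole string as SQL, so add_months receives a
    # column-derived month offset instead of a Python int.
    return df.withColumn("quarter_start", F.expr(quarter_start_expr(date_col)))
```

Because the whole computation lives in one SQL string, the same text can also be passed to selectExpr(), which is where the original poster reported it working.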

manugarri
by New Contributor II
  • 23125 Views
  • 13 replies
  • 2 kudos

Fuzzy text matching in Spark

I have a list of client-provided data, a list of company names. I have to match those names with an internal database of company names. The client list can fit in memory (it's about 10k elements) but the internal dataset is on HDFS and we use Spark ...

Latest Reply
RheaC
New Contributor II
  • 2 kudos

+1 on LLMs. I would check this article on using Similarity API instead of rapidfuzz in 2026 especially for larger/growing datasets https://medium.com/p/0854593e380a

12 More Replies
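
For the matching problem in this thread, here is a rough sketch of scoring one name against a small in-memory list. It uses Python's standard-library difflib rather than rapidfuzz or the Similarity API mentioned in the reply; the sample names, the 0.8 cutoff and the helper names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and keep only letters, digits and single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def best_match(name, candidates, cutoff=0.8):
    """Return (candidate, score) for the closest candidate, or None below the cutoff."""
    target = normalize(name)
    scored = [(c, SequenceMatcher(None, target, normalize(c)).ratio()) for c in candidates]
    best = max(scored, key=lambda pair: pair[1], default=None)
    return best if best is not None and best[1] >= cutoff else None


client_names = ["Acme Corp.", "Globex Inc"]  # hypothetical client list
print(best_match("ACME corp", client_names))  # ('Acme Corp.', 1.0)
```

Since the client list fits in memory, one option in Spark would be to broadcast it and call a function like `best_match` from a UDF over the internal dataset. That compares every internal name with every client name, so it may need a blocking step on larger data.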
Adig
by New Contributor III
  • 9385 Views
  • 6 replies
  • 17 kudos

Generate Group Id for similar deduplicate values of a dataframe column.

Input DataFrame:
KeyName | KeyCompare | Source
PapasMrtemis | PapasMrtemis | S1
PapasMrtemis | Pappas, Mrtemis | S1
Pappas, Mrtemis | PapasMrtemis | S2
Pappas, Mrtemis | Pappas, Mrtemis | S2
Mich...

Latest Reply
rafaelpoyiadzi
New Contributor II
  • 17 kudos

Hey. We’ve run into similar deduplication problems before. If the name differences are pretty minor (punctuation, spacing, small typos), fuzzy string matching can usually get you most of the way there. That kind of similarity-based clustering works f...

5 More Replies
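
The similarity-based clustering mentioned in the reply could look roughly like this in plain Python. It is only a sketch: it uses difflib from the standard library, the 0.85 threshold is an arbitrary assumption, and "Jordan Lee" is a made-up name added to show a second group.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def assign_group_ids(names, threshold=0.85):
    """Return one group id per name; names that are similar enough share an id."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    keys = [normalize(n) for n in names]
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    # Renumber group roots as 1, 2, 3, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids) + 1) for i in range(len(names))]


print(assign_group_ids(["PapasMrtemis", "Pappas, Mrtemis", "Jordan Lee"]))  # [1, 1, 2]
```

The pairwise loop is quadratic in the number of distinct names, so this fits a deduplicated key list collected to the driver, not a full DataFrame.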
Nandini
by New Contributor II
  • 18176 Views
  • 12 replies
  • 7 kudos

Pyspark: You cannot use dbutils within a spark job

I am trying to parallelise the execution of file copy in Databricks. Making use of multiple executors is one way. So, this is the piece of code that I wrote in PySpark: def parallel_copy_execution(src_path: str, target_path: str): files_in_path = db...

Latest Reply
Etyr
Contributor II
  • 7 kudos

If you have a Spark session, you can use Spark's hidden FileSystem:
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._...

11 More Replies
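
Because dbutils is only usable on the driver, a different workaround from the one in the reply above is to parallelise the copies with driver-side threads instead of a Spark job. The sketch below is not taken from the thread; `parallel_copy` and its parameters are hypothetical names, and the copy function is passed in so the helper itself has no Databricks dependency.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_copy(pairs, copy_fn, max_workers=8):
    """Run copy_fn(src, dst) for every (src, dst) pair on driver-side threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(copy_fn, src, dst) for src, dst in pairs]
        # result() re-raises any exception raised inside a worker thread.
        return [future.result() for future in futures]
```

In a notebook this could be called as `parallel_copy(pairs, dbutils.fs.cp)`. Threads suit this case because each copy mostly waits on storage I/O, so the work overlaps even though it all runs on the driver.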
boskicl
by New Contributor III
  • 41704 Views
  • 8 replies
  • 12 kudos

Resolved! Table write command stuck "Filtering files for query."

Hello all. Background: I am having an issue today with Databricks using pyspark-sql and writing a Delta table. The DataFrame is made by doing an inner join between two tables, and that is the table I am trying to write to a Delta table. The table ...

Latest Reply
nvashisth
New Contributor III
  • 12 kudos

@timo199, @boskicl I had a similar issue: the job was getting stuck at Filtering Files for Query indefinitely. I checked the Spark logs and, based on that, figured out that we had enabled Photon acceleration on our cluster for the job and the datatype of our columns...

7 More Replies
Kamal2
by Databricks Partner
  • 28166 Views
  • 5 replies
  • 7 kudos

Resolved! PDF Parsing in Notebook

I have PDF files stored in Azure ADLS. I want to parse the PDF files into PySpark DataFrames. How can I do that?

Latest Reply
Mykola_Melnyk
New Contributor III
  • 7 kudos

PDF Data Source now works on Databricks. Instructions with an example: https://stabrise.com/blog/spark-pdf-on-databricks/

4 More Replies
jsaddam28
by New Contributor III
  • 61209 Views
  • 25 replies
  • 16 kudos

How to import local python file in notebook?

For example, I have one.py and two.py in Databricks and I want to use one of the modules from one.py in two.py. Usually I do this on my local machine with an import statement like the one below. two.py: from one import module1 . . . How do I do this in Databricks?...

Latest Reply
PabloCSD
Valued Contributor II
  • 16 kudos

This alternative worked for us: https://community.databricks.com/t5/data-engineering/is-it-possible-to-import-functions-from-a-module-in-workspace/td-p/5199

24 More Replies
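
One common way to make `from one import module1` work is to put the folder that contains one.py on `sys.path` before importing. The sketch below shows the idea in plain Python; `import_from_dir` is a hypothetical helper, and the folder you pass in depends on where the files live in your workspace.

```python
import importlib
import sys


def import_from_dir(module_name: str, directory: str):
    """Import `module_name` from `directory` by adding the directory to sys.path."""
    if directory not in sys.path:
        sys.path.append(directory)
    # Make sure files created after the interpreter started are visible.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

After a call such as `one = import_from_dir("one", folder)`, the name from the question is available as `one.module1`.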
vanshikagupta
by New Contributor II
  • 9482 Views
  • 3 replies
  • 0 kudos

conversion of code from scala to python

Does Databricks Community Edition provide Databricks ML visualization for PySpark, the same as provided in this link for Scala? https://docs.azuredatabricks.net/_static/notebooks/decision-trees.html Also, please help me to convert this lin...

Latest Reply
thelogicplus
Contributor II
  • 0 kudos

You may explore the tools and services from Travinto Technologies. They have very good tools. We explored their tool for our code conversion from Informatica, DataStage and Ab Initio to Databricks and PySpark. We also used it for SQL queries, stored ...

2 More Replies
RantoB
by Valued Contributor
  • 31387 Views
  • 8 replies
  • 7 kudos

Resolved! read csv directly from url with pyspark

I would like to load a CSV file directly into a Spark DataFrame in Databricks. I tried the following code: url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_fo...

Latest Reply
anwangari
New Contributor II
  • 7 kudos

Hello, it's the end of 2024 and I still have this issue with Python. As mentioned, the sc method no longer works. Also, working with volumes within "/databricks/driver/" is not supported in Apache Spark. ALTERNATIVE SOLUTION: Use requests to download the file fr...

7 More Replies
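
The download-first approach from the latest reply can be sketched as follows. The reply's own code is cut off, so this is an assumption about its shape: it uses urllib from the standard library instead of requests, and `download` and `read_header` are hypothetical helper names.

```python
import csv
import shutil
import urllib.request


def download(url: str, dest_path: str) -> str:
    """Stream the resource at `url` to `dest_path` and return the path."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
    return dest_path


def read_header(path: str, delimiter: str = ","):
    """Return the first row of the CSV file, as a quick check of the download."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f, delimiter=delimiter))
```

Once the file is saved somewhere Spark can read, something like `spark.read.csv(path, header=True)` should load it. Which destinations qualify depends on the workspace setup, so treat the choice of path as something to verify.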
RateVan
by New Contributor II
  • 5462 Views
  • 4 replies
  • 0 kudos

Spark last window dont flush in append mode

The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (+ watermark logic). In the current implementation, if you stop incoming streaming data, the last window will NEVER...

Latest Reply
Dtank
New Contributor II
  • 0 kudos

Do you have any solution for this?

3 More Replies
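
The behaviour described in this thread can be reproduced with a toy model in plain Python. This is not Spark code and it simplifies how Spark advances the watermark between micro-batches; the class name, window size and delay are assumptions chosen for illustration.

```python
class TumblingWindowSim:
    """Toy model of append-mode tumbling windows closed by an event-time watermark."""

    def __init__(self, window_size: int, delay: int):
        self.window_size = window_size
        self.delay = delay
        self.max_event_time = None
        self.open_windows = {}  # window start -> number of events seen

    def on_event(self, event_time: int):
        """Ingest one event and return the windows finalized by the new watermark."""
        start = event_time - event_time % self.window_size
        self.open_windows[start] = self.open_windows.get(start, 0) + 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # The watermark only moves when a newer event arrives.
        watermark = self.max_event_time - self.delay
        closed = sorted(s for s in self.open_windows if s + self.window_size <= watermark)
        return [(s, s + self.window_size, self.open_windows.pop(s)) for s in closed]


sim = TumblingWindowSim(window_size=10, delay=5)
emitted = [sim.on_event(t) for t in (1, 4, 12, 16)]
print(emitted)           # [[], [], [], [(0, 10, 2)]]
print(sim.open_windows)  # {10: 2}
```

Window [0, 10) is emitted only once the event at time 16 pushes the watermark to 11. Window [10, 20) stays open after the input stops because nothing advances the watermark past 20.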
avnish26
by New Contributor III
  • 15450 Views
  • 5 replies
  • 9 kudos

Spark 3.3.0 connect kafka problem

I am trying to connect to my Kafka from Spark but am getting an error:
Kafka Version: 2.4.1
Spark Version: 3.3.0
I am using a Jupyter notebook to execute the PySpark code below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
#import libr...

Latest Reply
jose_gonzalez
Databricks Employee
  • 9 kudos

Hi @avnish26, did you add the JAR files to the cluster? Do you still have issues? Please let us know.

4 More Replies
William_Scardua
by Valued Contributor
  • 3683 Views
  • 1 replies
  • 3 kudos

How to use Pylint to check your pyspark code quality ?

Hi guys, I would like to use Pylint to check my PySpark scripts. Do you do that? Thank you.

Latest Reply
developer_lumo
New Contributor II
  • 3 kudos

I am currently working on Databricks (notebooks) and have the same issue: I am unable to find a linter that is well integrated with Python, PySpark and Databricks notebooks.

Arpi
by New Contributor II
  • 5536 Views
  • 4 replies
  • 4 kudos

Resolved! Database creation error

I am trying to create a database with an external abfss location but am facing the error below. AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs....

Latest Reply
source2sea
Contributor
  • 4 kudos

Changing the OAuth authentication configuration to the CLUSTER level helped me solve the problem. I wish the notebook AI bot could have told me the solution. Before the change, my configuration was at the notebook level, and it produced the errors below: AnalysisException: org.apac...

3 More Replies
DJey
by New Contributor III
  • 27570 Views
  • 6 replies
  • 2 kudos

Resolved! MergeSchema Not Working

Hi all, I have a scenario where my existing Delta table looks like below: Now I have incremental data with an additional column, i.e. owner: DataFrame name --> scdDF. Below is the code snippet to merge the incremental DataFrame into targetTable, but the new...

Latest Reply
Amin112
New Contributor II
  • 2 kudos

In Databricks Runtime 15.2 and above, you can specify schema evolution in a merge statement using SQL or Delta table APIs:
MERGE WITH SCHEMA EVOLUTION INTO target
USING source
ON source.key = target.key
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
I...

5 More Replies
weldermartins
by Honored Contributor
  • 11763 Views
  • 6 replies
  • 10 kudos

Resolved! Spark - API Jira

Hello guys. I use pyspark in my daily life. A demand has arisen to collect information in Jira. I was able to do this via Talend ESB, but I wouldn't want to use different tools to get the job done. Do you have any example of how to extract data from ...

  • 11763 Views
  • 6 replies
  • 10 kudos
Latest Reply
Marty73
New Contributor II
  • 10 kudos

Hi, there is also a new Databricks for Jira add-on on the Atlassian Marketplace. It is easy to set up, and exports are created directly within Jira. They can be one-time, scheduled, or real-time. It can also export additional Jira data such as Assets, C...

5 More Replies