Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

manugarri
by New Contributor II
  • 18607 Views
  • 11 replies
  • 2 kudos

Fuzzy text matching in Spark

I have a list of client-provided data, a list of company names. I have to match those names with an internal database of company names. The client list can fit in memory (it's about 10k elements), but the internal dataset is on HDFS and we use Spark ...

Latest Reply
Edthehead
Contributor III
  • 2 kudos

You can refer to this article: Optimizing Large-Scale Fuzzy Matching with Apache Spark and Databricks | by Gavaragirijarani | Medium. As far as open-source libraries go, rapidfuzz is known to be faster than fuzzywuzzy; a sketch of that approach follows below.

10 More Replies
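For reference, a minimal sketch of the broadcast-plus-rapidfuzz approach discussed in this thread, assuming a ~10k-element client list; the names, columns, and scorer choice are illustrative, not from the thread:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, FloatType
from rapidfuzz import process, fuzz

spark = SparkSession.builder.getOrCreate()

# The small client list fits in memory, so broadcast it to every executor
client_names = ["Acme Corp", "Globex LLC"]  # placeholder for the ~10k names
bc_names = spark.sparkContext.broadcast(client_names)

@F.udf(returnType=StructType([
    StructField("match", StringType()),
    StructField("score", FloatType()),
]))
def best_match(name):
    # extractOne returns (choice, score, index), or None if nothing matches
    hit = process.extractOne(name, bc_names.value, scorer=fuzz.token_sort_ratio)
    return (hit[0], float(hit[1])) if hit else None

internal_df = spark.createDataFrame([("ACME Corporation",)], ["company_name"])
internal_df.withColumn("best", best_match("company_name")) \
    .select("company_name", "best.match", "best.score") \
    .show(truncate=False)
```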
Nandini
by New Contributor II
  • 16028 Views
  • 12 replies
  • 7 kudos

Pyspark: You cannot use dbutils within a spark job

I am trying to parallelise the execution of file copy in Databricks. Making use of multiple executors is one way. So, this is the piece of code that I wrote in PySpark:
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = db...

Latest Reply
Etyr
Contributor
  • 7 kudos

If you have a Spark session, you can use Spark's underlying Hadoop FileSystem (a completed sketch follows below):
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._...

11 More Replies
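A completed, driver-side sketch of the JVM FileSystem approach quoted above; note that spark._jvm and spark._jsc are private APIs that may change between runtimes, and the paths are placeholders:

```
# Get FileSystem from the SparkSession's Hadoop configuration
conf = spark._jsc.hadoopConfiguration()
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(conf)

# Path converts string paths to Hadoop FS paths
Path = spark._jvm.org.apache.hadoop.fs.Path
FileUtil = spark._jvm.org.apache.hadoop.fs.FileUtil

src = Path("dbfs:/tmp/source")
dst = Path("dbfs:/tmp/target")

# Copy every file under src to dst (deleteSource=False keeps the originals)
for status in fs.listStatus(src):
    src_file = status.getPath()
    FileUtil.copy(fs, src_file, fs, Path(dst, src_file.getName()), False, conf)
```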
boskicl
by New Contributor III
  • 37255 Views
  • 8 replies
  • 11 kudos

Resolved! Table write command stuck "Filtering files for query."

Hello all. Background: I am having an issue today with Databricks, using pyspark-sql and writing a Delta table. The dataframe is made by doing an inner join between two tables, and that is the table which I am trying to write to a Delta table. The table ...

Latest Reply
nvashisth
New Contributor III
  • 11 kudos

@timo199, @boskicl I had a similar issue where the job was getting stuck at "Filtering files for query" indefinitely. I checked the Spark logs and, based on those, figured out that we had enabled Photon acceleration on our job cluster and the datatype of our columns...

7 More Replies
Kamal2
by New Contributor II
  • 25661 Views
  • 5 replies
  • 7 kudos

Resolved! PDF Parsing in Notebook

I have PDF files stored in Azure ADLS. I want to parse the PDF files into PySpark dataframes. How can I do that?

Latest Reply
Mykola_Melnyk
New Contributor III
  • 7 kudos

The PDF Data Source now works on Databricks. Instructions with an example: https://stabrise.com/blog/spark-pdf-on-databricks/ (a minimal sketch follows below).

4 More Replies
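A sketch based on the linked post, assuming the open-source spark-pdf data source is installed on the cluster; the option names and output columns follow that project's documentation and should be treated as assumptions:

```
# Read a folder of PDFs from ADLS into a page-per-row dataframe
df = (spark.read.format("pdf")
      .option("imageType", "BINARY")   # how page images are materialised
      .option("resolution", "300")     # DPI used when rendering pages
      .load("abfss://container@account.dfs.core.windows.net/pdfs/"))

# The source exposes page-level columns such as the file path and text
df.select("path", "text").show(truncate=False)
```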
jsaddam28
by New Contributor III
  • 56650 Views
  • 25 replies
  • 16 kudos

How to import local python file in notebook?

For example, I have one.py and two.py in Databricks and I want to use a module from one.py in two.py. On my local machine I usually do this with an import statement, like below in two.py: from one import module1 . . . How can I do this in Databricks? ...

Latest Reply
PabloCSD
Valued Contributor II
  • 16 kudos

This alternative worked for us: https://community.databricks.com/t5/data-engineering/is-it-possible-to-import-functions-from-a-module-in-workspace/td-p/5199. A minimal sys.path sketch follows below.

24 More Replies
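A minimal sketch of the common workaround: put the folder containing one.py on sys.path (in Repos and workspace files, the notebook's own folder is often already importable); the path is a placeholder:

```
import sys

# Make the folder that holds one.py importable
sys.path.append("/Workspace/Users/someone@example.com/my_project")

from one import module1  # now works just like on a local machine
```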
vanshikagupta
by New Contributor II
  • 8902 Views
  • 3 replies
  • 0 kudos

Conversion of code from Scala to Python

Does Databricks Community Edition provide Databricks ML visualization for PySpark, the same as provided at this link for Scala? https://docs.azuredatabricks.net/_static/notebooks/decision-trees.html Also, please help me convert this lin...

Latest Reply
thelogicplus
Contributor
  • 0 kudos

You may explore the tools and services from Travinto Technologies; they have very good tools. We explored their tool for our code conversion from Informatica, DataStage, and Ab Initio to Databricks and PySpark. We also used it for SQL queries, stored ...

2 More Replies
RantoB
by Valued Contributor
  • 28395 Views
  • 8 replies
  • 7 kudos

Resolved! Read CSV directly from URL with PySpark

I would like to load a CSV file directly into a Spark dataframe in Databricks. I tried the following code:
url = "https://opendata.reseaux-energies.fr/explore/dataset/eco2mix-national-tr/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_fo...

Latest Reply
anwangari
New Contributor II
  • 7 kudos

Hello, it's the end of 2024 and I still have this issue with Python. As mentioned, the sc method no longer works. Also, working with volumes within "/databricks/driver/" is not supported in Apache Spark. ALTERNATIVE SOLUTION (sketched below): use requests to download the file fr...

7 More Replies
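A sketch of that download-then-read workaround; the volume path is a placeholder for any location both Python and Spark can reach, and the semicolon delimiter is an assumption about this dataset:

```
import requests

url = ("https://opendata.reseaux-energies.fr/explore/dataset/"
       "eco2mix-national-tr/download/?format=csv")
local_path = "/Volumes/main/default/tmp/eco2mix.csv"

# Download with requests, then let Spark read the local copy
with open(local_path, "wb") as f:
    f.write(requests.get(url, timeout=60).content)

df = (spark.read
      .option("header", True)
      .option("sep", ";")  # assumption: this dataset is semicolon-delimited
      .csv(local_path))
```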
RateVan
by New Contributor II
  • 4138 Views
  • 4 replies
  • 0 kudos

Spark last window doesn't flush in append mode

The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (plus watermark logic). In the current implementation, if you stop the incoming streaming data, the last window will NEVER...

Latest Reply
Dtank
New Contributor II
  • 0 kudos

Do you have any solution for this?

3 More Replies
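A minimal sketch reproducing the behaviour described in the post, with the rate source standing in for the real stream; in append mode the final window is emitted only once a later event advances the watermark past the window's end:

```
from pyspark.sql import functions as F

events = (spark.readStream
          .format("rate")              # stand-in for the real source
          .option("rowsPerSecond", 10)
          .load())

agg = (events
       .withWatermark("timestamp", "1 minute")
       .groupBy(F.window("timestamp", "5 minutes"))
       .count())

# If the input stops, the last window never flushes in append mode;
# outputMode("update") emits partial results instead.
query = (agg.writeStream
         .outputMode("append")
         .format("console")
         .start())
```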
avnish26
by New Contributor III
  • 13862 Views
  • 5 replies
  • 9 kudos

Spark 3.3.0 Kafka connection problem

I am trying to connect to my Kafka from Spark but am getting an error. Kafka version: 2.4.1. Spark version: 3.3.0. I am using a Jupyter notebook to execute the PySpark code below:
```
from pyspark.sql.functions import *
from pyspark.sql.types import *
# import libr...
```

Latest Reply
jose_gonzalez
Databricks Employee
  • 9 kudos

Hi @avnish26, did you add the JAR files to the cluster? Do you still have issues? Please let us know.

4 More Replies
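For reference, the usual fix outside Databricks is to pull in the Kafka connector that matches the Spark and Scala versions; on a Databricks cluster you would attach the same Maven coordinate through the Libraries UI instead. The broker and topic below are placeholders:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-test")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
         .getOrCreate())

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .load())
```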
William_Scardua
by Valued Contributor
  • 3254 Views
  • 1 reply
  • 3 kudos

How to use Pylint to check your PySpark code quality?

Hi guys, I would like to use Pylint to check my PySpark scripts. Do you do that? Thank you!

Latest Reply
developer_lumo
New Contributor II
  • 3 kudos

Currently I am working in Databricks notebooks and have the same issue: I am unable to find a linter that is well integrated with Python, PySpark, and Databricks notebooks.

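A sketch of driving Pylint programmatically, which works from a CI step or a notebook cell once the sources are plain .py files; the file name and the disabled check are illustrative:

```
from pylint.lint import Run

# exit=False returns control instead of calling sys.exit()
run = Run(
    ["my_pipeline.py", "--disable=missing-module-docstring"],
    exit=False,
)
# Overall score out of 10 (LinterStats attribute in recent Pylint versions)
print(run.linter.stats.global_note)
```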
Arpi
by New Contributor II
  • 4892 Views
  • 4 replies
  • 4 kudos

Resolved! Database creation error

I am trying to create a database with an external abfss location but am facing the error below. AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs....

Latest Reply
source2sea
Contributor
  • 4 kudos

Changing the OAuth authentication configuration to the CLUSTER level helped me solve the problem (the config keys involved are sketched below). I wish the notebook AI bot could tell me the solution. Before the change, my configuration was at the notebook level, and it gave the error below: AnalysisException: org.apac...

3 More Replies
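For reference, a sketch of the service-principal OAuth keys involved; per the reply above they belong in the cluster's Spark config rather than in the notebook, and the spark.conf.set form is shown only to spell the keys out. The account name, secret scope, and tenant ID are placeholders:

```
storage = "mystorageaccount"
spark.conf.set(f"fs.azure.account.auth.type.{storage}.dfs.core.windows.net",
               "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage}.dfs.core.windows.net",
               dbutils.secrets.get("my-scope", "sp-client-id"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage}.dfs.core.windows.net",
               dbutils.secrets.get("my-scope", "sp-client-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```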
DJey
by New Contributor III
  • 21536 Views
  • 6 replies
  • 2 kudos

Resolved! MergeSchema Not Working

Hi all, I have a scenario where my existing Delta table looks like below. Now I have incremental data with an additional column, owner (dataframe name: scdDF). Below is the code snippet to merge the incremental dataframe into targetTable, but the new...

Latest Reply
Amin112
New Contributor II
  • 2 kudos

In Databricks Runtime 15.2 and above, you can specify schema evolution in a merge statement using SQL or the Delta table APIs (a runnable sketch follows below):
MERGE WITH SCHEMA EVOLUTION INTO target
USING source
ON source.key = target.key
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN I...

5 More Replies
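A runnable sketch completing the truncated statement above (Databricks Runtime 15.2+); registering the incremental dataframe as a view lets the SQL reference it, and the join key is a placeholder:

```
scdDF.createOrReplaceTempView("scd_source")

spark.sql("""
    MERGE WITH SCHEMA EVOLUTION INTO targetTable AS target
    USING scd_source AS source
    ON source.key = target.key
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```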
weldermartins
by Honored Contributor
  • 10228 Views
  • 6 replies
  • 10 kudos

Resolved! Spark - API Jira

Hello guys. I use PySpark in my daily work. A demand has arisen to collect information from Jira. I was able to do this via Talend ESB, but I wouldn't want to use different tools to get the job done. Do you have any example of how to extract data from ...

Latest Reply
Marty73
New Contributor II
  • 10 kudos

Hi, there is also a new Databricks for Jira add-on on the Atlassian Marketplace. It is easy to set up, and exports are created directly within Jira. They can be one-time, scheduled, or real-time. It can also export additional Jira data such as Assets, C...

5 More Replies
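Answering the original question, a sketch of pulling issues straight from Jira's REST search endpoint with requests and landing them in a Spark dataframe; the site URL, JQL, and credentials are placeholders:

```
import requests

resp = requests.get(
    "https://your-domain.atlassian.net/rest/api/2/search",
    params={"jql": "project = DEMO", "maxResults": 100},
    auth=("user@example.com", "<api-token>"),
    timeout=30,
)
resp.raise_for_status()
issues = resp.json()["issues"]

# Flatten the fields of interest into rows
rows = [(i["key"], i["fields"]["summary"], i["fields"]["status"]["name"])
        for i in issues]
df = spark.createDataFrame(rows, ["key", "summary", "status"])
df.show(truncate=False)
```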
greyfine
by New Contributor II
  • 13894 Views
  • 5 replies
  • 5 kudos

Hi everyone, I was wondering if it is possible to set up query-level alerts for PySpark notebooks that run on a schedule in Databricks, so that if we get some expected result from them we can receive a mail alert?

Above you can see we have three workspaces. We have the alert option available in the SQL workspace but not in our Data Science and Engineering workspace. Is there any way we can incorporate this into our DS and Engineering workspace?

Latest Reply
JKR
Contributor
  • 5 kudos

How can I receive a call on Teams/phone/Slack if any job fails?

4 More Replies
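On the follow-up about failure notifications: a sketch of attaching them to an existing job via the Jobs 2.1 API; the workspace URL, token, job ID, and destination ID are placeholders, and Teams/Slack destinations are created beforehand under the workspace's notification destination settings:

```
import requests

host = "https://<your-workspace-url>"
headers = {"Authorization": "Bearer <personal-access-token>"}

requests.post(
    f"{host}/api/2.1/jobs/update",
    headers=headers,
    json={
        "job_id": 123,
        "new_settings": {
            "email_notifications": {"on_failure": ["me@example.com"]},
            "webhook_notifications": {
                "on_failure": [{"id": "<notification-destination-id>"}]
            },
        },
    },
    timeout=30,
)
```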
Mado
by Valued Contributor II
  • 18540 Views
  • 4 replies
  • 3 kudos

Resolved! Using "Select Expr" and "Stack" to Unpivot PySpark DataFrame doesn't produce expected results

I am trying to unpivot a PySpark DataFrame, but I don't get the correct results. Sample dataset:
# Prepare Data
data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304...

Latest Reply
lukeoz
New Contributor III
  • 3 kudos

You can also use backticks around the column names that would otherwise be recognised as numbers (a runnable version follows below):
from pyspark.sql import functions as F

unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
unPivotDF = df.select("C...

3 More Replies
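A self-contained version of the fix above, with the year columns assumed from the thread's sample data:

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304)]
df = spark.createDataFrame(data, ["Country", "2018", "2019", "2020"])

# Backticks stop Spark from parsing the column names as numeric literals
unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
df.select("Country", F.expr(unpivotExpr)).show()
```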