Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Dean_Lovelace
by New Contributor III
  • 2824 Views
  • 3 replies
  • 4 kudos

What is the Pyspark equivalent of FSCK REPAIR TABLE?

I am using the Delta format and occasionally get the following error: "xx.parquet referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement." FS...

Latest Reply
shan_chandra
Databricks Employee

Delta check when a file was added:
%scala
(oldest-version-available to newest-version-available).map { version =>
  var df = spark.read.json(f"<delta-table-location>/_delta_log/$version%020d.json").where("add is not null").select("add.path")
  var ...

2 More Replies
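
For a PySpark-side equivalent, the FSCK REPAIR TABLE command can simply be issued through spark.sql; a minimal sketch, assuming a Delta table named my_delta_table (placeholder name):

# Preview which files referenced in the transaction log are missing from storage
spark.sql("FSCK REPAIR TABLE my_delta_table DRY RUN").show()
# Remove the missing-file entries from the transaction log
spark.sql("FSCK REPAIR TABLE my_delta_table")
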
Data_Engineer3
by Contributor III
  • 11589 Views
  • 4 replies
  • 5 kudos

How can I use the same Spark session from one notebook in another notebook in Databricks?

I want to use the same Spark session that was created in one notebook in another notebook within the same environment. For example, if an object (variable) is initialized in the first notebook, I need to use the same object in t...

Latest Reply
Manoj12421
Valued Contributor II

You can use %run followed by the location of the notebook: %run "/folder/notebookname"

3 More Replies
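
To illustrate the %run pattern (notebook path and variable names here are hypothetical): objects defined in the included notebook become available in the calling notebook's session.

# Notebook "/Shared/setup_notebook" defines:
shared_df = spark.range(10)
threshold = 0.5

# In the calling notebook, a cell containing only:
# %run "/Shared/setup_notebook"
# makes those objects available:
print(threshold)
shared_df.show()
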
Merchiv
by New Contributor III
  • 7578 Views
  • 4 replies
  • 0 kudos

Difference between Databricks and local PySpark split()

I have noticed some inconsistent behavior when calling the 'split' function on Databricks and on my local installation. Running it in a Databricks notebook gives:
spark.sql("SELECT split('abc', ''), size(split('abc',''))").show()
So the string is split...

Latest Reply
Anonymous
Not applicable

@Ivo Merchiers: The behavior you are seeing is likely due to differences in the underlying version of Apache Spark between your local installation and Databricks. split() is a function provided by Spark's SQL functions, and different versions of Spa...

3 More Replies
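
If a Spark version difference is the suspected cause, a quick check is to print spark.version in both environments before comparing the results of the same query; a minimal sketch:

# Run this both in the Databricks notebook and on the local installation
print(spark.version)
spark.sql("SELECT split('abc', ''), size(split('abc',''))").show()
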
Nandini
by New Contributor II
  • 12514 Views
  • 10 replies
  • 7 kudos

Pyspark: You cannot use dbutils within a spark job

I am trying to parallelise the execution of file copy in Databricks. Making use of multiple executors is one way. This is the piece of code that I wrote in PySpark:
def parallel_copy_execution(src_path: str, target_path: str):
    files_in_path = db...

Latest Reply
Etyr
Contributor

If you have a Spark session, you can use Spark's underlying Hadoop FileSystem:
# Get FileSystem from SparkSession
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
# Get Path class to convert string path to FS path
path = spark._...

9 More Replies
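
A fuller sketch of that approach, using the JVM Hadoop FileSystem from the driver; note that spark._jvm and spark._jsc are internal APIs, and the paths here are placeholders:

# Access the Hadoop FileSystem bound to this Spark session
hadoop = spark._jvm.org.apache.hadoop.fs
conf = spark._jsc.hadoopConfiguration()
fs = hadoop.FileSystem.get(conf)

src_dir = hadoop.Path("dbfs:/mnt/source")  # placeholder
dst_dir = hadoop.Path("dbfs:/mnt/target")  # placeholder

# Copy every file in the source directory to the target directory
for status in fs.listStatus(src_dir):
    hadoop.FileUtil.copy(fs, status.getPath(), fs, dst_dir, False, conf)
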
danniely
by New Contributor II
  • 12178 Views
  • 1 reply
  • 2 kudos

Pyspark RDD fails with pytest

When I call RDD APIs during pytest, it seems like the module "serializer.py" cannot find any other modules under pyspark. I've already looked this up on the internet, and it seems like pyspark modules are not properly importing the modules they refer to. I see ot...

Latest Reply
Anonymous
Not applicable

@hyunho lee: It sounds like you are encountering an issue with PySpark's serializer not being able to find the necessary modules during testing with pytest. One solution you could try is to set the PYTHONPATH environment variable to include the pat...

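
A common remedy for this class of pytest problem is a session-scoped SparkSession fixture in conftest.py, so one local Spark context serves the whole test run; a sketch, not necessarily the poster's exact setup:

# conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (SparkSession.builder
               .master("local[2]")
               .appName("pytest-pyspark")
               .getOrCreate())
    yield session
    session.stop()
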
quakenbush
by Contributor
  • 6837 Views
  • 1 reply
  • 0 kudos

Is there something like Oracle's VPD feature in Databricks?

Since I am porting some code from Oracle to Databricks, I have another specific question. In Oracle, there's something called Virtual Private Database (VPD). It's a simple security feature used to generate a WHERE clause which the system will add to a u...

Latest Reply
Anonymous
Not applicable

@Roger Bieri: In Databricks, you can use the UserDefinedFunction (UDF) feature to create a custom function that will be applied to a DataFrame. You can use this feature to add a WHERE clause to a DataFrame based on the user context. Here's an exampl...

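
Besides UDFs, Databricks dynamic views come close to VPD-style row filtering: the view body can call current_user() and is_member(), so the effective WHERE clause depends on who queries it. A sketch with hypothetical table, column, and group names:

spark.sql("""
    CREATE OR REPLACE VIEW sales_filtered AS
    SELECT *
    FROM sales
    -- members of 'managers' see all rows; others see only their own
    WHERE is_member('managers') OR owner_email = current_user()
""")
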
elgeo
by Valued Contributor II
  • 4671 Views
  • 2 replies
  • 0 kudos

Transform SQL Cursor using PySpark in Databricks

We have a cursor in DB2 which reads data from 2 tables in each loop iteration. At the end of each loop, after inserting the data into a target table, we update records related to that loop in these 2 tables before moving to the next loop. An indicative example i...

Latest Reply
Anonymous
Not applicable

Hi @ELENI GEORGOUSI, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

1 More Replies
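
As a general direction, DB2 cursor loops usually translate into set-based PySpark operations: the per-iteration insert becomes one join plus one write over the whole set, and the per-loop update becomes a single Delta UPDATE or MERGE afterwards. A rough sketch with hypothetical table and column names:

# Join the two source tables once instead of reading them per loop iteration
src = spark.table("table_a").join(spark.table("table_b"), "loop_key")

# The per-loop INSERT becomes a single append of the joined set
src.write.mode("append").saveAsTable("target_table")
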
zeta_load
by New Contributor II
  • 1538 Views
  • 1 reply
  • 1 kudos

Resolved! Unique IDs in a table are no longer unique after every x-th merge

I have two tables with unique IDs:
ID  val      ID  val
1   10       1   10
2   11       2   10
3   13       ...

Latest Reply
Anonymous
Not applicable

@Lukas Goldschmied: There are a few reasons why you might be experiencing this issue. Data skew: data skew is a common problem in distributed computing when one or more nodes in the cluster have more data to process than others. This can lead to long...

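
Whatever the root cause, a quick way to quantify the problem is to count duplicated IDs after the merge; a minimal sketch (table and column names hypothetical):

from pyspark.sql import functions as F

dup_ids = (spark.table("merged_table")
           .groupBy("ID")
           .count()
           .filter(F.col("count") > 1))
dup_ids.show()
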
beer
by New Contributor II
  • 1644 Views
  • 3 replies
  • 0 kudos

Didn't receive my Databricks Spark 3.0 certification

I took the exam yesterday and passed the test. I haven't received any email from Databricks Academy. How long would it take to receive the certification?

Latest Reply
beer
New Contributor II

This is resolved.

2 More Replies
Erik_L
by Contributor II
  • 3595 Views
  • 2 replies
  • 1 kudos

Resolved! PySpark reading multiple Parquet files fails on type expansion

Problem: reading nearly equivalent Parquet tables in a directory, where some have column X as type float and some as type double, fails.
Attempts at resolving: using streaming files; removing Delta caching and vectorization; using .cache() explicitly.
Notes: This...

Latest Reply
Anonymous
Not applicable

Hi @Erik Louie, help us build a vibrant and resourceful community by recognizing and highlighting insightful contributions. Mark the best answers and show your appreciation! Regards

1 More Replies
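
One possible workaround is to read the float-typed and double-typed file groups separately, cast to a common type, and union them; a sketch in which the paths and the column name X are placeholders:

from pyspark.sql import functions as F

df_float = (spark.read.parquet("/data/batch_float")   # files where X is float
            .withColumn("X", F.col("X").cast("double")))
df_double = spark.read.parquet("/data/batch_double")  # files where X is double

df = df_float.unionByName(df_double)
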
Harun
by Honored Contributor
  • 9550 Views
  • 2 replies
  • 0 kudos

Issue with Pyspark GroupBy GroupedData

Hi guys, I am working on streaming data movement from bronze to silver. My bronze table has an entity_name column; based on the entity_name column, I need to create multiple silver tables. I tried the approach below, but it is failing with the error "'...

Latest Reply
Anonymous
Not applicable

Hi @Harun Raseed Basheer, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best ...

1 More Replies
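
A common pattern for this kind of fan-out is foreachBatch, which hands each micro-batch to ordinary batch code that can split on entity_name; a sketch with hypothetical table and checkpoint names:

from pyspark.sql import functions as F

def route_to_silver(batch_df, batch_id):
    # Write each entity's rows to its own silver table
    entities = [r["entity_name"] for r in
                batch_df.select("entity_name").distinct().collect()]
    for name in entities:
        (batch_df.filter(F.col("entity_name") == name)
                 .write.mode("append")
                 .saveAsTable(f"silver_{name}"))

(spark.readStream.table("bronze")
      .writeStream
      .foreachBatch(route_to_silver)
      .option("checkpointLocation", "/chk/bronze_to_silver")  # placeholder
      .start())
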
Direo
by Contributor II
  • 2816 Views
  • 2 replies
  • 1 kudos

Resolved! How does pyspark work in these two scenarios?

I have two scenarios with different outcomes. Scenario 1:
from pyspark.sql.functions import *
# create sample dataframes
df1 = spark.createDataFrame([(1, 2, 3), (2, 3, 4)], ["a", "b", "c"])
df2 = spark.createDataFrame([(1, 5, 6, 7), (2, 8, 9, 10)], ["a", ...

Latest Reply
Anonymous
Not applicable

Hi @Direo Direo, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers y...

1 More Replies
nirajtanwar
by New Contributor
  • 2087 Views
  • 2 replies
  • 2 kudos

To collect the elements of a SparkDataFrame and coerce them into an R dataframe

Hello everyone, I am facing a challenge while collecting a Spark dataframe into an R dataframe. I need to do this because I am using the TraMineR algorithm, which is implemented only in R, while I have done the data pre-processing in PySpark. I am trying this: event...

Latest Reply
Anonymous
Not applicable

Hi @Niraj Tanwar, hope all is well! Just wanted to check in if you were able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you. Thank...

1 More Replies
Galdino
by New Contributor II
  • 4963 Views
  • 3 replies
  • 1 kudos

How to read JSON from BytesIO with PySpark?

I want to read JSON from an IO variable using PySpark. My code using pandas:
io = BytesIO()
ftp.retrbinary('RETR ' + file_name, io.write)
io.seek(0)
# With pandas
df = pd.read_json(io)
What I tried using PySpark, but it doesn't work:
io = BytesIO()
ftp.retrbinary('...

Latest Reply
Erik_L
Contributor II

Just use pandas and follow with spark.createDataFrame(df)

2 More Replies
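
Spelled out, that suggestion looks like the following; ftp and file_name are assumed to be set up as in the question:

import pandas as pd
from io import BytesIO

io = BytesIO()
ftp.retrbinary('RETR ' + file_name, io.write)  # download into memory
io.seek(0)

pdf = pd.read_json(io)             # parse the JSON with pandas
df = spark.createDataFrame(pdf)    # hand the result to Spark
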
weldermartins
by Honored Contributor
  • 15109 Views
  • 7 replies
  • 35 kudos

Resolved! pyspark - regexp_extract

Hello everyone, I'm creating a regex expression to fetch only the value of a string, but some values are negative. I am not able to create the rule to capture the negative value. Can you help me? from pyspark.sql.functions import regexp_extract fro...

Latest Reply
ErinArmistead
New Contributor II

Have you found the answer? If you are a student in college or school searching for free essay examples online, you may want to visit the website https://writinguniverse.com/free-essay-examples/soccer/ here you will find a vast collection of free essa...

6 More Replies
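
For readers landing here: allowing an optional leading minus inside the capture group is usually enough, e.g. a pattern like (-?\d[\d.,]*); a minimal sketch with invented sample strings:

from pyspark.sql import functions as F

df = spark.createDataFrame([("total: -1.234,56",), ("total: 789,00",)], ["raw"])
df.withColumn("value", F.regexp_extract("raw", r"(-?\d[\d.,]*)", 1)).show()
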