Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

by TrinaDe, New Contributor II
  • 3754 Views
  • 2 replies
  • 1 kudos

How can we join two PySpark dataframes side by side (without using join, i.e. the equivalent of pd.concat() in pandas)? I am trying to combine two extremely large dataframes, each on the order of 50 million rows.

My two dataframes look like new_df2_record1 and new_df2_record2, and the expected output dataframe I want is like new_df2. The code I have tried is the following: if I print the top 5 rows of new_df2, it gives the output as expected, but I cannot pri...

Latest Reply
TrinaDe
New Contributor II
  • 1 kudos

The code in a more legible format:
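A minimal sketch of the usual approach (an assumption, not necessarily the poster's exact code): Spark has no direct equivalent of pd.concat(axis=1), so give each DataFrame a synthetic row index and join the two on it.

# Assumed sketch: column-wise concat of two DataFrames via a synthetic row index.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("a", 1), ("b", 2)], ["c1", "c2"])
df2 = spark.createDataFrame([("x", 9.5), ("y", 3.1)], ["c3", "c4"])

def with_row_index(df, name="row_idx"):
    # monotonically_increasing_id() is unique but not consecutive, so rank it
    # with row_number() to get a dense 0..n-1 index usable as a join key.
    w = Window.orderBy(F.monotonically_increasing_id())
    return df.withColumn(name, F.row_number().over(w) - 1)

new_df2 = (
    with_row_index(df1)
    .join(with_row_index(df2), on="row_idx", how="inner")
    .drop("row_idx")
)
new_df2.show()

Note that an unpartitioned window shuffles all rows into a single partition, which is expensive at 50 million rows; rdd.zipWithIndex() is a common alternative at that scale.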

1 More Replies
by amitdatabricksc, New Contributor II
  • 6317 Views
  • 2 replies
  • 0 kudos

AttributeError: 'NoneType' object has no attribute 'repartition'

I am using a framework and I have a query where I am doing df = seg_df.select("*").write.option("compression", "gzip"), and I am getting the error below. When I don't do the write.option, I am not getting the error. Why is it giving me a repartition error? Wh...

Latest Reply
jose_gonzalez
Moderator
  • 0 kudos

Hi @AMIT GADHAVI, could you provide more details? For example: what is your data source, and how do you repartition?
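One plausible cause (an assumption, since the full query isn't shown): DataFrame write actions such as save() return None, so assigning their result to df and later calling df.repartition(...) raises exactly this AttributeError. A sketch:

# Hypothetical reproduction; seg_df comes from the post, everything else is assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
seg_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# .write returns a DataFrameWriter, but .save() is an action that returns None.
df = seg_df.select("*").write.option("compression", "gzip").mode("overwrite").save("/tmp/out")
print(df)  # None, so df.repartition(10) would now raise the AttributeError

# One possible fix: keep the DataFrame and the write step separate.
df = seg_df.select("*")
df.repartition(10).write.option("compression", "gzip").mode("overwrite").save("/tmp/out")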

1 More Replies
by MartinB, Contributor III
  • 8725 Views
  • 4 replies
  • 3 kudos

Resolved! Interoperability Spark ↔ Pandas: can't convert a Spark dataframe to a pandas dataframe via df.toPandas() when it contains a datetime value in the distant future

Hi, I have multiple datasets in my data lake that feature valid_from and valid_to columns indicating the validity of rows. If a row is currently valid, this is indicated by valid_to=9999-12-31 00:00:00. Example: Loading this into a Spark dataframe works fine...

Latest Reply
shan_chandra
Esteemed Contributor
  • 3 kudos

Currently, out-of-bounds timestamps are not supported in PyArrow/pandas. Please refer to the associated JIRA issue below: https://issues.apache.org/jira/browse/ARROW-5359?focusedCommentId=17104355&page=com.atlassian.jira.plugin.system.issuetabpanels%3...
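For context, pandas nanosecond timestamps top out at pd.Timestamp.max (2262-04-11 23:47:16.854775807), so the SCD2 sentinel 9999-12-31 overflows during conversion. A minimal sketch (assumed, not from the thread) of a common workaround: clamp the sentinel before calling toPandas().

# Assumed sketch: clamp out-of-range valid_to values so toPandas() succeeds.
import datetime
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, datetime.datetime(2020, 1, 1), datetime.datetime(9999, 12, 31))],
    ["id", "valid_from", "valid_to"],
)

# Cap valid_to at a pandas-representable ceiling before converting.
cap = F.lit("2262-04-11").cast("timestamp")
pdf = df.withColumn("valid_to", F.least(F.col("valid_to"), cap)).toPandas()
print(pdf)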

3 More Replies
by Anonymous, Not applicable
  • 795 Views
  • 1 reply
  • 0 kudos
Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @User16143885715632505170! My name is Kaniz, and I'm a technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the Forum have an answer to your question first. Or else I will follow up shortly with a...

  • 0 kudos