excavator-matt
Contributor

@Louis_Frolio 

I tried the Pandas on Spark approach.

How do I go from a Delta table into a Pandas on Spark DataFrame? Is this the best way?

 
import pyspark.pandas as ps

# Read the Delta table, collect it to the driver as plain pandas, then convert
projects_df = spark.read.table("my_catalog.my_schema.my_project_table")
projects_spdf = ps.from_pandas(projects_df.toPandas())
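
Incidentally, is the round trip through toPandas() even needed? From the pandas-on-Spark docs it looks like there is a more direct route. This is just an untested sketch using my table from above:

import pyspark.pandas as ps

# Read the Delta table straight into pandas-on-Spark, skipping the driver round trip
projects_spdf = ps.read_table("my_catalog.my_schema.my_project_table")

# Or convert an existing PySpark DataFrame in place (Spark 3.2+)
projects_spdf = spark.read.table("my_catalog.my_schema.my_project_table").pandas_api()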
 
My version runs past the sentence transformers bit, but when I try

projects_spdf["text_embeddings"] = np_text_embeddings.tolist()

 
I get this strange error.

/databricks/spark/python/pyspark/pandas/frame.py in ?(psdf, this_column_labels, that_column_labels)
  13467 def assign_columns(
  13468     psdf: DataFrame, this_column_labels: List[Label], that_column_labels: List[Label]
  13469 ) -> Iterator[Tuple["Series", Label]]:
> 13470     assert len(key) == len(that_column_labels)
  13471     # Note that here intentionally uses `zip_longest` that combine
  13472     # that_columns.
  13473     for k, this_label, that_label in zip_longest(

At least it isn't a memory issue, though my earlier attempt with standard pandas did at least get past this point.
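
My current guess is that pandas-on-Spark won't accept a plain Python list on the right-hand side the way regular pandas does, so the assignment has to go through a pandas-on-Spark Series. This is an untested sketch, and I'm assuming the default index lines up row-for-row with the original frame, which I'm not sure is guaranteed:

import pyspark.pandas as ps

# Needed because the new Series and the DataFrame are separate pandas-on-Spark objects
ps.set_option("compute.ops_on_diff_frames", True)

# Wrap the NumPy embeddings in a pandas-on-Spark Series before assigning
projects_spdf["text_embeddings"] = ps.Series(np_text_embeddings.tolist())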
 
Perhaps my issue is not so much sentence transformers as how to get a massive list of arrays back into a Delta table. That's why I am a bit hesitant about @jamesl 's approach, but I could give it a try. Maybe there is a lazy loading issue somewhere.
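
For the write-back part, the fallback I have in mind is to drop down to plain PySpark: pair the embeddings with a key column in a small pandas frame, let Spark turn the lists into an array column, join, and save to Delta. Rough untested sketch, where project_id and project_ids are made-up names for whatever key I'd carry through the encoding step:

import pandas as pd

# Pair each row's key with its embedding; Spark infers array<double> for the lists
emb_pdf = pd.DataFrame({
    "project_id": project_ids,  # hypothetical: keys collected before encoding
    "text_embeddings": np_text_embeddings.tolist(),
})
emb_df = spark.createDataFrame(emb_pdf)

# Join the embeddings back onto the source table and write out as Delta
result_df = spark.read.table("my_catalog.my_schema.my_project_table").join(emb_df, "project_id", "left")
result_df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.my_project_table_with_embeddings")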