Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

narek_margaryan
by New Contributor II
  • 2533 Views
  • 1 replies
  • 3 kudos

Resolved! Do Spark nodes read data from storage in a sequence?

I'm new to Spark and trying to understand how some of its components work. I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, within their own memory (RAM). But I'm wondering whether the in...

Latest Reply
-werners-
Esteemed Contributor III
  • 3 kudos

@Narek Margaryan, normally the reading is done in parallel because the underlying file system is already distributed (if you use HDFS-based storage or something similar, such as a data lake). The number of partitions in the file itself also matters. This l...
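To make the idea of parallel reads concrete, here is a minimal sketch of how one large file can be cut into byte ranges that separate tasks read independently. The function name `plan_splits` is hypothetical, and the 128 MB default only mirrors the default of Spark's `spark.sql.files.maxPartitionBytes` setting. Spark's real planner also weighs file-open cost, the number of cores, and whether the file format is splittable at all, so treat this as an illustration rather than Spark's actual algorithm.

```python
def plan_splits(file_size, max_split_bytes=128 * 1024 * 1024):
    """Cut a file of `file_size` bytes into (offset, length) byte ranges.

    Hypothetical helper for illustration only: each range could be handed
    to a different task, so no task has to wait for another to finish reading.
    """
    splits = []
    offset = 0
    while offset < file_size:
        # The last split may be shorter than the maximum.
        length = min(max_split_bytes, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


# A 300-byte "file" with a 128-byte split size yields three ranges.
print(plan_splits(300, max_split_bytes=128))
```

Each tuple is an independent unit of work, which is why a distributed file system that can serve different byte ranges from different machines lets the read itself run in parallel.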

User16790091296
by Contributor II
  • 1581 Views
  • 1 replies
  • 0 kudos

How to read a Databricks table via the Databricks API in Python?

Using Python 3, I am trying to compare an Excel (xlsx) sheet to an identical Spark table in Databricks. I want to avoid doing the comparison in Databricks, so I am looking for a way to read the Spark table via the Databricks API. Is this possible? How c...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

What is the format of the table? If it is Delta, you could use the Python bindings for the native Rust API to read the table from your Python code and run the comparison there, bypassing the metastore.
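As a rough sketch of that suggestion, assuming the `deltalake` package (the Python bindings for the Rust implementation) and pandas are installed: `load_delta_rows` and `rows_differ` are hypothetical helper names, the `key` column is an assumption about the data, and access to cloud storage would normally need credentials passed through `storage_options`.

```python
def load_delta_rows(table_uri, storage_options=None):
    """Read a Delta table into a list of row dicts without a Spark cluster.

    Imported lazily so the comparison helper below works without deltalake.
    """
    from deltalake import DeltaTable  # assumption: pip install deltalake pandas

    table = DeltaTable(table_uri, storage_options=storage_options)
    return table.to_pandas().to_dict("records")


def rows_differ(left_rows, right_rows, key):
    """Return the sorted key values whose rows differ or exist on one side only.

    Note: missing values loaded as NaN compare unequal to each other, so
    they are reported as differences unless normalised first.
    """
    left = {row[key]: row for row in left_rows}
    right = {row[key]: row for row in right_rows}
    return sorted(k for k in left.keys() | right.keys() if left.get(k) != right.get(k))


excel_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
table_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "c"}, {"id": 3, "v": "d"}]
print(rows_differ(excel_rows, table_rows, key="id"))
```

In practice the Excel side could come from `pandas.read_excel(path).to_dict("records")` and the table side from `load_delta_rows(...)`; the hard-coded rows above only stand in for those two inputs.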

User16783853906
by Contributor III
  • 2682 Views
  • 2 replies
  • 0 kudos

How does running VACUUM on Delta Lake tables affect read/write performance?

If I don't run VACUUM on a Delta Lake table, will that make my read performance slower?

Latest Reply
User16783853906
Contributor III
  • 0 kudos

VACUUM has no effect on read/write performance for that table. Never running VACUUM will not make reads from or writes to a Delta Lake table any slower. If you run VACUUM very infrequently, your VACUUM runtimes themselves may be pretty hig...

tourist_on_road
by New Contributor
  • 5583 Views
  • 1 replies
  • 0 kudos

How to read binary data in PySpark

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark.
from io import StringIO
import array
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)
def mapper(featur...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @tourist_on_road, please go through the Spark docs below: https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
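For fixed-length records like the ones `sc.binaryRecords` returns, a decoding step might look like the sketch below. The record layout (an ASCII ID followed by little-endian 32-bit floats) is an assumption for illustration, not something confirmed in the post, and `parse_record` is a hypothetical helper name; check the dataset's own description for the real layout and record length.

```python
import struct


def parse_record(record, id_len, n_floats, byte_order="<"):
    """Decode one fixed-length record into (item_id, list_of_floats).

    Assumed layout: `id_len` ASCII bytes, then `n_floats` 4-byte floats.
    """
    expected = id_len + 4 * n_floats
    if len(record) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(record)}")
    item_id = record[:id_len].decode("ascii", errors="replace").strip()
    values = struct.unpack(f"{byte_order}{n_floats}f", record[id_len:])
    return item_id, list(values)


# Build a tiny synthetic record: 10-byte ID plus three floats.
sample = b"B000000001" + struct.pack("<3f", 1.0, 2.0, 0.5)
print(parse_record(sample, id_len=10, n_floats=3))
```

With Spark, the same function could be applied per record, for example `sc.binaryRecords(path, record_length).map(lambda r: parse_record(r, id_len, n_floats))`, provided `record_length` equals `id_len + 4 * n_floats`. A mismatch between the record length and the true layout shifts every subsequent record, which is why the helper checks the length up front.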
