- 3009 Views
- 1 replies
- 3 kudos
I'm new to Spark and trying to understand how some of its components work. I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, within their own memory (RAM). But I'm wondering whether the initial read of the data from storage also happens in parallel across the nodes.
Latest Reply
@Narek Margaryan, normally the reading is done in parallel because the underlying file system is already distributed (if you use HDFS-based storage or something similar, e.g. a data lake). The number of partitions in the file itself also matters. This l...
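To make the parallelism concrete, here is a minimal PySpark sketch; the dataset path and format are assumptions, and a SparkSession named `spark` is assumed to exist. It only illustrates that a distributed read produces multiple partitions, each processed as a separate task in an executor's RAM:

```python
# Minimal sketch, assuming a SparkSession named `spark` and a hypothetical
# Parquet dataset path on a distributed file system.
df = spark.read.parquet("hdfs:///data/events")

# Each partition becomes a task that an executor processes in its own RAM;
# the count below is how many such parallel units the read produced.
print(df.rdd.getNumPartitions())

# Repartitioning changes the degree of parallelism for later stages.
df = df.repartition(64)
```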
- 1880 Views
- 1 replies
- 0 kudos
Using Python 3, I am trying to compare an Excel (xlsx) sheet to an identical Spark table in Databricks. I want to avoid doing the compare in Databricks, so I am looking for a way to read the Spark table via the Databricks API. Is this possible? How can this be done?
Latest Reply
What is the format of the table? If it is Delta, you could use the Python bindings for the native Rust API (delta-rs) to read the table from your Python code and do the compare, bypassing the metastore.
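As a rough sketch, the `deltalake` package (the Python bindings for delta-rs) can read the table files directly from storage. The table URI, Excel file name, and sheet name below are assumptions:

```python
# Rough sketch using the `deltalake` package (pip install deltalake pandas).
# The table URI, Excel file name, and sheet name are assumptions.
from deltalake import DeltaTable
import pandas as pd

# Read the Delta table straight from storage, bypassing the metastore.
dt = DeltaTable("s3://bucket/path/to/table")
table_df = dt.to_pandas()

# Load the Excel sheet and compare outside Databricks.
# Note: .equals() is strict about dtypes, column order, and row order;
# sort and cast both frames first if needed.
excel_df = pd.read_excel("report.xlsx", sheet_name="Sheet1")
print(table_df.equals(excel_df))
```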
- 3313 Views
- 2 replies
- 0 kudos
If I don't run VACUUM on a Delta Lake table, will that make my read performance slower?
Latest Reply
VACUUM has no effect on read/write performance; never running it will not make reads or writes to a Delta Lake table any slower. If you run VACUUM very infrequently, though, the VACUUM runs themselves may take a long time, since more unreferenced files accumulate between runs.
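For reference, a minimal sketch of running VACUUM from PySpark; the table name and retention window are assumptions, and a Delta-enabled SparkSession named `spark` is assumed:

```python
# Minimal sketch, assuming a Delta-enabled SparkSession and a hypothetical
# table name. VACUUM only removes files no longer referenced by the table
# and older than the retention window; it does not touch live data files.
spark.sql("VACUUM my_delta_table RETAIN 168 HOURS DRY RUN")  # list candidates
spark.sql("VACUUM my_delta_table RETAIN 168 HOURS")          # actually delete
```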
- 6413 Views
- 1 replies
- 0 kudos
I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark.

from io import StringIO
import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(featur...
Latest Reply
Hi @tourist_on_road, please go through the Spark docs: https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
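For illustration, a hedged sketch of parsing fixed-length records with sc.binaryRecords. The field layout (a 10-byte ASIN followed by 1024 little-endian floats, i.e. 10 + 1024*4 = 4106 bytes, matching the record length in the question) is an assumption, not the file's documented format:

```python
# Hedged sketch: sc.binaryRecords splits a flat binary file into fixed-length
# byte strings, one per record. The layout below is an assumption chosen so
# the fields add up to the 4106-byte record length from the question.
import struct

RECORD_LEN = 4106
records = sc.binaryRecords("s3://bucket/image_features.b", RECORD_LEN)

def mapper(record):
    # Hypothetical layout: 10-byte id, then 1024 little-endian float32 values.
    asin = record[:10].decode("ascii", errors="ignore").strip()
    features = struct.unpack("<1024f", record[10:])
    return asin, features

parsed = records.map(mapper)
print(parsed.take(1))
```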