- 3009 Views
- 1 replies
- 3 kudos
I'm new to Spark and trying to understand how some of its components work. I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, within their own memory (RAM). But I'm wondering whether the initial read of the data from storage also happens in parallel across the nodes.
Latest Reply
@Narek Margaryan, normally the reading is done in parallel because the underlying file system is already distributed (if you use HDFS-based storage or something similar, e.g. a data lake). The number of partitions in the file itself also matters. This l...
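To make the parallelism concrete, here is a minimal PySpark sketch; the dataset path and format are assumptions, and a SparkSession named `spark` is assumed to exist. It only illustrates that a distributed read produces multiple partitions, each processed as a separate task in an executor's RAM:

```python
# Minimal sketch, assuming a SparkSession named `spark` and a hypothetical
# Parquet dataset path on a distributed file system.
df = spark.read.parquet("hdfs:///data/events")

# Each partition becomes a task that an executor processes in its own RAM;
# the count below is how many such parallel units the read produced.
print(df.rdd.getNumPartitions())

# Repartitioning changes the degree of parallelism for later stages.
df = df.repartition(64)
```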
- 1880 Views
- 1 replies
- 0 kudos
Using Python 3, I am trying to compare an Excel (xlsx) sheet to an identical Spark table in Databricks. I want to avoid doing the compare in Databricks, so I am looking for a way to read the Spark table via the Databricks API. Is this possible? How can this be done?
Latest Reply
What is the format of the table? If it is Delta, you could use the Python bindings for the native Rust API (delta-rs) to read the table from your Python code and do the compare, bypassing the metastore.
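As a rough sketch, the `deltalake` package (the Python bindings for delta-rs) can read the table files directly from storage. The table URI, Excel file name, and sheet name below are assumptions:

```python
# Rough sketch using the `deltalake` package (pip install deltalake pandas).
# The table URI, Excel file name, and sheet name are assumptions.
from deltalake import DeltaTable
import pandas as pd

# Read the Delta table straight from storage, bypassing the metastore.
dt = DeltaTable("s3://bucket/path/to/table")
table_df = dt.to_pandas()

# Load the Excel sheet and compare outside Databricks.
# Note: .equals() is strict about dtypes, column order, and row order;
# sort and cast both frames first if needed.
excel_df = pd.read_excel("report.xlsx", sheet_name="Sheet1")
print(table_df.equals(excel_df))
```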
- 3313 Views
- 2 replies
- 0 kudos
If I don't run VACUUM on a Delta Lake table, will that make my read performance slower?
Latest Reply
VACUUM has no effect on read/write performance; never running it will not make reads or writes to a Delta Lake table any slower. If you run VACUUM very infrequently, though, the VACUUM runs themselves may take a long time, since more unreferenced files accumulate between runs.
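For reference, a minimal sketch of running VACUUM from PySpark; the table name and retention window are assumptions, and a Delta-enabled SparkSession named `spark` is assumed:

```python
# Minimal sketch, assuming a Delta-enabled SparkSession and a hypothetical
# table name. VACUUM only removes files no longer referenced by the table
# and older than the retention window; it does not touch live data files.
spark.sql("VACUUM my_delta_table RETAIN 168 HOURS DRY RUN")  # list candidates
spark.sql("VACUUM my_delta_table RETAIN 168 HOURS")          # actually delete
```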
- 6413 Views
- 1 replies
- 0 kudos
I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark.

from io import StringIO
import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(featur...
Latest Reply
Hi @tourist_on_road, please go through the Spark docs: https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
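For illustration, a hedged sketch of parsing fixed-length records with sc.binaryRecords. The field layout (a 10-byte ASIN followed by 1024 little-endian floats, i.e. 10 + 1024*4 = 4106 bytes, matching the record length in the question) is an assumption, not the file's documented format:

```python
# Hedged sketch: sc.binaryRecords splits a flat binary file into fixed-length
# byte strings, one per record. The layout below is an assumption chosen so
# the fields add up to the 4106-byte record length from the question.
import struct

RECORD_LEN = 4106
records = sc.binaryRecords("s3://bucket/image_features.b", RECORD_LEN)

def mapper(record):
    # Hypothetical layout: 10-byte id, then 1024 little-endian float32 values.
    asin = record[:10].decode("ascii", errors="ignore").strip()
    features = struct.unpack("<1024f", record[10:])
    return asin, features

parsed = records.map(mapper)
print(parsed.take(1))
```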