Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

sarvesh
by Contributor III
  • 1316 Views
  • 1 reply
  • 3 kudos

Audit Vertica tables in Spark!

I am trying to use Vertica's AUDIT function from Spark and I do get the correct table size from it, but the smallest unit AUDIT can report is bytes, while we receive data in bits, smaller even than a byte. val size = f"select audit('table_name');"

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 3 kudos

Rather, everything will be in bytes. Spark SQL has built-in methods to get the table size, but also in bytes:
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
spark.sql("DESCRIBE EXTENDED df").filter(col("col_name") === "Statistics").show(false)
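A minimal runnable sketch of the reply's approach, assuming a Databricks Scala notebook where spark is predefined and a table named df already exists:

import org.apache.spark.sql.functions.col

// Compute table statistics without scanning the data; the size is recorded in bytes.
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")

// Read the size back out of the table metadata.
spark.sql("DESCRIBE EXTENDED df")
  .filter(col("col_name") === "Statistics")
  .show(false)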

sarvesh
by Contributor III
  • 5766 Views
  • 4 replies
  • 3 kudos

Read percentage values in Spark (no casting)

I have an xlsx file which has a single column, percentage: 30% 40% 50% -10% 0.00% 0% 0.10% 110% 99.99% 99.98% -99.99% -99.98%. When I read this using Apache Spark, the output I get is:
|percentage|
+----------+
|       0.3|
|       0.4|
|       0.5|
|      -0.1|
|       0.0|
| ...

Latest Reply
-werners-
Esteemed Contributor III
  • 3 kudos

Affirmative. This is how Excel stores percentages; what you see is just cell formatting. Databricks notebooks do not (yet?) have the possibility to format the output. But it is easy to use a BI tool on top of Databricks, where you can change the for...
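Since Excel stores 30% as the number 0.3, one way to restore a percent-style display is to re-apply the formatting in Spark itself. A hedged sketch (the column name and sample values mirror the question; this is not the thread's accepted answer):

import org.apache.spark.sql.functions.{col, concat, lit, round}
import spark.implicits._

// Values as Spark reads them from Excel: 30% arrives as 0.3.
val df = Seq(0.3, 0.4, 0.5, -0.1, 0.9999).toDF("percentage")

// Rebuild the percent formatting that Excel applied as cell formatting.
val formatted = df.withColumn(
  "display",
  concat(round(col("percentage") * 100, 2).cast("string"), lit("%"))
)
formatted.show()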

3 More Replies
brickster_2018
by Databricks Employee
  • 1629 Views
  • 1 reply
  • 0 kudos
Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 0 kudos

This is a list of configuration keys to enable or alter the blacklist mechanism:
  • spark.blacklist.enabled – set to true
  • spark.blacklist.task.maxTaskAttemptsPerExecutor (1 by default)
  • spark.blacklist.task.maxTaskAttemptsPerNode (2 by default)
  • spark.blacklis...
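These keys must be set before the session starts; on Databricks they would normally go into the cluster's Spark config. A sketch applying the keys listed in the reply (note that Spark 3.1+ renames this family to spark.excludeOnFailure.*):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("blacklist-example")
  .config("spark.blacklist.enabled", "true")
  .config("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1") // default: 1
  .config("spark.blacklist.task.maxTaskAttemptsPerNode", "2")     // default: 2
  .getOrCreate()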

sarvesh
by Contributor III
  • 4507 Views
  • 0 replies
  • 0 kudos

Can we read an Excel file with many sheets using their indexes?

I am trying to read an Excel file which has 3 sheets whose names are integers: sheet 1 name = 21, sheet 2 name = 24, sheet 3 name = 224. I got this data from a user, so I can't change the sheet names, but reading these with Spark is an issue. code -v...
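One possible approach, assuming the third-party spark-excel library (com.crealytics:spark-excel) is attached to the cluster; its dataAddress option addresses a sheet by name, so a numeric name like 21 can simply be quoted (file path is hypothetical):

// Read the sheet whose name is the integer 21.
val df21 = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'21'!A1")
  .option("header", "true")
  .load("/path/to/file.xlsx")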

sarvesh
by Contributor III
  • 7873 Views
  • 9 replies
  • 8 kudos

Resolved! Getting null values in place of data that was removed manually from an Excel file (solved)

I was reading an Excel file with one column, country: india, India, india, India, india. The dataframe I got from this data, df.show():
+-------+
|country|
+-------+
| india |
| India |
| india |
| India |
| india |
+-------+
In the next step I removed the last value ...
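For anyone hitting the same behavior: the deleted cell comes back as null, so it can be inspected or dropped after reading. A small sketch, assuming df is the dataframe from the question:

import org.apache.spark.sql.functions.col

// Inspect the rows whose cell was deleted in Excel...
df.filter(col("country").isNull).show()

// ...or drop them outright.
val cleaned = df.na.drop(Seq("country"))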

Latest Reply
Anonymous
Not applicable
  • 8 kudos

@sarvesh singh - Thank you for letting us know. Would you be happy to mark the best answer so others can find the solution easily?

8 More Replies
sarvesh
by Contributor III
  • 4285 Views
  • 5 replies
  • 8 kudos

Catch rejected data (rows) while reading with Apache Spark.

I work with Spark-Scala and receive data in different formats (.csv/.xlsx/.txt etc.). When I try to read/write this data from different sources to any database, many records get rejected due to various issues like special characters, data type ...
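One common Spark technique for this (not necessarily the thread's accepted answer) is PERMISSIVE mode with a corrupt-record column, which captures rejected rows instead of silently dropping them; the schema, column names, and path below are hypothetical:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Schema with an extra column that collects unparseable rows.
val schema = StructType(Seq(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("_corrupt_record", StringType, true)
))

val df = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/path/to/input.csv")

df.cache() // some Spark versions require caching before querying the corrupt column

// Rejected rows land here instead of disappearing.
val rejected = df.filter(col("_corrupt_record").isNotNull)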

Latest Reply
-werners-
Esteemed Contributor III
  • 8 kudos

Or maybe schema evolution on Delta Lake is enough, in combination with Hubert's answer.

4 More Replies
Constantine
by Contributor III
  • 11296 Views
  • 4 replies
  • 4 kudos

Resolved! How does Spark do lazy evaluation?

For context, I am running Spark on the Databricks platform and using Delta Tables (S3). Let's assume we have a table called table_one. I create a view called view_one using the table and then query view_one. Next, I create another view, called view_two, based o...

Latest Reply
jose_gonzalez
Databricks Employee
  • 4 kudos

Hi @John Constantine, the following notebook URL will help you better understand the difference between lazy transformations and actions in Spark. You will be able to compare the physical query plans and understand better what is going on when you e...
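A tiny self-contained illustration of the lazy/eager split (table and column names are assumptions, echoing the question):

import org.apache.spark.sql.functions.col

// Transformations only build a logical plan; nothing executes yet.
val viewOne = spark.table("table_one").filter(col("amount") > 0) // lazy
val viewTwo = viewOne.select("id", "amount")                     // still lazy

viewTwo.explain(true) // prints the parsed, analyzed, optimized, and physical plans
viewTwo.count()       // an action: only now does Spark run a job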

3 More Replies
yitao
by New Contributor III
  • 3610 Views
  • 4 replies
  • 10 kudos

Resolved! How to make sparklyr extensions work with the Databricks runtime?

Hello. I'm the current maintainer of sparklyr (an R interface for Apache Spark) and a few sparklyr extensions such as sparklyr.flint. Sparklyr was fortunate to receive some contributions from Databricks folks, which enabled R users to run `spark_connect...

Latest Reply
Dan_Z
Databricks Employee
  • 10 kudos

Yes, as Sebastian said. Also, it would be good to know what the error is here. One possible explanation is that the JARs are not copied to the executor nodes. This would be solved by Sebastian's suggestion.

3 More Replies
Nazar
by New Contributor II
  • 6404 Views
  • 3 replies
  • 4 kudos

Resolved! Incremental write

Hi all, I have a daily Spark job that reads and joins 3-4 source tables and writes the df in parquet format. This data frame consists of 100+ columns. As this job runs daily, our deduplication logic identifies the latest record from each of the source t...
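A common pattern for this on Delta tables (a hedged sketch, not necessarily the answer accepted in this thread) is an incremental MERGE upsert; latestDf stands in for the deduplicated daily batch, and the path and key column are hypothetical:

import io.delta.tables.DeltaTable

val target = DeltaTable.forPath(spark, "/delta/target_table")

target.as("t")
  .merge(latestDf.as("s"), "t.id = s.id") // upsert only the latest records
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()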

Latest Reply
Nazar
New Contributor II
  • 4 kudos

Thanks werners

2 More Replies
Ougagagoubu
by New Contributor
  • 1236 Views
  • 0 replies
  • 0 kudos

File bug in DBFS? Cannot remove a file (table) nor create it in the Apache Spark (TM) SQL for Data Analysts Coursera course from Unit 6.2 onwards.

Hello, as the title already suggests, I'm not able to remove a file via the shell (%sh rm -f "path") nor continue the notebook from 6.2 onwards (6.3 etc.) inside Databricks. I'm using the Databricks Community Edition. While the error message is clear: "...
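A likely cause worth noting: %sh operates on the driver's local filesystem, so it cannot remove DBFS-backed files. For DBFS paths, dbutils works instead (path below is hypothetical):

// Remove a DBFS directory recursively from a notebook.
dbutils.fs.rm("dbfs:/user/hive/warehouse/my_table", recurse = true)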

brickster_2018
by Databricks Employee
  • 2790 Views
  • 1 reply
  • 1 kudos

Resolved! How to run commands on the executor

Using %sh, I am able to run commands on the notebook and get output. How can I run a command on the executor and get the output? I want to avoid using the Spark APIs.

Latest Reply
brickster_2018
Databricks Employee
  • 1 kudos

It's not possible to use %sh to run commands on the executor. The below code can be used to run commands on the executor and get the output:
var res = sc.runOnEachExecutor[String]({ () =>
  import sys.process._
  var cmd_Result = Seq("bash", "-c", "h...
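For open-source Spark, where no such helper is available, a comparable sketch runs a shell command inside mapPartitions so it executes on the executors (command and parallelism are illustrative):

import sys.process._

// Run a shell command once per partition on the executors and
// collect the output back to the driver.
val hostnames = sc
  .parallelize(1 to sc.defaultParallelism, sc.defaultParallelism)
  .mapPartitions { _ =>
    val out = Seq("bash", "-c", "hostname").!! // executes on the executor
    Iterator(out.trim)
  }
  .collect()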

brickster_2018
by Databricks Employee
  • 4410 Views
  • 1 reply
  • 0 kudos
Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

Off-heap memory is managed outside the executor JVM. Spark has native support for off-heap memory: it is managed by Spark rather than the executor JVM, so GC cycles on the executor do not clean it up. Databr...
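Off-heap usage is controlled by two standard Spark settings; the size value below is illustrative, and both must be set before the session starts:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g") // illustrative size
  .getOrCreate()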
