Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Goal: Import and consolidate GBs / TBs of local data, stored as 20 MB chunked parquet files, into Databricks / Delta Lake / partitioned tables. What I've done: I took a small subset of the data, roughly 72.5 GB, and ingested it using the streaming code below. The data is already seq...
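A common pattern for exactly this kind of bulk small-file ingestion is Auto Loader with an availableNow trigger. A minimal sketch, where the paths, table name, and partition column are placeholder assumptions, not details from the thread:

    # Sketch only: incrementally ingest small parquet chunks into a
    # partitioned Delta table with Auto Loader.
    (spark.readStream
        .format("cloudFiles")                      # Databricks Auto Loader
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events_schema")
        .load("/mnt/raw/events")                   # hypothetical source path
        .writeStream
        .option("checkpointLocation", "/mnt/checkpoints/events")
        .partitionBy("event_date")                 # assumed partition column
        .trigger(availableNow=True)                # drain the backlog, then stop
        .toTable("bronze.events"))                 # hypothetical target table

The availableNow trigger processes all pending files in batches and then stops, which suits a one-off backfill of existing local data better than a continuously running stream.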
Hi @Erik Louie, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Thanks!
I have a pandas-on-Spark dataframe with 8 million rows and 20 columns. It took 3.48 minutes to run df.shape, and df.head is also slow: 4.55 minutes. By contrast, df.var1.value_counts().reset_index() took only 0.18 sec...
This is slow because pandas needs an index column to perform `shape` or `head`. If you don't provide one, pyspark pandas enumerates the entire dataframe to create a default one. For example, given columns A, B, and C in dataframe `d...
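If that enumeration is the bottleneck, one workaround (a sketch, not necessarily the fix chosen in this thread; the table and column names are placeholders) is to avoid the default sequence index, either by designating an existing column as the index or by switching the default index type:

    import pyspark.pandas as ps

    # Use a real column as the index at read time, so no default index
    # has to be generated at all.
    df = ps.read_table("my_table", index_col="id")

    # Alternatively, have pandas-on-Spark build a distributed index,
    # which avoids enumerating the whole dataframe sequentially.
    ps.set_option("compute.default_index_type", "distributed")

The distributed index is monotonically increasing but not consecutive, which is usually acceptable when the index is only needed to satisfy the pandas API.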
I tried to use Spark as much as possible but experienced some regression. Hoping to get some direction on how to use it correctly. I've created a Databricks table using spark.sql:
spark.sql('select * from example_view ') \
    .write \
    .mode('overwr...
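For reference, the usual shape of that write pattern (a sketch; the original snippet is cut off, so the write mode and target table name here are assumptions):

    # Materialize a view into a managed table.
    spark.sql('select * from example_view') \
        .write \
        .mode('overwrite') \
        .saveAsTable('example_table')   # assumed target table name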
Hi @Vincent Doe, updates are available in Delta tables, but under the hood you are updating parquet files. This means that each update needs to find the file where the records are stored, then rewrite the file as a new version and make the new file the current v...
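You can observe this copy-on-write behavior directly. A minimal sketch against a hypothetical Delta table named events (the table, column, and predicate are placeholders):

    # Each UPDATE rewrites the affected parquet files and commits
    # a new version of the table.
    spark.sql("UPDATE events SET status = 'done' WHERE id = 42")

    # The table history shows one new version per update operation,
    # which is why frequent small updates get expensive.
    spark.sql("DESCRIBE HISTORY events").show()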
Along with several other issues I'm encountering, I am finding pandas dataframe to_sql to be very slow. I am writing to an Azure SQL database and performance is woeful. This is a test database; it has S3 (100 DTU) and one user, me, as its configuratio...
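A common mitigation for slow to_sql against Azure SQL (a sketch under the assumption that pyodbc is the driver; this is not necessarily the resolution reached in the thread) is to enable fast_executemany on the SQLAlchemy engine and write in chunks:

    from sqlalchemy import create_engine

    # fast_executemany batches the INSERTs at the ODBC layer instead of
    # sending one round trip per row.
    engine = create_engine(
        "mssql+pyodbc://user:pass@server.database.windows.net/db"
        "?driver=ODBC+Driver+17+for+SQL+Server",   # placeholder connection string
        fast_executemany=True,
    )
    df.to_sql("target_table", engine, if_exists="append",
              index=False, chunksize=10_000)

Without batching, each row becomes a separate network round trip, which dominates on a low-DTU tier like S3.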
Hi @Peter McLarty, does @Debayan Mukherjee's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly? We'd love to hear from you. Thanks!
I want to transform a DF with a simple UDF. Afterwards I want to store the resulting DF in a new table (see code below).
key = "test_key"
schema = StructType([
    StructField("***", StringType(), True),
    StructField("yyy", StringType(), True),
    StructF...
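For the general pattern being described, a minimal sketch; the snippet above is truncated, so the UDF logic, column names, and target table here are placeholder assumptions that reuse `key` and assume a dataframe `df` from the original code:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # A simple UDF; the transformation logic is a placeholder.
    @udf(returnType=StringType())
    def add_key(value):
        return f"{value}_{key}" if value is not None else None

    # Apply the UDF, then persist the resulting DF as a new table.
    result = df.withColumn("keyed", add_key(df["yyy"]))
    result.write.mode("overwrite").saveAsTable("transformed_table")  # assumed name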
Hello @Jan R, hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help. We'd love to hear from you. Thanks!