I recently started exploring Data Engineering and have hit some difficulties. I have a GCS bucket with millions of Parquet files, and I want to build an anomaly detection model from them. My plan is to ingest that data into Databricks with Auto Loader, store it all in a Delta table, and then use H2O Sparkling Water for the anomaly detection model. What do you recommend?
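For context, the ingestion step looks roughly like this (the bucket, schema, and checkpoint paths are placeholders):

# Auto Loader: incrementally ingest the Parquet files from GCS into a Delta table
stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "path/to/schema")  # where the inferred schema is tracked
    .load("gs://my-bucket/raw/"))

(stream.writeStream
    .format("delta")
    .option("checkpointLocation", "path/to/checkpoint")
    .trigger(availableNow=True)  # drain the backlog, then stop (trigger(once=True) on older runtimes)
    .start("path/to/delta_table"))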
I have run into memory issues when I read the Delta table into a PySpark DataFrame and try to convert it to an H2OFrame.
I'm using a cluster with 30 GB of memory and 4 cores, and the DataFrame has shape (30448674, 213). Back-of-the-envelope, that is about 30.4M × 213 ≈ 6.5 billion cells, on the order of 50 GB if fully materialized at 8 bytes per cell, which is well beyond the cluster's 30 GB. The error I'm getting is: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be reattached automatically." In the Ganglia UI, the driver's graph stays red the whole time asH2OFrame() is executing.
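To isolate the conversion step, a smaller sample can be converted first; a minimal sketch, reusing the hc and df from the code below (the 1% fraction is an arbitrary placeholder):

# Convert only a 1% sample to test whether sheer size is what kills the driver
sample_df = df.sample(fraction=0.01, seed=42)
sample_h2o = hc.asH2OFrame(sample_df)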
This is the code I am using:
# Imports
import h2o
from pysparkling import *
from pyspark.sql import SparkSession
from h2o.estimators import H2OIsolationForestEstimator
# Create an H2OContext (starts the H2O cluster on top of Spark)
hc = H2OContext.getOrCreate(spark)
# Read the data from the Delta table into a PySpark DataFrame
df = spark.read.format("delta").load("path/to/delta_table")
# Convert the PySpark DataFrame to an H2OFrame (this is the step where the driver dies)
h2o_df = hc.asH2OFrame(df)
# H2O isolation forest model for anomaly detection (hyperparameters are examples)
isolation_model = H2OIsolationForestEstimator(model_id="isolation_forest", ntrees=100, seed=1234)
isolation_model.train(training_frame=h2o_df)
# Score every row with an anomaly score
predictions = isolation_model.predict(h2o_df)
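The last step I had planned was to bring the scores back into Spark and persist them; a rough sketch (the output path is a placeholder):

# Convert the H2O predictions frame back to a PySpark DataFrame
predictions_df = hc.asSparkFrame(predictions)
# Persist the anomaly scores as a Delta table
predictions_df.write.format("delta").mode("overwrite").save("path/to/predictions")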