We have a Structured Streaming job configured to read from Event Hubs and persist to the Delta raw/bronze layer via MERGE inside a foreachBatch. Of late, however, the merge step has been taking longer and longer. How can I optimize this pipeline?
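Roughly, the pipeline looks like the sketch below (the table name, key/partition columns, and paths are placeholders, not our real ones):

from delta.tables import DeltaTable

def upsert_to_bronze(batch_df, batch_id):
    # Placeholder bronze table, partitioned by eventDate with key column eventId;
    # the partition column is included in the MERGE condition so Delta can prune files
    bronze = DeltaTable.forName(spark, "bronze.events")
    (bronze.alias("t")
           .merge(batch_df.alias("s"),
                  "t.eventDate = s.eventDate AND t.eventId = s.eventId")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .format("eventhubs")
      .options(**eh_conf)  # Event Hubs connection options (placeholder dict)
      .load()
      .writeStream
      .foreachBatch(upsert_to_bronze)
      .option("checkpointLocation", "/mnt/checkpoints/bronze_events")  # illustrative path
      .start())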
I'm working on setting up tooling to let team members easily register and load models from a central MLflow Model Registry via Databricks Connect. However, after following the instructions in the public docs, I'm hitting this error: raise _NoDbutilsError
mlfl...
G1GC can help in cases where garbage collection is the bottleneck. Check out https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
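For example, one way to switch the executors to G1GC (the -XX:+UseG1GC flag itself is standard; further JVM tuning is workload-specific, and on Databricks you would normally put this key in the cluster's Spark config rather than in code):

from pyspark.sql import SparkSession

# Ask the executor JVMs to use the G1 collector
spark = (SparkSession.builder
         .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
         .getOrCreate())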
If this is on AWS, consider Nitro-based instance types, which provide encryption in transit between instances automatically. For more details check https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/data-protection.html#encryption-transit
Yes, multiple users can work in their own notebooks and still log to the same experiment via mlflow.set_experiment(). From a governance point of view, you can also assign different permission levels to each experiment.
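For example (the experiment path below is just an illustration), each notebook would do:

import mlflow

# Point this notebook at the shared experiment; runs from every user land under it
mlflow.set_experiment("/Shared/demand-forecasting")

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.82)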
You could mount an S3 bucket in the workspace and save your model using the mount's DBFS path. For example:
modelpath = "/dbfs/my-s3-bucket/model-%f-%f" % (alpha, l1_ratio)
mlflow.sklearn.save_model(lr, modelpath)
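The mount itself would be something along these lines (bucket name and mount point are placeholders; if you mount under /mnt, the local path above becomes /dbfs/mnt/...):

# Hypothetical mount; requires credentials or an instance profile with access to the bucket
dbutils.fs.mount("s3a://my-s3-bucket", "/mnt/my-s3-bucket")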