Databricks Community

Prototype998 · ‎07-13-2022

I'm using PySpark and Pathos to read numerous CSV files and create many DF, but I keep getting this problem.

dbx_error code for the same:-

from pathos.multiprocessing import ProcessingPool

def readCsv(path):

return spark.read.csv(path,header=True)

csv_file_list = [file[0][5:] for file in dbutils.fs.ls("/databricks-datasets/COVID/coronavirusdataset/") if file[1].endswith(".csv")]

pool = ProcessingPool(2)

results = pool.map(readCsv, csv_file_list)

Rishabh-Pandey · ‎12-22-2022

hey @Punit Chauhan refer this code

from multiprocessing.pool import ThreadPool
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/"+path, timeout_seconds= 60, arguments={"input-data": path}),notebooks)

Rishabh Pandey

View solution in original post

AmanSehgal · ‎07-14-2022

You actually don't need to filter `.csv` files like that.

You can use `pathGlobFilter` to do a regex match for selecting files that matches provided regular expression.

df = spark.read.option("pathGlobFilter","*.csv").csv(upload_path)

Vidula · ‎09-04-2022

Hi @Punit Chauhan

Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.

We'd love to hear from you.

Thanks!

Prototype998 · ‎12-22-2022

@Ajay Pandey @Rishabh Pandey

Rishabh-Pandey · ‎12-22-2022

hey @Punit Chauhan refer this code

from multiprocessing.pool import ThreadPool
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/"+path, timeout_seconds= 60, arguments={"input-data": path}),notebooks)

Rishabh Pandey