Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Finding multiple substrings from a DataFrame column dynamically?

Tejas1987
New Contributor II

Hello friends,

I have a DataFrame whose text column contains SQL query strings, and I am trying to extract specific substrings (the LIKE pattern codes) from each of them.

   

I/P -

|ID | text |
|:--|:------|
|1 | select distinct Col1 as OrderID from Table1 WHERE ( (Col3 Like '%ABC%') OR (Col3 Like '%DEF%') OR (Col3 Like '%EFG%'))|
|2 | select distinct Col1 as OrderID from Table1 WHERE (Col2 = 1234) AND( (Col3 Like '%XYZ%') OR (Col3 Like '%PQR%'))|
|3 | select distinct Col1 as OrderID from Table1 WHERE ( (Col3 Like '%MNO%') OR (Col3 Like '%PQR%'))|
|4 | select distinct Col1 as OrderID, 1 as Duration, 1 as Load from Table1 WHERE (Col3 Like '%PQR%')|
|5 | select distinct Col1 as OrderID from Table1 WHERE ( (Col4 = 'AA') OR (Col4 = 'BB') OR (Col4 = 'CC') OR (Col4 = 'DD')) AND (Col3 Like '%XYZ%')|
|6 | select distinct Col1 as OrderID from Table1 WHERE (Col1 = 1234)|

O/P -

|ID | text_codes |
|:--|:------------|
|1 | ['ABC', 'DEF', 'EFG'] |
|2 | ['XYZ', 'PQR'] |
|3 | ['MNO', 'PQR'] |
|4 | ['PQR'] |
|5 | ['XYZ'] |
|6 | [] |
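For what it's worth, this whole extraction can also be done in one pass with a regular expression; a minimal sketch, assuming every wanted code sits between the '%' markers of a Col3 Like clause (the helper name is illustrative, not from the thread):

```python
import re

def extract_codes(query):
    # Capture whatever sits between '%...%' in each "Col3 Like" clause
    return re.findall(r"Col3\s+Like\s+'%([^%]+)%'", query, flags=re.IGNORECASE)

extract_codes("select distinct Col1 as OrderID from Table1 "
              "WHERE ( (Col3 Like '%ABC%') OR (Col3 Like '%DEF%') OR (Col3 Like '%EFG%'))")
```

Row 5 is handled correctly because the pattern only matches Col3 clauses, and row 6 simply yields an empty list.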

Steps I have tried:

I got the index of each code based on the "Col3" keyword plus a fixed character offset, so the intermediate output is

O/P -

|ID | codes_idx |
|:--|:----------|
|1 | 71;94;117 |
|2 | 88;111 |
|3 | 71;94 |
|4 | 95 |
|5 | 141 |
|6 | |

Code for it - 

    def toString(x):
        # Join the collected positions into a semicolon-separated string
        return ';'.join(str(i) for i in x)

    def getCodes(str1):
        findStr = '(Col3 Like '
        # Record a fixed offset (+14) from every position where the pattern starts
        res = [i + 14 for i in range(len(str1)) if str1.startswith(findStr, i)]
        return toString(res)

    result = df2_query1.rdd.map(lambda x: (x[0], getCodes(x[0])))
    result.collect()
    df_result = spark.createDataFrame(result)
 

But I ran into trouble when I tried to apply a substring function at those indexes:

    res_list = []
    for i in range(len(str1)):
        idxi = str1.startswith(findStr, i)
        res_str = substring(str1, idxi + 14, 3)
        res_list.append(res_str)
 

I get an AttributeError:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 64.0 failed 4 times, most recent failure: Lost task 0.3 in stage 64.0 (TID 118) (10.139.64.5 executor 1): org.apache.spark.api.python.PythonException: 'AttributeError: 'NoneType' object has no attribute '_jvm'', from <command-1021210642233982>, line 9. Full traceback below:

I am not sure what I am doing wrong here. Any help is appreciated.
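A likely cause of that traceback (my reading; the thread does not spell it out): substring from pyspark.sql.functions builds a JVM-backed Column expression, so calling it inside plain Python code that runs on executors, as rdd.map does, fails because no SparkContext/JVM handle exists there, hence 'NoneType' object has no attribute '_jvm'. On an ordinary Python string, plain slicing does the same job; a minimal sketch (the helper name is mine):

```python
def substring_py(s, pos, length):
    # Mimic SQL-style substring(s, pos, length): pos is 1-based,
    # like the Spark SQL function, while Python slicing is 0-based
    return s[pos - 1 : pos - 1 + length]

substring_py("(Col3 Like '%ABC%')", 14, 3)  # returns 'ABC'
```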

Thanks,

Tejas

1 ACCEPTED SOLUTION


AmanSehgal
Honored Contributor III

What is the logic of your substring function?

Can't you use plain Python slicing, e.g. str1[idxi+14 : idxi+17], for the substring?


2 REPLIES 2

AmanSehgal
Honored Contributor III

What is the logic of your substring function?

Can't you use plain Python slicing, e.g. str1[idxi+14 : idxi+17], for the substring?

Hello Melbourne,

It worked with the expression you suggested. I just had to make small changes to the final solution, and it gives me the output I wanted.

I used

str1[idxi+14 : idxi+17] as the expression to get the specific values in the function, and applied map to get the necessary output.
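Putting the pieces together, the whole extraction can be sketched with plain slicing like this. Note that I anchor on the full "(Col3 Like '%" prefix so the offset is simply its length rather than a hard-coded +14, and I assume three-character codes as in the examples above; the helper name is mine:

```python
def get_code_list(str1, find_str="(Col3 Like '%", code_len=3):
    # Each code starts right after the search pattern, so shift by its length
    offset = len(find_str)
    return [str1[i + offset : i + offset + code_len]
            for i in range(len(str1)) if str1.startswith(find_str, i)]

get_code_list("select distinct Col1 as OrderID from Table1 "
              "WHERE ( (Col3 Like '%ABC%') OR (Col3 Like '%DEF%') OR (Col3 Like '%EFG%'))")
```

Since this is pure Python, it is safe to ship to executors, e.g. via df.rdd.map(lambda x: (x[0], get_code_list(x[1]))).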

Thank you.

Regards,

Tejas Parlikar
