Data Engineering

Finding multiple substrings from a DataFrame column dynamically?

Tejas1987
New Contributor II

Hello friends,

I have a DataFrame whose text column holds SQL query strings, and I am trying to extract specific values out of each one.

I/P -

|ID | text |
|:--|:-----|
|1 | select distinct Col1 as OrderID from Table1 WHERE ( (Col3 Like '%ABC%') OR (Col3 Like '%DEF%') OR (Col3 Like '%EFG%')) |
|2 | select distinct Col1 as OrderID from Table1 WHERE (Col2 = 1234) AND ( (Col3 Like '%XYZ%') OR (Col3 Like '%PQR%')) |
|3 | select distinct Col1 as OrderID from Table1 WHERE ( (Col3 Like '%MNO%') OR (Col3 Like '%PQR%')) |
|4 | select distinct Col1 as OrderID, 1 as Duration, 1 as Load from Table1 WHERE (Col3 Like '%PQR%') |
|5 | select distinct Col1 as OrderID from Table1 WHERE ( (Col4 = 'AA') OR (Col4 = 'BB') OR (Col4 = 'CC') OR (Col4 = 'DD')) AND (Col3 Like '%XYZ%') |
|6 | select distinct Col1 as OrderID from Table1 WHERE (Col1 = 1234) |

O/P -

|ID | text_codes |
|:--|:-----------|
|1 | ['ABC', 'DEF', 'EFG'] |
|2 | ['XYZ', 'PQR'] |
|3 | ['MNO', 'PQR'] |
|4 | ['PQR'] |
|5 | ['XYZ'] |
|6 | [] |
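One way to get this output directly (a sketch, not the approach the thread settles on): each wanted code is the run of letters between the % signs of a "Col3 Like '%...%'" predicate, so a single regular expression per row suffices:

    import re

    # Hypothetical helper: pulls every code out of a "Col3 Like '%...%'" predicate.
    CODE_RE = re.compile(r"Col3 Like '%([A-Za-z]+)%'")

    def extract_codes(query):
        return CODE_RE.findall(query)

    q1 = ("select distinct Col1 as OrderID from Table1 WHERE "
          "( (Col3 Like '%ABC%') OR (Col3 Like '%DEF%') OR (Col3 Like '%EFG%'))")
    print(extract_codes(q1))  # ['ABC', 'DEF', 'EFG']

On a Spark DataFrame this could be wrapped in a UDF returning ArrayType(StringType()), or, on Spark 3.1+, expressed with the built-in regexp_extract_all SQL function.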

Steps I have tried -

I first computed the index of each code by locating the "Col3" keyword and adding a fixed offset, so the intermediate output is

O/P -

|ID | codes_idx |
|:--|:----------|
|1 | 71;94;117 |
|2 | 88;111 |
|3 | 71;94 |
|4 | 95 |
|5 | 141 |
|6 |  |

Code for it - 

    def toString(x):
        # join the index list into a ';'-separated string
        return ';'.join(str(i) for i in x)

    def getCodes(str1):
        # each occurrence of the predicate prefix marks one code;
        # +14 lands on the code with this data's spacing
        findStr = '(Col3 Like '
        res = [i + 14 for i in range(len(str1)) if str1.startswith(findStr, i)]
        return toString(res)

    result = df2_query1.rdd.map(lambda x: (x[0], getCodes(x[0])))
    result.collect()  # inspect the mapped rows
    df_result = spark.createDataFrame(result)
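The index step can be sanity-checked locally without a cluster. Absolute positions depend on the exact stored text (the numbers in the table above come from the poster's data), but consecutive matches within one query are 23 characters apart, which the 71;94;117 row also shows:

    findStr = '(Col3 Like '

    def getCodeIndexes(str1):
        # same offset arithmetic as getCodes above, kept as a list of ints
        return [i + 14 for i in range(len(str1)) if str1.startswith(findStr, i)]

    clause = "WHERE ( (Col3 Like '%ABC%') OR (Col3 Like '%DEF%'))"
    print(getCodeIndexes(clause))  # [22, 45] -- 23 apart, like 71;94;117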
 

But then, when I tried to apply a substring function at each index:

    res_list = []

    for i in range(len(str1)):
        idxi = str1.startswith(findStr, i)
        res_str = substring(str1, idxi + 14, 3)
        res_list.append(res_str)
 

I get an AttributeError -

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 64.0 failed 4 times, most recent failure: Lost task 0.3 in stage 64.0 (TID 118) (10.139.64.5 executor 1): org.apache.spark.api.python.PythonException: 'AttributeError: 'NoneType' object has no attribute '_jvm'', from <command-1021210642233982>, line 9. Full traceback below:

I am not sure what I am doing wrong here. Any help is appreciated.

Thanks,

Tejas

1 ACCEPTED SOLUTION


AmanSehgal
Honored Contributor III

What is the logic for substring function?

Can't you use str1[idxi+14:3] for substring?
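A quick illustration of the slice form, assuming the codes are always 3 characters (true of the sample data). Note that a Python slice takes an absolute stop index, not a length, so the stop needs to be idxi + 17 rather than 3:

    s = "(Col3 Like '%ABC%')"
    idx = s.index("'%") + 2      # first letter of the code
    code = s[idx : idx + 3]      # slice stop is an index, not a length
    print(code)  # ABC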


2 REPLIES 2


Hello Melbourne,

It worked with the expression you suggested. I just had to make a few small changes for the final solution, and it gives me the output I wanted.

I used -

str1[idxi + 14 : idxi + 17] as my expression to get the specific values in the function, and then applied map to get the necessary output.
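Assembled into one per-row helper (a sketch; the +14/+17 offsets in the thread were measured against the poster's stored text, so here the offset is derived from findStr to keep the example self-contained):

    findStr = '(Col3 Like '
    off = len(findStr) + 2   # step over the quote and the % to the first letter

    def get_codes(str1):
        return [str1[i + off : i + off + 3]
                for i in range(len(str1))
                if str1.startswith(findStr, i)]

    q3 = ("select distinct Col1 as OrderID from Table1 WHERE "
          "( (Col3 Like '%MNO%') OR (Col3 Like '%PQR%'))")
    print(get_codes(q3))  # ['MNO', 'PQR']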

Thank you.

Regards,

Tejas Parlikar
