How can I use regex to clean/normalize user-submitted URLs?

My bookmarking app allows users to submit any URL. I want to clean/normalize these URLs using regex, e.g. strip off query params or any characters that appear before http*.

For example:

https://www.amazon.com/Thursday-Murder-Club-Novel/dp/B086DL5TVZ/ref=sr_1_1?crid=1PPMMYS04059R&dchild=1&keywords=thursday+murder+club&qid=1635454753&sprefix=thursday%2Caps%2C346&sr=8-1

is cleaned to:

https://www.amazon.com/Thursday-Murder-Club-Novel/dp/B086DL5TVZ/ref=sr_1_1

using regex = ([^?]+)(\?.*)?

Bonus: Once I can do this with a single regex, I'd like to chain regexes for more complex cleaning and standardization, so the result of applying regex1 feeds into regex2, etc.
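
To make the chaining idea concrete, here is a minimal sketch in Python. It is only an illustration, not a feature of any particular app: the names CLEANING_RULES and clean_url are hypothetical, and the two rules shown (drop anything before http/https, then drop the query string) are just the two cases mentioned above.

```python
import re

# Hypothetical rule list: each (pattern, replacement) pair is applied in order,
# so the output of one substitution becomes the input of the next.
CLEANING_RULES = [
    # Rule 1: drop any characters that appear before http:// or https://
    (re.compile(r"^.*?(?=https?://)", re.IGNORECASE), ""),
    # Rule 2: drop the query string (the first "?" and everything after it)
    (re.compile(r"\?.*$"), ""),
]


def clean_url(url: str) -> str:
    """Apply every cleaning rule in sequence and return the final string."""
    result = url.strip()
    for pattern, replacement in CLEANING_RULES:
        result = pattern.sub(replacement, result)
    return result


if __name__ == "__main__":
    print(clean_url("link: https://example.com/page?utm_source=x"))
```

Adding another cleaning step would just mean appending another (pattern, replacement) pair to the list.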

If you want to regex it I think you can try this.

If you simply want to remove all the query parameters, then I think you can try using a split text column on the "?" character, then a single value column to retrieve the first part.
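
For comparison, the same split-and-take-the-first-part idea looks roughly like this in Python. This is only a sketch of the logic, not how the columns are implemented; strip_query is a hypothetical helper name.

```python
def strip_query(url: str) -> str:
    """Split on the first "?" and keep only the part before it."""
    # maxsplit=1 means later "?" characters do not produce extra pieces;
    # if there is no "?" at all, the whole string comes back unchanged.
    return url.split("?", 1)[0]


if __name__ == "__main__":
    print(strip_query("https://example.com/page?a=1&b=2"))
```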

Works as expected.
