Thu Sep 19 05:47:01 UTC 2024

## Ripgrep’s Unicode Support: Good, But Could Be Better
Ripgrep, a popular tool for searching text, boasts improved Unicode support thanks to its implementation of UTS#18 Level 1. While this surpasses the capabilities of many other regex engines, it still falls short of the ideal.
The author highlights the issue of ligatures, where letter pairs such as “fi” or “fl” are encoded as a single code point (for example U+FB01 “ﬁ” and U+FB02 “ﬂ”). Ripgrep does not treat these ligatures as equivalent to their component letters, so a search for “fi” can miss text that contains the ligature.
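The mismatch is easy to reproduce outside of any particular search tool. The following is a minimal sketch in Python, not ripgrep's own behavior or code; the sample string is invented for illustration, and it only shows that a ligature code point is a different character from the letters it represents.

```python
import unicodedata

# "file" and "flagged" written with the ligature code points U+FB01 and U+FB02
# (sample text made up for this illustration).
text = "the \ufb01le was \ufb02agged"

# The two-letter ASCII sequence "fi" never occurs in the string,
# so a literal search for it finds nothing.
found_plain = "fi" in text

# Each ligature is a single code point with its own Unicode name.
ligature_names = [unicodedata.name(ch) for ch in text if ord(ch) >= 0xFB00]

print(found_plain)     # False
print(ligature_names)  # ['LATIN SMALL LIGATURE FI', 'LATIN SMALL LIGATURE FL']
```

Any byte-level or code-point-level matcher sees the same thing: without some form of folding or normalization, the pattern and the text simply do not share characters.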
The ideal solution would be to implement UTS#18 section 2.1 (Canonical Equivalents), which addresses canonical equivalence issues. However, the author suggests a more practical approach: introducing a flag that normalizes text before searching, so that equivalent spellings of the same text produce consistent search results.
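A rough idea of what "normalize before searching" means can be sketched in Python. This is an illustration under stated assumptions, not the proposed ripgrep flag: `normalized_search` is a hypothetical helper, and the choice of NFKC is an assumption made here because ligatures such as U+FB01 only decompose under compatibility normalization (NFKC/NFKD), while canonical forms (NFC/NFD) leave them unchanged.

```python
import re
import unicodedata


def normalized_search(pattern, lines, form="NFKC"):
    """Yield (line_number, original_line) for lines whose normalized text matches.

    Hypothetical helper for illustration. Normalizing the pattern string is a
    simplification: it is fine for literal patterns but could alter unusual
    regex syntax.
    """
    regex = re.compile(unicodedata.normalize(form, pattern))
    for number, line in enumerate(lines, start=1):
        # Match against the normalized copy, but report the original line.
        if regex.search(unicodedata.normalize(form, line)):
            yield number, line


sample = [
    "the \ufb01le was \ufb02agged",  # ligatures for "fi" and "fl"
    "nothing here",
    "cafe\u0301 \ufb01lter",         # "e" + combining acute, plus a ligature
]

ligature_hits = [n for n, _ in normalized_search("fi", sample)]
accent_hits = [n for n, _ in normalized_search("caf\u00e9", sample)]

print(ligature_hits)  # [1, 3]
print(accent_hits)    # [3]
```

The second query shows the canonical-equivalence case: a precomposed "é" in the pattern matches the decomposed "e" plus combining accent in the text, because both sides are brought to the same form first. The extra pass over every line is also where the performance cost comes from.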
While normalization offers a potential solution, it comes with a performance cost. The author acknowledges that normalization would slow down Ripgrep significantly, potentially rendering it slower than even a basic grep tool written in Python.
The article also delves into the implications of ligatures in various contexts, including web pages, PDFs, and mobile browsers. It examines the use of ligatures in text editors and discusses existing solutions for handling them, including shell functions and the “rga-fzf” command.
Finally, the author raises concerns about the security implications of using external code for ligature handling and suggests the need for a secure design that prioritizes safe handling of external code execution over accuracy.
This insightful article highlights the challenges of handling Unicode in text search tools and proposes potential solutions for improving Ripgrep’s capabilities.