When working with non-US customers, you will often find characters such as ë, ó and ç in users' names. Most of the time, a ‘human process’ converts these to their plain equivalents (e, o and c) for use in computerized systems.
When I searched for a mapping of special characters to ‘safe’ characters, I had a hard time finding a good list or PowerShell method that automatically converts special characters to standard A-Z characters, so I wrote one:
function get-sanitizedUTF8Input{
    Param(
        [String]$inputString
    )
    #lookup table: special character -> plain replacement
    $replaceTable = @{"ß"="ss";"à"="a";"á"="a";"â"="a";"ã"="a";"ä"="a";"å"="a";"æ"="ae";"ç"="c";"è"="e";"é"="e";"ê"="e";"ë"="e";"ì"="i";"í"="i";"î"="i";"ï"="i";"ð"="d";"ñ"="n";"ò"="o";"ó"="o";"ô"="o";"õ"="o";"ö"="o";"ø"="o";"ù"="u";"ú"="u";"û"="u";"ü"="u";"ý"="y";"þ"="p";"ÿ"="y"}
    #-replace is case-insensitive, so uppercase variants (e.g. "Ö") are matched too and come out lowercase
    foreach($key in $replaceTable.Keys){
        $inputString = $inputString -replace $key, $replaceTable[$key]
    }
    #remove everything that is still not a plain letter or digit (including spaces)
    $inputString = $inputString -replace '[^a-zA-Z0-9]', ''
    return $inputString
}
#example usage:
get-sanitizedUTF8Input -inputString "Jösè"
#result:
Jose
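One thing to keep in mind with the table approach: the table only contains lowercase keys, and PowerShell's -replace operator ignores case. A short illustration of the difference between -replace and its case-sensitive sibling -creplace (the variable names are just for this example):

```powershell
# -replace ignores case, so an uppercase 'Ö' matches the lowercase key 'ö'
# and is replaced by the lowercase 'o'.
$caseInsensitive = 'Ölaf' -replace 'ö', 'o'

# -creplace is the case-sensitive variant; 'Ö' no longer matches 'ö'.
$caseSensitive = 'Ölaf' -creplace 'ö', 'o'

$caseInsensitive   # olaf
$caseSensitive     # Ölaf
```

For login names and email addresses the lowercase result is usually fine, but if you need to preserve case you would have to add uppercase entries and switch to a case-sensitive replace.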
Edit: my colleague Gerbrand alerted me to a post by Grégory Schiro that solves this issue much more elegantly using native .NET functions. Here is my slightly modified version, which also strips anything that is not a-z, A-Z or 0-9:
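function Remove-DiacriticsAndSpaces
{
    Param(
        [String]$inputString
    )
    #decompose each accented letter into a base letter followed by combining marks
    $objD = $inputString.Normalize([Text.NormalizationForm]::FormD)
    $sb = New-Object Text.StringBuilder
    for ($i = 0; $i -lt $objD.Length; $i++) {
        $c = [Globalization.CharUnicodeInfo]::GetUnicodeCategory($objD[$i])
        #keep everything except the combining marks (the diacritics themselves)
        if($c -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
            [void]$sb.Append($objD[$i])
        }
    }
    #convert the StringBuilder to a string before normalizing back to composed form
    $sb = $sb.ToString().Normalize([Text.NormalizationForm]::FormC)
    #remove spaces and anything else that is not a plain letter or digit
    return($sb -replace '[^a-zA-Z0-9]', '')
}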
#example usage:
Remove-DiacriticsAndSpaces -inputString "Jösè"
#result:
Jose
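Note that normalization only helps for letters that Unicode decomposes into a base letter plus a combining mark. Letters such as ß, æ and ø have no such decomposition, so in the function above they reach the final regex unchanged and are simply dropped. Below is a minimal sketch of one way to handle them, assuming a small hand-picked fallback list. The function name Remove-DiacriticsWithFallback and the chosen replacements are my own illustration, not part of Grégory's post:

```powershell
function Remove-DiacriticsWithFallback {
    Param(
        [String]$inputString
    )
    # Hypothetical fallback list: letters that FormD leaves untouched.
    # String.Replace() is case-sensitive, so upper and lower case are mapped separately.
    $fallbacks = @(
        @('ß', 'ss'), @('æ', 'ae'), @('Æ', 'AE'),
        @('ø', 'o'),  @('Ø', 'O'),  @('ð', 'd'), @('Ð', 'D')
    )
    foreach ($pair in $fallbacks) {
        $inputString = $inputString.Replace($pair[0], $pair[1])
    }

    # Same decomposition approach as above: split letters into base + combining marks,
    # then keep everything that is not a combining mark.
    $decomposed = $inputString.Normalize([Text.NormalizationForm]::FormD)
    $sb = New-Object Text.StringBuilder
    foreach ($char in $decomposed.ToCharArray()) {
        $category = [Globalization.CharUnicodeInfo]::GetUnicodeCategory($char)
        if ($category -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
            [void]$sb.Append($char)
        }
    }

    # Strip whatever is still not a plain letter or digit (spaces included).
    return ($sb.ToString() -creplace '[^a-zA-Z0-9]', '')
}

#example usage:
Remove-DiacriticsWithFallback -inputString "Straße Ærø"
#result:
#StrasseAEro
```

The fallback list is deliberately short; extend it with whatever letters actually occur in your own user base.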
An even simpler one-liner comes from John Seerden; I converted it to a function:
function Remove-DiacriticsAndSpaces
{
    Param(
        [String]$inputString
    )
    #replace diacritics: encoding to the Cyrillic code page maps accented Latin letters to their closest plain letter,
    #after which the bytes are read back as ASCII
    $sb = [Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding("Cyrillic").GetBytes($inputString))
    #remove spaces and anything the above conversion may have missed
    return($sb -replace '[^a-zA-Z0-9]', '')
}
The most advanced function I’ve found so far is by Daniele Catanesi (PsCustomObject): https://github.com/PsCustomObject/New-StringConversion/blob/master/New-StringConversion.ps1. It supports all the features of the functions above and exposes them as parameters.
Hi Jos, thank you for sharing this post. I am now able to tackle those pesky diacritics.
I did come across an error when using the Remove-DiacriticsAndSpaces function and wanted to post a comment for the next person that finds your post as useful. On Line 17, the $sb (Text.StringBuilder obj) does not contain a method named Normalize() but a [string] object does.
I made a small adjustment to Line 17 to first output as string, then was able to call the Normalize() method:
$sb = $sb.ToString().Normalize([Text.NormalizationForm]::FormC)