Removing special characters from UTF8 input for use in email addresses or login names

When working with non-US customers, users often have characters in their names like ë, ó, ç and so on. Most of the time, a ‘human process’ converts these to their simple equivalent of e, o and c for use in computerized systems.

When searching for such a mapping of special characters to ‘safe’ characters I had a hard time finding a good list or PowerShell method to automatically convert special characters to standard A-Z characters so I wrote one:

function get-sanitizedUTF8Input{
    Param(
        [String]$inputString
    )
    $replaceTable = @{"ß"="ss";"à"="a";"á"="a";"â"="a";"ã"="a";"ä"="a";"å"="a";"æ"="ae";"ç"="c";"è"="e";"é"="e";"ê"="e";"ë"="e";"ì"="i";"í"="i";"î"="i";"ï"="i";"ð"="d";"ñ"="n";"ò"="o";"ó"="o";"ô"="o";"õ"="o";"ö"="o";"ø"="o";"ù"="u";"ú"="u";"û"="u";"ü"="u";"ý"="y";"þ"="p";"ÿ"="y"}

    foreach($key in $replaceTable.Keys){
        $inputString = $inputString -Replace($key,$replaceTable.$key)
    }
    $inputString = $inputString -replace '[^a-zA-Z0-9]', ''
    return $inputString
}

#example usage:
get-sanitizedUTF8Input -inputString "Jösè"
#result:
Jose

Edit: my colleague Gerbrand alerted me to a post by Grégory Schiro which solves this issue much more elegantly using native .NET functions. My slightly modified version to really ensure nothing non a-zA-Z0-9 gets past the function:

function Remove-DiacriticsAndSpaces
{
    Param(
        [String]$inputString
    )
    $objD = $inputString.Normalize([Text.NormalizationForm]::FormD)
    $sb = New-Object Text.StringBuilder
 
    for ($i = 0; $i -lt $objD.Length; $i++) {
        $c = [Globalization.CharUnicodeInfo]::GetUnicodeCategory($objD[$i])
        if($c -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
          [void]$sb.Append($objD[$i])
        }
      }
    
    $sb = $sb.ToString().Normalize([Text.NormalizationForm]::FormC)
    return($sb -replace '[^a-zA-Z0-9]', '')
}
#example usage:
Remove-DiacriticsAndSpaces -inputString "Jösè"
#result:
Jose

And an even easier oneliner I converted to a function by John Seerden:

function Remove-DiacriticsAndSpaces
{
    Param(
        [String]$inputString
    )
    #replace diacritics
    $sb = [Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding("Cyrillic").GetBytes($inputString))

    #remove spaces and anything the above function may have missed
    return($sb -replace '[^a-zA-Z0-9]', '')
}

And the most advanced function I’ve found so far is by 
Daniele Catanesi (PsCustomObject): https://github.com/PsCustomObject/New-StringConversion/blob/master/New-StringConversion.ps1 in which all features of above functions are supported and parameterized.

3
Leave a Reply

avatar
2 Comment threads
1 Thread replies
2 Followers
 
Most reacted comment
Hottest comment thread
2 Comment authors
JosTom Cheang Recent comment authors

This site uses Akismet to reduce spam. Learn how your comment data is processed.

  Subscribe  
newest oldest most voted
Notify of
Tom Cheang
Guest
Tom Cheang

Hi Jos, thank you for sharing this post. I am now able to tackle those pesky diacritics.

I did come across an error when using the Remove-DiacriticsAndSpaces function and wanted to post a comment for the next person that finds your post as useful. On Line 17, the $sb (Text.StringBuilder obj) does not contain a method named Normalize() but a [string] object does.

I made a small adjustment to Line 17 to first output as string, then was able to call the Normalize() method:

$sb = $sb.ToString().Normalize([Text.NormalizationForm]::FormC)

trackback

[…] Removing Special Characters From UTF8 Input For Use In Email Addresses or Login Names […]