Removing special characters from UTF8 input for use in email addresses or login names

Customers outside the US often have characters such as ë, ó and ç in their names. Most of the time, a ‘human process’ converts these to their plain equivalents (e, o and c) for use in computerized systems.

When searching for a mapping of special characters to ‘safe’ characters, I had a hard time finding a good list or PowerShell method that automatically converts special characters to standard A-Z characters, so I wrote one:

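function get-sanitizedUTF8Input{
    Param(
        [String]$inputString
    )
    # lowercase special characters and their plain replacements
    $replaceTable = @{
        "ß"="ss";"à"="a";"á"="a";"â"="a";"ã"="a";"ä"="a";"å"="a";"æ"="ae";"ç"="c"
        "è"="e";"é"="e";"ê"="e";"ë"="e";"ì"="i";"í"="i";"î"="i";"ï"="i";"ð"="d"
        "ñ"="n";"ò"="o";"ó"="o";"ô"="o";"õ"="o";"ö"="o";"ø"="o";"ù"="u";"ú"="u"
        "û"="u";"ü"="u";"ý"="y";"þ"="p";"ÿ"="y"
    }

    foreach($key in $replaceTable.Keys){
        $value = $replaceTable[$key]
        # -creplace is case-sensitive; plain -replace ignores case and would turn "Ö" into a lowercase "o"
        $inputString = $inputString -creplace $key, $value
        $inputString = $inputString -creplace $key.ToUpperInvariant(), $value.ToUpperInvariant()
    }
    # strip whatever is still not a plain letter or digit (spaces, punctuation, unmapped characters)
    $inputString = $inputString -replace '[^a-zA-Z0-9]', ''
    return $inputString
}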

#example usage:
get-sanitizedUTF8Input -inputString "Jösè"
#result:
Jose
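
PowerShell's -replace operator ignores case by default, which matters for a replacement table keyed on lowercase characters. A minimal illustration, with "Östen" as a made-up sample name:

```powershell
# -replace ignores case, so an uppercase "Ö" matches the lowercase pattern
# and comes back as a lowercase "o"
$caseInsensitive = 'Östen' -replace 'ö', 'o'

# -creplace is the case-sensitive variant and leaves the uppercase "Ö" alone
$caseSensitive = 'Östen' -creplace 'ö', 'o'

$caseInsensitive   # osten
$caseSensitive     # Östen
```

Whether a lowercased result is acceptable depends on the target system; for login names it often is, for display names probably not.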

Edit: my colleague Gerbrand alerted me to a post by Grégory Schiro which solves this issue much more elegantly using native .NET functions. Here is my slightly modified version, which makes sure that nothing outside a-zA-Z0-9 gets past the function:

function Remove-DiacriticsAndSpaces
{
    Param(
        [String]$inputString
    )
    # decompose each accented character into its base letter plus combining marks
    $objD = $inputString.Normalize([Text.NormalizationForm]::FormD)
    $sb = New-Object Text.StringBuilder

    # copy every character except the combining marks
    for ($i = 0; $i -lt $objD.Length; $i++) {
        $c = [Globalization.CharUnicodeInfo]::GetUnicodeCategory($objD[$i])
        if ($c -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
            [void]$sb.Append($objD[$i])
        }
    }

    # recompose what is left, then remove spaces and anything else outside a-zA-Z0-9
    $sb = $sb.ToString().Normalize([Text.NormalizationForm]::FormC)
    return($sb -replace '[^a-zA-Z0-9]', '')
}
#example usage:
Remove-DiacriticsAndSpaces -inputString "Jösè"
#result:
Jose
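
Unicode decomposition only helps for characters that consist of a base letter plus a combining mark. Letters such as ß, æ, ø, ð and þ have no such decomposition, so the function above drops them in the final -replace instead of transliterating them. Below is a minimal sketch of one way to combine a small lookup with normalization. The name Convert-ToAsciiName and the replacement choices (for example þ to th) are assumptions made for illustration and do not come from the posts referenced here:

```powershell
function Convert-ToAsciiName
{
    Param(
        [String]$inputString
    )
    # letters without a canonical decomposition; the replacements are illustrative choices
    $pairs = @(
        @('ß', 'ss'), @('æ', 'ae'), @('Æ', 'AE'), @('ø', 'o'), @('Ø', 'O'),
        @('ð', 'd'), @('Ð', 'D'), @('þ', 'th'), @('Þ', 'Th')
    )
    foreach ($pair in $pairs) {
        # String.Replace is case-sensitive, so upper and lower case are listed separately
        $inputString = $inputString.Replace($pair[0], $pair[1])
    }

    # split accented characters into base letter + combining marks
    $decomposed = $inputString.Normalize([Text.NormalizationForm]::FormD)

    # \p{Mn} matches non-spacing marks, the same Unicode category the loop-based version skips
    $stripped = $decomposed -replace '\p{Mn}', ''

    # remove spaces and anything else that is not a plain letter or digit
    return ($stripped -replace '[^a-zA-Z0-9]', '')
}

#example usage:
Convert-ToAsciiName -inputString "Straße"      # Strasse
Convert-ToAsciiName -inputString "Søren Ærø"   # SorenAEro
```

Note that Windows PowerShell 5.1 reads script files without a byte order mark using the system's ANSI code page, so a script containing these literal characters is best saved as UTF-8 with BOM.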

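And an even simpler one-liner by John Seerden, which I converted to a function. It uses the same name as the function above, so if both are defined in one session, the last definition wins: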

function Remove-DiacriticsAndSpaces
{
    Param(
        [String]$inputString
    )
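    #replace diacritics: accented Latin letters do not exist in the Cyrillic code page, so the encoder substitutes the closest plain letter where it knows one
    $sb = [Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding("Cyrillic").GetBytes($inputString))

    #remove spaces and anything the conversion above may have missed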
    return($sb -replace '[^a-zA-Z0-9]', '')
}

And the most advanced function I’ve found so far is by Daniele Catanesi (PsCustomObject): https://github.com/PsCustomObject/New-StringConversion/blob/master/New-StringConversion.ps1, in which all features of the above functions are supported and parameterized.

Tom Cheang
4 years ago

Hi Jos, thank you for sharing this post. I am now able to tackle those pesky diacritics.

I did come across an error when using the Remove-DiacriticsAndSpaces function and wanted to post a comment for the next person that finds your post as useful. On Line 17, the $sb (Text.StringBuilder obj) does not contain a method named Normalize() but a [string] object does.

I made a small adjustment to Line 17 to first output as string, then was able to call the Normalize() method:

$sb = $sb.ToString().Normalize([Text.NormalizationForm]::FormC)
