Thursday, December 17, 2020

UTF-8 : Where Ινδία is bigger than India

A customer of our data storage product once reported a peculiar problem.
They were copying files from a NetApp system to their newly set up IBM Storwize V7000 Unified system.
Many of the files had Greek characters in their names, and some of those names were really long.
Files whose names were Greek and longer than 125 characters failed to copy.
Being the Linux flag bearer, the issue came my way.

My investigation covered every place where the issue could be: Robocopy, the tool used for copying the files; Samba, which implements the SMB protocol used for the copy; and the locale in use on the Unified system.  Since IBM Storwize V7000 Unified runs on RHEL, I opened a dialogue with Red Hat as well.

The investigation converged on the file name length limit imposed by the operating system, Linux, and by the file system, GPFS.  Both limit a file name to 255 bytes.  On Linux, the limit comes from the NAME_MAX macro in the limits.h header.

     #define NAME_MAX         255    /* # chars in a file name */
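The limit can also be queried at run time for a particular directory, which is handy when the file system might differ from the compile-time constant. Here is a minimal sketch using the POSIX `pathconf` call; the helper name `name_max_for` is made up for illustration, and falling back to NAME_MAX when the call reports nothing is my own choice, not something from the original investigation.

```c
#define _POSIX_C_SOURCE 200809L

#include <limits.h>   /* NAME_MAX */
#include <unistd.h>   /* pathconf, _PC_NAME_MAX */

/*
 * Hypothetical helper: returns the maximum file name length, in bytes,
 * for entries created inside `dir`.
 *
 * pathconf() returns -1 both when the call fails and when the file
 * system reports no limit; in either case this sketch falls back to
 * the compile-time NAME_MAX.
 */
long name_max_for(const char *dir)
{
    long limit = pathconf(dir, _PC_NAME_MAX);
    if (limit == -1) {
        return NAME_MAX;
    }
    return limit;
}
```

Calling `name_max_for(".")` from a small `main` and printing the result shows the limit for the current directory. Note that the value is a count of bytes, not of characters.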

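Both RHEL and GPFS were using UTF-8 encoding for file names.  UTF-8 needs 2 bytes to store one modern Greek letter, so a 255-byte limit leaves room for at most 127 of them.  That is why files whose names ran past roughly 125 Greek characters were failing to copy, while names of the same length in plain ASCII would have fit.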
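The gap between characters and bytes is easy to see in code. The sketch below is only an illustration: the function names `utf8_codepoints` and `fits_in_name_limit` are hypothetical, and the counting assumes well-formed UTF-8 input.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count the code points in a well-formed UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx, so every byte
 * that does NOT match that pattern starts a new code point.
 */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/*
 * Return 1 if `name` fits within `limit_bytes`, 0 otherwise.
 * strlen() counts bytes, which is what the file system limit is
 * measured in.
 */
int fits_in_name_limit(const char *name, size_t limit_bytes)
{
    return strlen(name) <= limit_bytes;
}
```

For the UTF-8 string "Ινδία", `utf8_codepoints` gives 5 while `strlen` gives 10; for "India" both give 5. A name made of 127 two-byte Greek letters takes 254 bytes and fits under a 255-byte limit, while 128 letters need 256 bytes and do not. A cutoff nearer 125 would be consistent with a few extra bytes in the name, such as an extension, though that part is my assumption.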

Someday I want to try this experiment: update the NAME_MAX macro in limits.h and recompile, so that files with longer names can see the light of day.
