Tuesday, February 20, 2024

The fsck of IBM Storage Scale

What is fsck
Wikipedia says: The system utility fsck (file system consistency check) is a tool for checking the consistency of a file system in Unix and Unix-like operating systems, such as Linux, macOS, and FreeBSD.


A file system is a method of storing, organizing, and managing data on a storage medium.
File system consistency refers to the correctness and validity of the file system's on-disk structures.  Inconsistencies are usually caused by power failures, hardware failures, or improper shutdown of the system.

Every file system has its own way of laying out its data structures on the storage medium, and fsck works directly with those on-disk structures.  So an fsck is always specific to a particular file system.  In other words, every file system must have its own fsck program.

Traditionally, before running fsck to check the health of a file system, the file system must be unmounted.


Why is fsck needed

Considering the reliability of today's hardware and the robustness of the software, it is rare for a file system to get corrupted. However, if data corruption does happen in a file system, it needs to be detected and repaired.

In case of a power failure, the server may not shut down correctly, and data in memory may not get written to disk.  This creates inconsistencies which, if not corrected, may create further trouble later.

In rare cases, disks develop bad sectors. Such disks need to be replaced, and fsck needs to be run afterwards to ensure the integrity of the file system.

If applications report input/output errors when accessing or storing data, the file system may have inconsistencies, and running fsck is needed in such cases.



What is IBM Storage Scale
IBM's clustered file system that provides parallel data access from multiple nodes is branded as Storage Scale (formerly known as GPFS and IBM Spectrum Scale).

To know more about Storage Scale, please visit https://www.ibm.com/docs/en/storage-scale/5.1.8?topic=overview-storage-scale

Here is a brief overview of IBM Storage Scale.
Storage Scale is a file system, but not an ordinary file system running locally on a single computer. Storage Scale runs on multiple computers, which together form a cluster; they are known as the nodes of the cluster.  Some of the nodes are given access to storage that is present in the network.  This storage is made available to the nodes in the form of Network Shared Disks (NSDs), and the NSDs are used to create file systems.  Customers use these file systems to store and access data, via NFS, CIFS, or object protocols.

Storage Scale provides concurrent high-speed file access. Applications that access the data may be running on multiple systems and accessing the data in parallel.

A Storage Scale cluster may consist of a single node or many nodes.  Nodes can be assigned roles such as quorum node or manager node.  Rather than entering into the details of Storage Scale, let's focus on our agenda - fsck.



The fsck of IBM Storage Scale
As mentioned earlier, an fsck tool is always specific to a particular file system. For Storage Scale, IBM's engineers have written their own fsck program, specific to Storage Scale.  It is named mmfsck.  All commands of Storage Scale begin with mm.  If you wonder why, mm stands for multimedia.  This file system was developed about 25 years ago.  In those days, large amounts of storage were a luxury, multimedia was an emerging technology, and multimedia was expected to need storage in huge quantities. Following that trend, all commands of the new file system were prefixed with mm.

When IBM engineers developed this new file system, they developed the mmfsck tool as well. The file system has to be unmounted before running mmfsck on it. Are you thinking "this is a limitation"?  Well, unmounting the file system before running fsck is the traditional requirement, but some of us do think differently.  Why must I take a downtime just to run mmfsck? Why can't it be done while the file system is online and in use?  We took this thought forward and developed another variant of fsck which does not require the file system to be unmounted. It works while the file system is mounted and in use. We named it mmfsckx.  The x stands for eXtended.

So for Storage Scale we now have two variants of fsck: mmfsck, which requires the file system to be unmounted, and mmfsckx, which works while the file system is mounted and in use.  Typically we refer to mmfsck as offline fsck and to mmfsckx as online fsck.
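To make the difference concrete, here is a minimal Python sketch of how an administrator might drive the two variants from a script. The file system name fs1 is made up, and the commands and options (mmumount/mmmount with -a, mmfsck with -n for check-only and -y for repair, mmfsckx with just the device name) are taken from the IBM documentation as I remember them, so do verify them against the reference pages for your release.

    import subprocess

    FS = "fs1"   # hypothetical file system (device) name

    def run(cmd):
        # Print and run a Storage Scale command, stopping on the first failure.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def offline_fsck():
        # mmfsck needs the file system unmounted.
        run(["mmumount", FS, "-a"])     # unmount on all nodes
        run(["mmfsck", FS, "-n"])       # -n: check only, report problems
        # run(["mmfsck", FS, "-y"])     # -y: check and repair
        run(["mmmount", FS, "-a"])      # mount it back on all nodes

    def online_fsck():
        # mmfsckx works while the file system stays mounted and in use.
        run(["mmfsckx", FS])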

Although the user may not be aware of this, fsck is a separate program from the file system's kernel code, and their purposes differ as well. The fsck analyzes the file system's metadata for the purpose of performing repairs, while the kernel code manages the file system's operations during normal usage.


Is Storage Scale the only file system that has online fsck?  Obviously not.  For example, see XFS Online Fsck Design and BeeGFS File System Check


What next?
Okay, you have a niche file system that lets you run fsck without unmounting, that is, without any downtime involved. So can you improve what you already have?  Can you go beyond?  When this thought came to my mind, here are the thoughts that followed.

1. fsck for filesets
Say we have a huge file system that is being used in a multi-tenant cloud environment.  In some cases there is a separate fileset for every customer, which is known as "fileset-based multi-tenancy".  A fileset is a way of dividing a file system into smaller units that can be accessed and administered separately.

Running online fsck on a huge file system takes a long time; it could be a few days, depending on the amount of data present in the file system. And if a problem is reported by a particular customer, then we would know the fileset in question.  A fileset is a part of the file system that can be used separately.  So why not run online fsck on a single fileset, or a few filesets, rather than on the entire file system?  Of course, running online fsck on a fileset may not detect all issues, but it is a good start. If we could detect and repair all the issues within the fileset, we would avoid running online fsck on the entire file system.  That would be a big bonus. Even if we could detect only some of the issues in a fileset, I'd say that is still a battle half won. So running online fsck on filesets is a useful functionality we'd want to have.  A rough sketch of the idea follows the link below.
To know more about filesets, please visit https://www.ibm.com/docs/en/storage-scale/5.1.8?topic=scale-filesets
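To illustrate the idea (and nothing more), here is a tiny Python sketch of a fileset-scoped check: instead of validating every inode in the file system, the scan skips inodes that do not belong to the fileset in question. The scan_inodes iterator, the fileset_id attribute, and the validate_inode function are made-up stand-ins for the real metadata scanning code, not Storage Scale APIs.

    def fileset_scoped_check(fs, fileset_id, scan_inodes, validate_inode):
        # scan_inodes(fs) yields (inode_number, inode_record) pairs and
        # validate_inode(record) returns a list of problems found; both are
        # hypothetical helpers standing in for the real scanning code.
        problems = []
        for ino, record in scan_inodes(fs):
            if record.fileset_id != fileset_id:
                continue                  # inode belongs to another fileset: skip it
            problems.extend(validate_inode(record))
        return problems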

2. Self-healing file systems
If some part of a disk goes bad, the resulting data corruption may not be noticed immediately.  Detecting such corruption much later may lead to unwanted consequences. So what if these situations could be avoided proactively?  Is there a way?  What if we had a program that notices that the system is idle and runs online fsck during the idle interval?  Then data corruptions would be detected and repaired before the users of the data ever notice them.  A self-healing file system is the true masterpiece we'd want to have.
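As a thought experiment, here is a minimal Python sketch of such a scheduler. The file system name fs1 and the idle threshold are made up, and get_load is a crude stand-in that only looks at the local node's load average; a real version would aggregate load and IO statistics from every node in the cluster.

    import os
    import subprocess
    import time

    FS = "fs1"                 # hypothetical file system (device) name
    IDLE_THRESHOLD = 0.10      # assumed: below 10% load we call the system "idle"
    CHECK_INTERVAL = 600       # look at the load every 10 minutes

    def get_load():
        # Crude stand-in: 1-minute load average of this node, normalised
        # by CPU count.
        return os.getloadavg()[0] / os.cpu_count()

    def self_healing_loop():
        while True:
            if get_load() < IDLE_THRESHOLD:
                # System looks idle: run online fsck while the file
                # system stays mounted and in use.
                subprocess.run(["mmfsckx", FS], check=True)
            time.sleep(CHECK_INTERVAL)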

3. Self-healing filesets
As mentioned earlier, running online fsck on a huge file system takes a long time, depending on how much data is stored in the file system.  So if we could have online fsck run on a single fileset or a few filesets, then we could use that feature during the idle intervals of the Storage Scale cluster.  By combining the two features, we would have self-healing filesets.  Is there a cloud that has already implemented this?  If you know, you tell me  :-)

4. Performance
When we think about performance, there are two aspects to consider:
(a) For a given file system, how much time is taken for running offline fsck vs online fsck
(b) What can be done to improve the performance of running online fsck
Let's consider these one at a time.

(a) For a given file system, how much time is taken for running offline fsck vs online fsck
When we run offline fsck, the file system is unmounted, so there is no IO workload and the load on the worker nodes is low.  While running online fsck, the file system is mounted and IO workload is in progress, so the load on the worker nodes may be high, depending on the IO workload.  Considering this, the time required to run offline fsck is usually much less than the time required to run online fsck.  In theory, we would all agree with this.  The results obtained during functional testing indicate that the time difference is small when the amount of data in the file system is small.  The more data the file system contains, the bigger the time difference becomes.  When the file system contains an enormous amount of data, say petabytes, the time difference is huge; in one particular case the online fsck took 10 times longer than the offline fsck.

(b) What could be done to improve the performance of running online fsck
By default, online fsck uses all available nodes of the cluster to do the work.  The total work is divided into portions, and each node does some part of the work.  The file system manager node manages the distribution of work.
Not all nodes of the cluster have the same amount of memory and processing power, and the IO workload may not be the same on all nodes either.  So if there are, say, 15 worker nodes, making 15 equal portions of the entire work and allocating one portion to each node may not be the most efficient strategy.

Depending on the available memory, the available processing power, and the current load on each node, an appropriate amount of work could be allocated to each node so that the work is distributed evenly across all available nodes.  Moreover, if the execution takes a long time, the IO workload on the nodes may change along the way, so the original calculation may no longer be the most efficient distribution.  In such cases, recalculating how the remaining work is distributed between the nodes would be beneficial.
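Here is a minimal Python sketch of such a weighted split. The capacity score that combines free memory, CPU count, and current load is my own invention for illustration; the actual work distribution logic of mmfsckx is not published in this form.

    # Hypothetical sketch: split N work units across nodes in proportion to
    # a capacity score instead of giving every node an equal share.

    def capacity_score(node):
        # node carries free_mem_gb, cpus and load (0.0 - 1.0);
        # the weighting below is arbitrary and only for illustration.
        return (node["free_mem_gb"] + 4 * node["cpus"]) * (1.0 - node["load"])

    def split_work(total_units, nodes):
        scores = {n["name"]: capacity_score(n) for n in nodes}
        total_score = sum(scores.values())
        shares = {name: int(total_units * s / total_score)
                  for name, s in scores.items()}
        # Hand any rounding leftovers to the highest-capacity node.
        leftover = total_units - sum(shares.values())
        best = max(scores, key=scores.get)
        shares[best] += leftover
        return shares

    nodes = [
        {"name": "node1", "free_mem_gb": 64,  "cpus": 16, "load": 0.2},
        {"name": "node2", "free_mem_gb": 32,  "cpus": 8,  "load": 0.7},
        {"name": "node3", "free_mem_gb": 128, "cpus": 32, "load": 0.1},
    ]
    print(split_work(1500, nodes))   # the lightly loaded, bigger node gets the most work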

Another, simpler strategy is also possible. Not all nodes will finish their portion of the work at the same time; some may finish earlier than others, depending on the IO workload. The nodes that complete their portion of the work can be allocated a smaller portion of the remaining work. This redistribution helps complete the entire work in less time.
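One way to get this behaviour almost for free is to carve the total work into many small chunks and let every worker pull the next chunk as soon as it finishes its current one, so faster or less loaded nodes naturally end up doing more of the work. The Python sketch below uses threads standing in for nodes; it only illustrates the idea and says nothing about how mmfsckx actually distributes its work.

    import queue
    import threading

    def worker(name, chunks, process_chunk):
        # Each "node" keeps pulling chunks until none are left, so nodes
        # that finish early simply take on more of the remaining work.
        while True:
            try:
                chunk = chunks.get_nowait()
            except queue.Empty:
                return
            process_chunk(name, chunk)

    def run_distributed(work_items, node_names, process_chunk, chunk_size=100):
        chunks = queue.Queue()
        for i in range(0, len(work_items), chunk_size):
            chunks.put(work_items[i:i + chunk_size])
        threads = [threading.Thread(target=worker, args=(n, chunks, process_chunk))
                   for n in node_names]
        for t in threads:
            t.start()
        for t in threads:
            t.join()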

There could be more ways to improve the performance of online fsck that I have not listed here.  Some food for thought for the reader.



References
1. https://en.wikipedia.org/wiki/Fsck
2. https://lwn.net/Articles/248180/
3. https://www.adminschoice.com/repairing-unix-file-system-fsck
4. https://linux.die.net/man/8/fsck
5. https://www.ibm.com/docs/en/aix/7.3?topic=f-fsck-command
6. https://www.ibm.com/docs/en/storage-scale/5.1.9?topic=reference-mmfsckx-command