If you are sensing a pattern in some of my blogs, you are probably correct. One of the reasons I write so much about HPCC storage is that this is one of the biggest headaches for HPC right now. I hope everyone is learning something from these articles/blogs.

In this blog I want to talk about knowing your current storage and its behavior. Storage almost seems to be alive in the respect that it grows every day (“It’s alive!”). Files are created, sometimes deleted, accessed, etc. However, how well do you truly know how your storage behaves? How many files are created every day? How many are deleted? How many are accessed?

Recent Study

There was a fairly recent study from the University of California, Santa Cruz and NetApp that examined CIFS storage within NetApp. Part of the storage was deployed in the corporate data center where the hosts were used by over 1,000 marketing, sales, and finance employees. The second part of the storage was a high-end file server deployed in the engineering data center and used by over 500 engineering employees.

While people may argue, with some validity, that the study is not focused on HPCC, let me point out that part of the study was done within the engineering center. The storage was CIFS, which is not likely to be used within a cluster, but I think that the results give us some hints about what’s going on with the data.

Robin Harris at Storage Mojo writes a very good blog that looks at storage, many times from an HPCC perspective. Robin has written about the study and come up with some very interesting observations (from this article):

Some significant differences from prior studies:

  • Workloads are more write-oriented. Read/write byte ratios are now only 2:1 compared to the 4:1 or higher ratios reported earlier.
  • Workloads are less read-centric. Read/write workloads are now 30 times more common.
  • Most bytes transferred sequentially. These runs are 10 times the length found in the old studies.
  • Files are 10 times bigger.
  • Files live 10 times longer. Less than half are deleted within a day of creation.

Cool new findings:

  • Files are rarely reopened. Over 66 percent are reopened once and 95 percent fewer than five times.
  • Over 60 percent of file reopenings are within a minute of the first opening.
  • Less than 1 percent of clients account for 50 percent of requests.
  • Infrequent file sharing. Over 76 percent of files are opened by just one client.
  • Concurrent file sharing is very rare. As the prior point suggests, only 5 percent of files are opened by multiple clients, and 90 percent of those are read-only.
  • Most file types have no common access pattern.

And there’s this: over 90 percent of the active storage was untouched during the study. That makes it official: data is getting cooler.

While one cannot take this data and make sweeping conclusions, I do think it provides an interesting data point. In particular, there are two observations I think are pertinent to HPC:

  1. Files are rarely reopened.
  2. Over 90 percent of the active storage was untouched during the study.

These observations tell me that for this study, after a small window of time, data is rarely reopened or touched.

Robin Harris then wrote another blog that talked about some implications of this study for deduplication. In particular, he made a couple of comments that I’m paraphrasing below:

  1. Given that 90 percent of the active storage was untouched, how important is performance?
  2. Since data is not frequently accessed, then there may not be a need for a high-performance deduplication package (it can easily be done in the background).

Robin was making these points about non-engineering and likely non-HPC data, but I think the questions he raises are absolutely dead-on for all storage including HPC.

Implications for HPC Storage

In many discussions with HPCC customers, often the customer tells me that they need “massive” amounts of storage, and it has to be online all of the time and they need it now (as in yesterday). When I ask the inevitable question, “Are you sure that all of the data has to be online all of the time?” the answer is always “yes.” When I also ask, “Do you have to have all XX TB online at this moment?” the answer is invariably “yes.” But I think these are quick answers rather than true answers.

Given the observations of Robin Harris from this recent study, I think it behooves everyone to take a look at their data. In particular, we all need to understand the profile of the data, which means we need to know how old our data is, when the files were last accessed or modified, and when the files were created. From this information we can tell how quickly our data is growing and how much data and/or how many files are older than a given age.

With this information we can determine whether there is data that hasn’t been accessed or touched for long periods of time. That data can then be archived to save space and perhaps also improve performance. Archived data doesn’t need to sit on spinning media (likely to cost more), so you can reduce the amount of online space you need. In addition, watching how your data grows while you are also archiving it gives you a more accurate picture of that growth. Consequently, you know how much storage space you need and when.

How can you accomplish this? I’m glad you asked. What you want to do is scan the file system, looking at the size of each file as well as its creation date, the date it was last accessed, and the date it was last modified. (Note that file systems track changes to a file’s data and changes to its metadata separately.)

While this sounds fairly easy, it’s not quite as easy as you might think. To make it easier, I wrote a quick Python script to do this for you (you’re welcome). The script is attached below. You just run the script from the root directory you want to scan, for example /home. The script produces a log of all of the files below that root directory, as well as a CSV (comma-delimited) file that can be read into a spreadsheet.

Here’s a quick, simple example of the output:

[laytonj@home8 FS_Scan]$ ./FS_scan.py .
Starting directory (root): /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan


Current directory /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./FS_scan.py
- size: 2068 bytes
- created: Tue Dec 2 14:14:43 2008
- last accessed: Thu Dec 4 09:50:08 2008
- last modified: Sun Oct 12 14:51:41 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./FS_scan_0.01.py
- size: 2068 bytes
- created: Sun Nov 9 11:18:07 2008
- last accessed: Sat Nov 15 10:20:17 2008
- last modified: Sun Nov 9 11:18:07 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search2.html
- size: 40606 bytes
- created: Tue Dec 2 14:14:32 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 18:08:31 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search.html
- size: 24849 bytes
- created: Tue Dec 2 14:14:32 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 17:57:56 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./report.csv
- size: 0 bytes
- created: Thu Dec 4 09:50:08 2008
- last accessed: Mon Dec 1 20:35:45 2008
- last modified: Thu Dec 4 09:50:08 2008
- owner: 500 500 or: laytonj laytonj
Current directory /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search2_files/
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search2_files/smnavbar.gif
- size: 9973 bytes
- created: Sat Oct 11 18:08:31 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 18:08:31 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search2_files/index.gif
- size: 565 bytes
- created: Sat Oct 11 18:08:31 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 18:08:31 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search2_files/txtpreva.gif
- size: 588 bytes
- created: Sat Oct 11 18:08:31 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 18:08:31 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search2_files/txthome.gif
- size: 320 bytes
- created: Sat Oct 11 18:08:31 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 18:08:31 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search2_files/smbanner.gif
- size: 5503 bytes
- created: Sat Oct 11 18:08:31 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 18:08:31 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search2_files/txtnexta.gif
- size: 419 bytes
- created: Sat Oct 11 18:08:31 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 18:08:31 2008
- owner: 500 500 or: laytonj laytonj
Current directory /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search_files/
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search_files/smnavbar.gif
- size: 9973 bytes
- created: Sat Oct 11 17:57:56 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 17:57:56 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search_files/index.gif
- size: 565 bytes
- created: Sat Oct 11 17:57:56 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 17:57:56 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search_files/txtpreva.gif
- size: 588 bytes
- created: Sat Oct 11 17:57:56 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 17:57:56 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search_files/txthome.gif
- size: 320 bytes
- created: Sat Oct 11 17:57:56 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 17:57:56 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search_files/smbanner.gif
- size: 5503 bytes
- created: Sat Oct 11 17:57:56 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 17:57:56 2008
- owner: 500 500 or: laytonj laytonj
File path: /home/laytonj/CLUSTER/CLUSTERBUFFER/FS_Scan/./search_files/txtnexta.gif
- size: 419 bytes
- created: Sat Oct 11 17:57:56 2008
- last accessed: Tue Nov 4 11:38:40 2008
- last modified: Sat Oct 11 17:57:56 2008
- owner: 500 500 or: laytonj laytonj


Staring Time: Thu Dec 4 09:50:08 2008
Ending Time: Thu Dec 4 09:50:09 2008


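From this output you can get the dates when the files were created, last accessed, and last modified. One caution: on Linux the “created” value most likely reflects the inode change time (ctime), which is updated whenever a file’s metadata changes, rather than a true creation time. That would explain why some files above show a “created” date later than their “last modified” date. You can also load the CSV file into a spreadsheet and sort on any of the dates you want (or anything you want). For example, you could sort on the time last accessed.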

The script that produced this output is attached to this Web page. But in case you want to get the latest version you can go to this site and download the latest (plus, I hope to be developing a set of tools around this basic premise so that you can track what is happening with the files).
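
If you would rather roll your own, here is a minimal sketch of this kind of scan. To be clear, this is not the FS_scan.py script itself, just an illustration of the idea using the Python standard library. The function names (scan_tree, write_report) and the CSV column names are my own assumptions for the example, and the “ctime” column is whatever the operating system reports as st_ctime (the inode change time on Linux).

```python
import csv
import os
import time


def scan_tree(root):
    """Walk `root` and return one record per non-directory entry."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            records.append({
                "path": path,
                "size": st.st_size,    # bytes
                "atime": st.st_atime,  # last access
                "mtime": st.st_mtime,  # last data modification
                "ctime": st.st_ctime,  # inode change time on Linux
            })
    return records


def write_report(records, csv_path):
    """Write the records to a CSV file, least recently accessed first."""
    fields = ["path", "size", "atime", "mtime", "ctime"]
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for rec in sorted(records, key=lambda r: r["atime"]):
            row = dict(rec)
            # Convert epoch seconds to readable local-time strings.
            for key in ("atime", "mtime", "ctime"):
                row[key] = time.ctime(rec[key])
            writer.writerow(row)
```

Calling write_report(scan_tree("/home"), "scan_report.csv") would scan everything below /home and leave the coldest files at the top of the spreadsheet (the output file name is just a placeholder). Keep in mind that file systems mounted with options such as noatime or relatime may not update access times on every read, so treat the access dates as approximate.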

ToDo’s

As a professor I liked to give homework only as a reinforcement of what I was lecturing on and to also point to some of the important points and nuances of the topic. My students never really complained, and I was pretty good about allowing late homework if the student came to me early and explained why. So, I’m not too concerned about giving homework now. :)

Your homework, and I won’t be collecting it, is to examine the files on your storage. In particular, look for the dates when the files were last accessed, created, and modified. You can use the script I’ve provided or use any tools you would like. Then take a look at the data output from your scan. You can put the data into a spreadsheet and order the files by date (any of the three dates). I think you’ll be surprised by what you find. :) In particular, notice how old some files are and how long it has been since they were last accessed.

If you want to have a little more fun (or frustration depending upon your vantage point), sum the sizes of the files that fall into age ranges such as:

  • Older than 1 year
  • 6 months – 1 year
  • 3 months – 6 months
  • 1 month – 3 months
  • 1 week – 1 month
  • 1 day – 1 week
  • Less than 1 day

It is a lot of categories, but I think the results will be very interesting. :) When I did this on my own desktop and small cluster I was very surprised at the results.
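
One way to do the summing without a spreadsheet is sketched below. It is only an illustration under a few assumptions of mine: a month is treated as 30 days and a year as 365 days, the bucket labels and the bucket_sizes name are made up for the example, and each file is represented as a (size, timestamp) pair where the timestamp is whichever of the three dates you care about, in seconds since the epoch.

```python
import time

DAY = 24 * 60 * 60  # seconds

# (label, upper age bound in seconds), checked in order; first match wins.
# Assumption: 1 month = 30 days, 1 year = 365 days.
BUCKETS = [
    ("less than 1 day", 1 * DAY),
    ("1 day - 1 week", 7 * DAY),
    ("1 week - 1 month", 30 * DAY),
    ("1 month - 3 months", 90 * DAY),
    ("3 months - 6 months", 180 * DAY),
    ("6 months - 1 year", 365 * DAY),
    ("older than 1 year", float("inf")),
]


def bucket_sizes(files, now=None):
    """Sum file sizes (in bytes) per age bucket.

    `files` is an iterable of (size_in_bytes, timestamp) pairs.
    `now` defaults to the current time and can be fixed for testing.
    """
    if now is None:
        now = time.time()
    totals = {label: 0 for label, _limit in BUCKETS}
    for size, stamp in files:
        age = max(0.0, now - stamp)  # clamp timestamps that lie in the future
        for label, limit in BUCKETS:
            if age < limit:
                totals[label] += size
                break
    return totals
```

The pairs could come from os.lstat(path).st_size and os.lstat(path).st_atime for each file you scan. Passing access times tells you how much data is cold, while passing modification times tells you how much has simply stopped changing.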

Summary

I wanted to end with a quick summary of some things that I think people need to think about and perhaps pay more attention to (I can’t help myself—I was a teacher, so I really like to finish with a summary of key points). These points are:

  1. Look at the various ages (creation, access, modified) of the files on your system. I think you will be very surprised by what you find.
  2. In my experience very, very few people examine their storage usage except for the occasional “df” or something like that. I think people are doing themselves a disservice by not understanding what’s going on with their storage. (I call this data profiling.)
  3. Examining how your storage changes over time will allow you to better understand your true storage needs (i.e., you can save money by not buying as much as you think you need or as much as researchers are screaming for).
  4. Knowing the age of various files allows you to also begin to archive data from your high-performance storage system ($$) to a lower-cost archive system ($). This saves you money (and we all know you can get promoted by saving money rather than spending it).

I don’t want to belittle anyone because I know we’re all busy, but take a few minutes and start looking at your file system. While it sounds geeky, I think you will be surprised by what you find. Plus you will gain valuable insight into what’s going on with your storage.