Feb 17, 2020

Quick&dirty moving of a billion files from a single directory

I had more than a billion files in a single directory on a ZFS (on Linux) volume.

If you have multiple computers, each with some simple, naive software writing a small statistics log file to a write-only shared directory every minute, it is not hard to get there.

Note that this is obviously not a good solution for logging, and I don't endorse this practice, but this is what we had, and simple commands like the following were no longer completing in reasonable time:
cp /mnt/stat/data/2019_06_23_* /home/kalmi/incident_06_23/

Running ls or find took "forever" and affected the performance of the rest of the system. I had never considered before that ls, by default, has to load the whole directory listing into memory (or swap 👻) in order to sort the entries.

It was getting quite inconvenient. Something needed to be done: I wanted to do some processing on some of the existing files, and for that I needed commands to complete in a reasonable time.

I decided to move the existing files into a year/month/day folder structure, and to write a Python script to do the moving. The original filenames contained a timestamp, and that is all that's needed to determine the destination. I couldn't find a Python library that allowed unsorted listing of such a large directory in reasonable time without buffering the whole thing; I expected to be able to just iterate with glob.glob, but it didn't perform as expected.
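
For example, a filename of this shape maps to its destination like so (the example name is made up, but it follows the same name_date_hour_ns pattern the script below expects):

import os

# Hypothetical example filename; the real logs had the same underscore-separated
# shape, with the date part formatted as YYYY-MM-DD.
fname = "stats_2019-06-23_14_1234"
name, date, hour, ns = fname.split('_')
year, month, day = date.split('-')
print(os.path.join(year, month, day, fname))  # -> 2019/06/23/stats_2019-06-23_14_1234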

I had already figured out that ls with the right flags can do that (-f skips sorting, -1 prints one entry per line, and -p appends a / to directories so that grep -v / can filter them out), so I just piped the output of ls into my Python script that actually moved the files, and the following solution was born:

cd /mnt/stat/data && ls -p -1 -f | grep -v / | python3 ../iter.py3

This line was then added to cron so that we wouldn't end up with a billion files in one directory again.
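
Something along these lines would do it (the schedule below is only illustrative):

0 * * * * cd /mnt/stat/data && ls -p -1 -f | grep -v / | python3 ../iter.py3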

The contents of iter.py3:

# Reads filenames from stdin (one per line) and moves each file into a
# year/month/day directory based on the YYYY-MM-DD date embedded in its name.
import fileinput
import os
import errno

for line in fileinput.input():
    print(line.strip())
    if len(line.strip().split('_')) != 4:
        print('SKIPPED')
        continue
    name, date, hour, ns = line.strip().split('_')
    if len(date.split('-')) != 3:
        print('SKIPPED')
        continue
    year, month, day = date.split('-')
    if not year.isdigit() or not month.isdigit() or not day.isdigit():
        print('SKIPPED')
        continue
    # Create the year/month/day directories, ignoring "already exists" errors
    try:
        os.mkdir(year)
    except OSError as e:
        if e.errno == errno.EEXIST:
            pass
        else:
            raise
    try:
        os.mkdir(os.path.join(year,month))
    except OSError as e:
        if e.errno == errno.EEXIST:
            pass
        else:
            raise
    try:
        os.mkdir(os.path.join(year,month,day))
    except OSError as e:
        if e.errno == errno.EEXIST:
            pass
        else:
            raise
    os.rename(line.strip(), os.path.join(year,month,day,line.strip()))
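
Note that os.rename only works within a single filesystem; since the pipeline cd-s into /mnt/stat/data first, the year/month/day directories are created on the same ZFS volume, and each move is just a cheap rename rather than a copy.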

After running the script, the directory contained only a few folders, but running ls in it still took a few seconds. It's interesting to me that the directory remained affected in this way; I guess ZFS didn't clean up its now-empty structures.