Monitoring File Changes in PHP


I’ve been working on a small BDD test framework, and I found myself wanting to implement a --watch option. When the flag is set, the test runner would watch the current directory and re-run all specs when a change occurs. Though PHP offers the inotify extension, I wanted this option to be cross-platform and work without a PECL extension. So I decided to write my own implementation.

Monitoring directory changes can be done using stat() or filemtime() to get the last modification time of the directory, which is updated when files are renamed, added or deleted within it. By polling every second, for example, we can see whether such a change has been made to the directory tree.
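Here's a minimal sketch of what that polling loop might look like; the current working directory and the echo are placeholders standing in for the watched path and the spec re-run:

<?php

// Minimal polling sketch: a directory's mtime changes when entries directly
// inside it are added, removed or renamed, so re-check it every second
$directory = getcwd();
$lastMtime = filemtime($directory);

while (true) {
    sleep(1);
    clearstatcache();

    $mtime = filemtime($directory);

    if ($mtime !== $lastMtime) {
        $lastMtime = $mtime;
        echo "Change detected in $directory\n";
        // a --watch runner would re-run the specs here
    }
}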

However, it doesn’t cover modifications to the contents of individual files. For that purpose, I have to track every file in the directory and its sub-directories. I thought of two solutions for doing this, each with their pros and cons. The first solution would be to call stat() on the file to get both its modification time and size. I could then store these values, and simply check against them on subsequent polls. If either the modification time or size differed, I would re-run the tests.
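In rough terms, that first approach could be sketched as follows, with glob() standing in for the recursive .php search used later in the benchmark:

<?php

// Sketch of the stat()-based approach: snapshot mtime and size for every
// tracked file, then compare snapshots on each poll
function statSnapshot(array $paths)
{
    $snapshot = [];

    foreach ($paths as $path) {
        $stat = stat($path);
        $snapshot[$path] = [$stat['mtime'], $stat['size']];
    }

    return $snapshot;
}

// For illustration only; the real runner would recurse sub-directories
$paths = glob(getcwd() . '/*.php');
$previous = statSnapshot($paths);

while (true) {
    sleep(1);
    clearstatcache();

    $current = statSnapshot($paths);

    if ($current !== $previous) {
        $previous = $current;
        echo "Change detected, re-running specs\n";
    }
}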

Though efficient enough, this method could miss a modification that occurs within the filesystem's timestamp resolution (one second on ext3, two seconds on FAT) without changing the file size. Though it might be rare, it would be an annoyance when performing quick edits due to typos.

A slower, more effective alternative would be to calculate and store the sha1 digest of the contents of each file. This way we’re no longer relying on the time resolution of the file system, but rather each individual bit of the file. This would definitely be slower, but by how much?
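The hash-based check would look nearly identical, swapping the stat() call for sha1_file(); the surrounding polling loop stays the same:

<?php

// Sketch of the sha1-based approach: store a digest of each file's contents
// and compare it on the next poll, independent of timestamp resolution
function sha1Snapshot(array $paths)
{
    $snapshot = [];

    foreach ($paths as $path) {
        $snapshot[$path] = sha1_file($path);
    }

    return $snapshot;
}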

I hoped to answer that with a benchmark. I decided to write a script that would test both methods against a directory. It measures the time taken to recurse the directory and get the modified time and size of .php files, as well as calculate their sha1 digest. The test input consisted of the contents of Symfony_Standard_Vendors_2.3.6.tgz – a copy of Symfony 2.3.6 with necessary vendors already installed. The folder is 22.8MB in size, and contains 7,169 files. The test runner would only be tracking .php files, so the number of tracked files would be smaller:

danielstjules:~/Desktop/Symfony
$ find . -type f -name "*.php" | wc -l
  3836

Among those php files, there's a reported 108,448 lines coming in at over 13 MB:

danielstjules:~/Desktop/Symfony
$ find . -name '*.php' | xargs wc -l
  ...
  108448 total

danielstjules:~/Desktop/Symfony
$ find . -name '*.php' -exec ls -l {} \; | awk '{s+=$5} END {print s}'
13870472

And now for the code:

benchmark.php
<?php

/**
 * Benchmarks the running time of the supplied closure and outputs its average,
 * min and max over the given number of iterations.
 *
 * @param string   $title      Title of the benchmark to print
 * @param int      $iterations Number of times to run the anonymous function
 * @param callable $callable   The function to measure
 */
function benchmark($title, $iterations, $callable)
{
    $runningTimes = [];

    for ($i = 0; $i < $iterations; $i++) {
        $startTime = microtime(true);

        $callable();

        $endTime = microtime(true);
        $time = $endTime - $startTime;
        $runningTimes[] = $time;
    }

    $avg = array_sum($runningTimes) / count($runningTimes);
    $min = min($runningTimes);
    $max = max($runningTimes);

    echo "\n$title\navg: $avg\nmin: $min\nmax: $max\n";
}

/**
 * If the given path is a file, the function is called with the path as its
 * argument. Otherwise, if the path is a directory, this function recurses over
 * all sub-folders, invoking the $callable for each file.
 *
 * @param string   $path     A valid path to a file or directory
 * @param callable $callable The function to invoke for each file found, with
 *                           its path as the argument
 */
function recurseCall($path, $callable)
{
    if (is_file($path)) {
        $callable($path);
        return;
    }

    $path = realpath($path);
    $dirIterator = new RecursiveDirectoryIterator($path);
    $iterator = new RecursiveIteratorIterator($dirIterator);

    $files = new RegexIterator($iterator, '/^.+\.php$/i',
        RecursiveRegexIterator::GET_MATCH);

    foreach ($files as $file) {
        $filePath = $file[0];

        if (is_file($filePath)) {
            $callable($filePath);
        }
    }
}

$directory = dirname(__FILE__) . '/Symfony';

// Using stat size and mtime
benchmark('stat', 1000, function() use ($directory) {
    clearstatcache();
    recurseCall($directory, function($path) {
        $stat = stat($path);
        // $stat['mtime'], $stat['size'];
    });
});

// Using sha1_file
benchmark('sha1_file', 1000, function() use ($directory) {
    recurseCall($directory, function($path) {
        $digest = sha1_file($path);
    });
});

The script above calculates the average, minimum and maximum time spent iterating over all 3,836 .php files with each method, over a sample of 1,000 iterations. The results, on a 1.7GHz dual-core i7 MacBook Air with an SSD:

$ php -f benchmark.php

stat
avg: 0.094770931720734
min: 0.084547996520996
max: 0.10672402381897

sha1_file
avg: 0.19758544158936
min: 0.18978786468506
max: 0.22717094421387

Referring to the output above, using sha1_file is ~108% slower than using clearstatcache() followed by calls to stat(). For the intended use, I might be willing to sacrifice that bit of performance for greater accuracy. And though both approaches are much less efficient than using inotify to listen for events, they mean simpler installation and use.

If anyone has other ideas, I’d be grateful if you could share them!

