| Path: | fsdb.txt |
| Last Update: | Sun Oct 22 12:47:21 PDT 2006 |
FSDB is a file system data base. FSDB provides a thread-safe, process-safe Database class which uses the native file system as its back end and allows multiple file formats and serialization methods. Users access objects in terms of their paths relative to the base directory of the database. It’s very lightweight: the state of a Database is essentially just a path string, and the code is small (under 1K lines, all Ruby).
FSDB stores bundles of ruby objects at nodes in the file system. Each bundle is saved and restored as a whole, so internal references persist as usual. These bundles are the atoms of transactions. References between bundles are handled through path strings. The format of each bundle on disk can vary; format classes for plain text strings, marshalled data, and yaml data are included, but FSDB can easily be extended to recognize other formats, both binary and text. FSDB treats directories as collections and provides directory iterator methods.
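The claim that internal references persist within a bundle can be illustrated with Ruby's standard Marshal module, which is the basis of the marshalled format mentioned above. This is a minimal sketch of the serialization behavior only, not FSDB code; the variable names are made up for illustration.

```ruby
# A bundle whose two slots refer to the same String object.
shared = "shared title"
bundle = [shared, shared]

# Save and restore the bundle as a whole, as a store would.
restored = Marshal.load(Marshal.dump(bundle))

# The two slots of the restored bundle are still one object,
# so a destructive change through one slot is visible through the other.
restored[0] << " (edited)"
puts restored[0].equal?(restored[1])
puts restored[1]
```

Had the two slots been stored in separate bundles, this sharing would not survive, which is why references between bundles go through path strings.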
FSDB has been tested on a variety of platforms and ruby versions, and is not known to have any problems. (On WindowsME/98/95, multiple processes can access a database unsafely, because flock() is not available on the platform.) See the Testing section for details.
FSDB does not yet have any indexing or querying mechanisms, and is probably missing many other useful database features, so it is not a general replacement for RDBs or OODBs. However, if you are looking for a lightweight, concurrent object store with reasonable performance and better granularity than PStore, in pure Ruby, with a Ruby license, take a look at FSDB. Also, if you are looking for an easy way of making an existing file tree look like a database, especially if it has heterogeneous file formats, FSDB might be useful.
ruby install.rb config
ruby install.rb setup
ruby install.rb install
require 'fsdb'
db = FSDB::Database.new('/tmp/my-data')
db['recent-movies/myself'] = ["LOTR II", "Austin Powers"]
puts db['recent-movies/myself'][0] # ==> "LOTR II"
db.edit 'recent-movies/myself' do |list|
  list << "A la recherche du temps perdu"
end
Keys in the database are path strings, which are simply strings in the usual forward-slash delimited format, relative to the database’s directory. There are some points to be aware of when using them to refer to database objects.
The root dir of the database is simply /, its child directories are of the form foo/ and so on. The leading and trailing slashes are both optional.
| foo.obj: | Marshalled data (the default for unrecognized extension) |
| foo.txt: | String |
| foo/: | Directory (the contents is presented to the caller as a list of file and subdirectory paths that can be used in browse, edit, etc.) |
| foo.yml: | YAML data—see examples/yaml.rb |
New formats, which correlate filename pattern with serialization behavior, can be defined and plugged in to databases. Each format has its own rules for matching patterns in the file name and recognizing the file. Patterns can be anything with a #=== method (such as a regex). See lib/fsdb/formats.rb for examples of defining formats. For examples of associating formats with patterns, see examples/formats.rb.
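To make the role of #=== concrete, here is a rough sketch of how a table of patterns could select a serialization for a path. It does not use FSDB's format classes: FormatSketch, FORMATS, DEFAULT_FORMAT and format_for are hypothetical names invented for this example, and the real definitions are in the files named above.

```ruby
require 'yaml'

# Hypothetical format record: a pattern plus dump/load procs.
# Any object that responds to #=== can serve as the pattern.
FormatSketch = Struct.new(:name, :pattern, :dump, :load)

FORMATS = [
  FormatSketch.new(:text, /\.txt\z/, ->(obj) { obj.to_s }, ->(str) { str }),
  FormatSketch.new(:yaml, /\.yml\z/, ->(obj) { YAML.dump(obj) }, ->(str) { YAML.load(str) })
]

# Fallback for unrecognized extensions (marshalled data).
DEFAULT_FORMAT = FormatSketch.new(
  :marshal, //, ->(obj) { Marshal.dump(obj) }, ->(str) { Marshal.load(str) }
)

# Return the first format whose pattern matches the path.
def format_for(path)
  FORMATS.find { |fmt| fmt.pattern === path } || DEFAULT_FORMAT
end

puts format_for('notes/todo.txt').name
puts format_for('config/site.yml').name
puts format_for('foo.obj').name
```

Because only #=== is required, a pattern could just as well be a Proc or a custom matcher object.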
Different notations for the same path, such as
/foo/bar
foo/bar
foo//bar
foo/../foo/bar
all work correctly, as do paths that denote hard or soft links, if supported on the platform.
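The equivalence of these notations can be sketched with the standard File methods. The resolve helper below is hypothetical and is not how FSDB is known to normalize paths; it only shows that the four spellings name one file under a base directory. It is purely lexical, so it does not follow links, and it does not guard against paths that climb above the base directory.

```ruby
# Hypothetical helper: resolve a database path against the base directory.
# File.join absorbs a leading slash; File.expand_path collapses "//" and "..".
def resolve(base, path)
  File.expand_path(File.join(base, path))
end

base  = '/tmp/my-data'
paths = ['/foo/bar', 'foo/bar', 'foo//bar', 'foo/../foo/bar']

resolved = paths.map { |path| resolve(base, path) }
puts resolved.uniq
```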
Links are subject to the same naming convention as normal files with regard to format identification: format is determined by the path within the database used to access the object. Using a different name for a link can be useful if you need to access the file using two different formats (e.g., plain text via ‘foo.txt’ and tabular data via ‘foo.table’ or whatever).
db = FSDB::Database.new('/tmp')
db['foo/bar'] = 1
foo = db.subdb('foo')
foo['bar'] # ==> 1
The ..fsdb.meta.<filename> file holds a version number for <filename>, which is used along with mtime to check for changes (mtime usually has a precision of only 1 second). In the future, the file may also be used to hold other metadata. (The meta file is only created when a file is written to and does not need to be created in advance when using existing files as an FSDB.)
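Why a version number helps alongside mtime can be seen in a small sketch. Stamp and stale? are hypothetical names, and the comparison shown is an assumption about how such a check could work, not a description of FSDB's internals.

```ruby
# Hypothetical stamp remembered when an object was last loaded.
Stamp = Struct.new(:mtime, :version)

# A loaded copy is out of date if either the mtime or the version differs.
def stale?(cached, on_disk)
  cached.mtime != on_disk.mtime || cached.version != on_disk.version
end

t = Time.at(1_161_546_441)      # an mtime with whole-second precision

cached      = Stamp.new(t, 7)
untouched   = Stamp.new(t, 7)   # the file has not been written since
same_second = Stamp.new(t, 8)   # the file was rewritten within the same second

puts stale?(cached, untouched)
puts stale?(cached, same_second)
```

With mtime alone, the rewrite within the same second would go unnoticed; the version number catches it.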
FSDB transactions are thread-safe and process-safe. They can be nested for larger-grained transactions; it is the user’s responsibility to avoid deadlock.
FSDB is ACID (atomic/consistent/isolated/durable) to the extent that the underlying file system is. For instance, when an object that has been modified in a transaction is written to the file system, nothing persistent is changed until the final system call to write the data to the OS’s buffers. If there is an interruption (e.g., a power failure) while the OS flushes those buffers to disk, data will not be consistent. If this bothers you, you may want to use a journaling file system. FSDB does not need to do its own journaling because of the availability of good journaling file systems.
There are two kinds of transactions:
Note that a sequence of such transactions is not itself a transaction, and can be affected by other processes and threads.
db['foo/bar'] = [1,2,3]
db['foo/bar'] += [4] # This is actually 2 transactions
db['foo/bar'][-1]
It is possible for the result of these transactions to be 4. But if other threads or processes are scheduled during this code fragment, the result could be a completely different value, or the code could raise a method-missing exception (NoMethodError) because the object at the path has been replaced with one that does not have the + method or the [ ] method. The four operations are atomic by themselves, but the sequence is not.
Note that changes to a database object using this kind of transaction cannot be made using destructive methods (such as <<) but only by assignments of the form db[<path>] = <data>. Note that += and similar "assignment operators" can be used but are not atomic, because
db[<path>] += 1
is really
db[<path>] = db[<path>] + 1
So another thread or process could change the value stored at path while the addition is happening.
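The expansion of += into a fetch followed by an insert is a property of Ruby itself, and it can be observed with any object that defines [] and []=. The LoggingStore class below is a hypothetical in-memory stand-in written for this illustration; it is not part of FSDB.

```ruby
# Hypothetical stand-in that records each call to [] and []=.
class LoggingStore
  attr_reader :log

  def initialize
    @data = {}
    @log  = []
  end

  def [](path)
    @log << [:fetch, path]
    @data[path]
  end

  def []=(path, value)
    @log << [:insert, path]
    @data[path] = value
  end
end

store = LoggingStore.new
store['counter'] = 1
store['counter'] += 1   # Ruby calls [] first, then []= with the sum

store.log.each { |operation, path| puts "#{operation} #{path}" }
```

The log shows one insert for the assignment, then a separate fetch and insert for the += line. Anything that runs between those last two calls can change the stored value unnoticed.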
path = 'foo/bar'
db[path] = [1,2,3]
db.edit path do |bar|
  bar << 4   # modify in place; bar += [4] would only rebind the block variable
  bar[-1]
end
This guarantees that, if the object at the path is still [1, 2, 3] at the time of the edit call, the value returned by the transaction will be 4.
Simply put, edit allows exclusive write access to the object at the path for the duration of the block. Other threads or processes that use FSDB methods to read or write the object will be blocked for the duration of the transaction. There is also browse, which allows read access shared by any number of threads and processes, and replace, which also allows exclusive write access like edit. The differences between replace and edit are:
You can delete an object from the database (and the file system) with delete, which returns the object. Also, delete can take a block, which can examine the object and abort the transaction to prevent deletion. (The delete transaction has the same exclusion semantics as edit and replace.)
The fetch and insert methods are aliased with [ ] and [ ]=.
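The difference between the shared access of browse and the exclusive access of edit corresponds to the two modes of flock(), which the introduction names as the basis of process safety. The helpers below are a hypothetical sketch on top of File#flock, not FSDB's implementation; they cover only cooperating processes and do nothing about threads within one process.

```ruby
require 'tmpdir'

# Hypothetical helper: shared (read) lock for the duration of the block.
def with_shared_lock(path)
  File.open(path, 'r') do |file|
    file.flock(File::LOCK_SH)   # any number of readers may hold this
    yield file
  end                           # closing the file releases the lock
end

# Hypothetical helper: exclusive (write) lock for the duration of the block.
def with_exclusive_lock(path)
  File.open(path, 'r+') do |file|
    file.flock(File::LOCK_EX)   # waits until no other lock is held
    yield file
  end
end

results = {}
Dir.mktmpdir do |dir|
  path = File.join(dir, 'counter.txt')
  File.write(path, '1')

  # Read, modify and write back while holding the exclusive lock.
  with_exclusive_lock(path) do |file|
    value = file.read.to_i + 1
    file.rewind
    file.truncate(0)
    file.write(value.to_s)
    file.flush
  end

  with_shared_lock(path) do |file|
    results[:value] = file.read
    # A second reader can take a shared lock while the first is still held.
    File.open(path, 'r') do |other|
      results[:second_reader] = other.flock(File::LOCK_SH | File::LOCK_NB)
    end
  end
end

puts results[:value]
```

These locks are advisory: they only protect against other code that takes the same locks, which matches the remark above that blocking applies to threads and processes that use FSDB methods.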
When the object at the path specified in a transaction does not exist in the file system, the different transaction methods behave differently:
Transactions can be nested. However, the order in which objects are locked can lead to deadlock if, for example, the nesting is cyclic, or two threads or processes request the same set of locks in a different order. One approach is to only request nested locks on paths in the lexicographic order of the path strings: "foo/bar", "foo/baz", …
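The lexicographic ordering approach can be sketched with plain Mutex objects standing in for per-path locks. LOCKS, LOCKS_GUARD and with_ordered_locks are hypothetical names for this illustration; FSDB itself leaves the ordering to the caller.

```ruby
# One hypothetical lock per path, created on first use.
LOCKS       = Hash.new { |hash, path| hash[path] = Mutex.new }
LOCKS_GUARD = Mutex.new   # protects the LOCKS table itself

# Acquire a lock for every path, always in sorted path order, then
# call the block with the list of paths in the order they were locked.
def with_ordered_locks(paths, acquired = [], &block)
  remaining = paths.uniq.sort
  return block.call(acquired) if remaining.empty?

  first, *rest = remaining
  mutex = LOCKS_GUARD.synchronize { LOCKS[first] }
  mutex.synchronize do
    with_ordered_locks(rest, acquired + [first], &block)
  end
end

# Callers may list the paths in any order; locking order is the same.
order_a = with_ordered_locks(%w[foo/baz foo/bar]) { |acquired| acquired }
order_b = with_ordered_locks(%w[foo/bar foo/baz]) { |acquired| acquired }

p order_a
p order_b
```

Since every caller takes "foo/bar" before "foo/baz", no two callers can each hold one lock while waiting for the other.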
A transaction can be aborted with Database#abort and Database.abort, after which the state of the object in the database remains as before the transaction. An exception that is raised but not handled within a transaction also aborts the transaction.
Note that there is no locking on directories, but you can designate a lock file for each dir and effectively have multiple-reader, single-writer (advisory) locking on dirs. Just make sure you enclose your dir operation in a transaction on the lock object, and always access these objects using this technique.
db.browse('lock for dir') do
db['dir/x'] = 1
end
FSDB has been tested on the following platforms and file systems:
- Linux/x86 (single and dual cpu, ext3fs and reiserfs)
- Solaris/sparc (dual and quad cpu, nfs and ufs)
- QNX 6.2.1 (dual PIII)
- Windows 2000 (dual cpu, NTFS)
- Windows ME (single cpu, FAT32)
FSDB is currently tested with ruby-1.9.0 and ruby-1.8.4.
On Windows, both the mswin32 and mingw32 builds of ruby have been used with FSDB. It has never been tested with cygwin or bccwin.
The tests include unit and stress tests. Unit tests isolate individual features of the library. The stress test (called test/test-concurrency.rb) has many parameters, but typically involves several processes, each with several threads, doing millions of transactions on a small set of objects.
The only known testing failure is on Windows ME (and presumably 95 and 98). The stress test succeeds with one process and multiple threads. It succeeds with multiple processes each with one thread. However, with two processes each with two threads, the test usually deadlocks very quickly.
FSDB is not very fast. It’s useful more for its safety, flexibility, and ease of use.
processes  threads  objects  transactions per cpu second
---------------------------------------------------------
1          1        10       965
1          10       10       165
10         1        10       684
10         10       10       122
10         10       100      100
10         10       10000    92
These results are not representative of typical applications, because the test was designed to stress the database and expose stability problems, not to imitate typical use of database-stored objects. See bench/bench.rb for benchmarks.
The performance hit of fetch is of course greater with larger objects, and with objects that are loaded by a more complex procedure, such as Marshal.load.
You can think of fetch as a "deep copy" of the object. If you call it twice, you get different copies that do not share any parts. Or you can think of it as File.read—it gives you an instantaneous snapshot of the file, but does not give you a transaction "window" in which no other thread or process can modify the object.
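The "deep copy" view of fetch can be imitated with a store that keeps only serialized bytes. SnapshotStore is a hypothetical in-memory stand-in built on Marshal; it ignores files, locking and caching, and is meant only to show why two fetches do not share any parts.

```ruby
# Hypothetical stand-in: keeps serialized bytes, never live objects.
class SnapshotStore
  def initialize
    @bytes = {}
  end

  def insert(path, object)
    @bytes[path] = Marshal.dump(object)
  end

  # Every call rebuilds a fresh object graph from the stored bytes.
  def fetch(path)
    Marshal.load(@bytes.fetch(path))
  end
end

store = SnapshotStore.new
store.insert('recent-movies/myself', ["LOTR II", "Austin Powers"])

first  = store.fetch('recent-movies/myself')
second = store.fetch('recent-movies/myself')

# A destructive change to one copy affects neither the other copy
# nor what is stored.
first << "changed only in this copy"

p second
p store.fetch('recent-movies/myself')
```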
There is no analogous concern with insert and its alias #[]=. These methods always write to the file system, but they also leave the object in the cache.
home['.forward'] += ["nobody@nowhere.net"]
etc.edit('passwd') { |passwd| passwd['fred'].shell = '/bin/zsh' }
window.setIcon(icons['apps/editor.png'])
I’ve heard from a couple of people writing applications that use FSDB. One app is:
db['foo/bar.obj'] = "some string"
referrer = { :my_bar => FSDB::Reference.new('../foo/bar.obj') }
db['x/y.yml'] = referrer
p db['x/y.yml'][:my_bar] # ==> "some string"
Or, more like DRbUndumped:
str = "some string"
str.extend FSDB::Undumped
db['foo/bar.obj'] = str
referrer = { :my_bar => str }
db['x/y.yml'] = referrer
p db['x/y.yml'][:my_bar] # ==> "some string"
Extending with FSDB::Undumped will have to insert state in the object that remembers the db path at which it is stored (‘foo/bar.obj’ in this case).
for path, object in hash ... end
end
then:
irb> irb db
irb#1> irb_browse path
...
... # got a read lock for this session
...
irb#1> ^D
irb>
one problem: irb defines singleton methods, so can’t dump (in edit)
maybe we can extend the class of the object by some module instead…
.que : use IO#read_object, IO#write_object (at end of file)
to implement a persistent queue
fifo, named socket, device, ...
fsdb 0.5
The current version of this software can be found at redshift.sourceforge.net/fsdb.
This software is distributed under the Ruby license. See www.ruby-lang.org.
Joel VanderWerf, vjoel@users.sourceforge.net Copyright © 2003-2006, Joel VanderWerf.