XtreemFS is available from the XtreemFS website (http://www.XtreemFS.org).
This document is © 2009-2011 by Björn Kolbeck, Jan Stender, Michael Berlin, Paul Seiferth, Felix Langner, NEC HPC Europe, Felix Hupfeld, Juan Gonzales. All rights reserved.
Summary of important changes in release 1.2.1:
This is a summary of the most important changes in release 1.2:
This is the very short version to help you set up a local installation of XtreemFS.
mount.xtreemfs localhost/myVolume ~/xtreemfs
You can also mount this volume on remote computers. First make sure that the ports 32636, 32638 and 32640 are open for incoming TCP connections. You must also specify a hostname that can be resolved by the remote machine! This hostname has to be used instead of localhost when mounting.
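For example, from a remote machine (the host name below is a placeholder for a name that resolves to the machine running the XtreemFS services):
$> mount.xtreemfs my-xtreemfs-host.example.com/myVolume ~/xtreemfs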
Since you decided to take a look at this user guide, you probably read or heard about XtreemFS and want to find out more. This chapter contains basic information about the characteristics and the architecture of XtreemFS.
Since version 1.0, XtreemFS supports read-only replication. A file may have multiple replicas, provided that it was explicitly made read-only before, which means that its content cannot be changed anymore. This kind of replication can be used to make write-once files available to many consumers, or to protect them from losses due to hardware failures. Besides complete replicas that are immediately synchronized after having been created, XtreemFS also supports partial replicas that are only filled with content on demand. They can e.g. be used to make large files accessible to many clients, of which only parts need to be accessed.
Authentication describes the process of verifying a user's or client's identity. By default, authentication in XtreemFS is based on local user names and depends on the trustworthiness of clients and networks. In case a more secure solution is needed, X.509 certificates can be used.
Authorization describes the process of checking user permissions to execute an operation. XtreemFS supports the standard UNIX permission model, which allows for assigning individual access rights to file owners, owning groups and other users.
Authentication and authorization are policy-based, which means that different models and mechanisms can be used to authenticate and authorize users. Besides, the policies are pluggable, i.e. they can be freely defined and easily extended.
XtreemFS uses unauthenticated and unencrypted TCP connections by default. To encrypt all network traffic, services and clients can establish SSL connections. However, using SSL requires that all users and services have valid X.509 certificates.
In contrast to block-based file systems, the management of available and used storage space is offloaded from the metadata server to the storage servers. Rather than inode lists with block addresses, file metadata contains lists of storage servers responsible for the objects, together with striping policies that define how to translate between byte offsets and object IDs. This implies that object sizes may vary from file to file.
The client connects these servers into a file system: it mounts one of the volumes of the MRC in a local directory and translates file system calls into RPCs sent to the respective servers.
The client is implemented as a FUSE user-level driver that runs as a normal process. FUSE itself is a kernel-userland hybrid that connects the user-land driver to Linux' Virtual File System (VFS) layer where file system drivers usually live.
This chapter describes how to install and set up the server side of an XtreemFS installation.
When installing XtreemFS server components, you can choose from two different installation sources: you can download one of the pre-packaged releases that we create for most Linux distributions or you can install directly from the source tarball.
Note that the source tarball contains the complete distribution of XtreemFS, which also includes client and tools. Currently, binary distributions of the server are only available for Linux.
For the pre-packaged release, you will need Sun Java JRE 1.6.0 or newer to be installed on the system.
When building XtreemFS directly from the source, you need a Sun Java JDK 1.6.0 or newer, Ant 1.6.5 or newer and gmake.
On RPM-based distributions (RedHat, Fedora, SuSE, Mandriva) you can install the package with
$> rpm -i xtreemfs-server-1.3.x.rpm xtreemfs-backend-1.3.x.rpm
For Debian-based distributions, please use the .deb package provided and install it with
$> dpkg -i xtreemfs-server-1.3.x.deb xtreemfs-backend-1.3.x.deb
To install the server components, the following package is required: jre >= 1.6.0 for RPM-based releases, java6-runtime for Debian-based releases. If you already have a different distribution of Java6 on your system, you can alternatively install the XtreemFS server packages as follows:
$> rpm -i --nodeps xtreemfs-server-1.3.x.rpm \ xtreemfs-backend-1.3.x.rpm
on RPM-based distributions,
$> dpkg -i --ignore-depends=java6-runtime \ xtreemfs-server-1.3.x.deb xtreemfs-backend-1.3.x.deb
on Debian-based distributions.
To ensure that your local Java6 installation is used, it is necessary to set the JAVA_HOME environment variable to your Java6 installation directory, e.g.
$> export JAVA_HOME=/usr/java6
Both RPM and Debian-based packages will install three init.d scripts to start up the services (xtreemfs-dir, xtreemfs-mrc, xtreemfs-osd). If you want the services to be started automatically when booting up the system, you can execute insserv <init.d script> (SuSE), chkconfig --add <init.d script> (Mandriva, RedHat) or update-rc.d <init.d script> defaults (Ubuntu, Debian).
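For example, on a Debian-based system, automatic startup of all three services could be enabled as follows:
$> update-rc.d xtreemfs-dir defaults
$> update-rc.d xtreemfs-mrc defaults
$> update-rc.d xtreemfs-osd defaults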
Extract the tarball with the sources. Change to the top level directory and execute
$> make server
This will build the XtreemFS server and Java-based tools. When done, execute
$> sudo make install-server
to install the server components. Finally, you will be asked to execute a post-installation script
$> sudo /etc/xos/xtreemfs/postinstall_setup.sh
to complete the installation.
After having installed the XtreemFS server components, it is advisable to configure the different services. This section describes the different configuration options.
XtreemFS services are configured via Java properties files that can be modified with a normal text editor. Default configuration files for a Directory Service, MRC and OSD are located in /etc/xos/xtreemfs/.
XtreemFS uses UUIDs (Universally Unique Identifiers) to be able to identify services and their associated state independently from the machine they are installed on. This implies that you cannot change the UUID of an MRC or OSD after it has been used for the first time!
The Directory Service resolves UUIDs to service endpoints, where each service endpoint consists of an IP address or hostname and port number. Each endpoint is associated with a netmask that indicates the subnet in which the mapping is valid. In theory, multiple endpoints can be assigned to a single UUID if endpoints are associated with different netmasks. However, it is currently only possible to assign a single endpoint to each UUID; the netmask must be ``*'', which means that the mapping is valid in all networks. Upon first start-up, OSDs and MRCs will auto-generate the mapping if it does not exist, by using the first available network device with a public address.
Changing the IP address, hostname or port is possible at any time. Due to the caching of UUIDs in all components, it can take some time until the new UUID mapping is used by all OSDs, MRCs and clients. The TTL (time-to-live) of a mapping defines how long an XtreemFS component is allowed to keep entries cached. The default value is 3600 seconds (1 hour). It should be set to shorter durations if services change their IP address frequently.
To create a globally unique UUID you can use tools like uuidgen. During installation, the post-install script will automatically create a UUID for each OSD and MRC if it does not have a UUID assigned.
Security: The automatic discovery is a potential security risk when used in untrusted environments as any user can start-up DIR services.
A statically configured DIR address and port can be used to disable DIR discovery in the OSD and MRC (see Sec. 3.2.5, dir_service). By default, the DIR responds to UDP broadcasts. To disable this feature, set discover = false in the DIR service config file.
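As an illustration, a static setup might combine the following settings; the host name is a placeholder, and the dir_service.host/dir_service.port property names are assumed to match the keys used in the default MRC/OSD configuration files (cf. Sec. 3.2.5):
# DIR configuration: do not respond to discovery broadcasts
discover = false

# MRC/OSD configuration: contact the DIR at a fixed address
dir_service.host = dir.example.com
dir_service.port = 32638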
To set the authentication provider, it is necessary to set the following property in the MRC configuration file:
authentication_provider = <classname>
By default, the following class names can be used:
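For instance, the default provider, which trusts the user and group IDs supplied by the client (cf. the authentication_provider parameter in Sec. 3.2.5), is configured as follows:
authentication_provider = org.xtreemfs.common.auth.NullAuthProvider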
In order to enable certificate-based authentication in an XtreemFS installation, services need to be equipped with X.509 certificates. Certificates are used to establish a mutual trust relationship among XtreemFS services and between the XtreemFS client and XtreemFS services.
Note that it is not possible to mix SSL-enabled and non-SSL services in an XtreemFS installation! If you only need authentication based on certificates without SSL, you can use the ``grid SSL'' mode. In this mode XtreemFS will only do an SSL handshake and fall back to plain TCP for communication. This mode is insecure (not encrypted and records are not signed) but just as fast as the non-SSL mode. If this mode is enabled, all client tools must be used with the pbrpcg:// scheme prefix.
Each XtreemFS service needs a certificate and a private key in order to be run. Once they have been created and signed, the credentials may need to be converted into the correct file format. XtreemFS services also need a trust store that contains all trusted Certification Authority certificates.
By default, certificates and credentials for XtreemFS services are stored in
/etc/xos/xtreemfs/truststore/certs
$> openssl pkcs12 -export -in ds.pem -inkey ds.key \ -out ds.p12 -name "DS"
$> openssl pkcs12 -export -in mrc.pem -inkey mrc.key \ -out mrc.p12 -name "MRC"
$> openssl pkcs12 -export -in osd.pem -inkey osd.key \ -out osd.p12 -name "OSD"
This will create three PKCS12 files (ds.p12, mrc.p12 and osd.p12), each containing the private key and certificate for the respective service. The passwords chosen when asked must be set as a property in the corresponding service configuration file.
The certificate (or multiple certificates) from your CA (or CAs) can be imported into a Java Keystore (JKS) using the Java keytool which comes with the Java JDK or JRE.
Execute the following steps for each CA certificate using the same keystore file.
$> keytool -import -alias rootca -keystore trusted.jks \ -trustcacerts -file ca-cert.pem
This will create a new Java Keystore trusted.jks with the CA certificate in the current working directory. The password chosen when asked must be set as a property in the service configuration files.
Note: If you get the following error
keytool error: java.lang.Exception: Input not an X.509 certificate
you should remove any text from the beginning of the certificate file (everything before the -----BEGIN CERTIFICATE----- line).
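Once the credentials and the trust store are in place, the SSL-related part of a service configuration file might look like the following sketch for an MRC. The password values are placeholders, and the exact names of the password properties (here assumed to be ssl.service_creds.pw and ssl.trusted_certs.pw) should be checked against the parameter reference in Sec. 3.2.5:
ssl.enabled = true
ssl.service_creds = /etc/xos/xtreemfs/truststore/certs/mrc.p12
ssl.service_creds.container = pkcs12
ssl.service_creds.pw = mrc_p12_passphrase
ssl.trusted_certs = /etc/xos/xtreemfs/truststore/certs/trusted.jks
ssl.trusted_certs.container = JKS
ssl.trusted_certs.pw = jks_passphrase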
Users can easily set up their own CA (certificate authority) and create and sign certificates using openssl for a test setup.
$> mkdir ca
$> openssl req -new -newkey rsa:1024 -nodes -out ca/ca.csr \ -keyout ca/ca.key
Enter something like XtreemFS-DEMO-CA as the common name (or something else, but make sure the name is different from the server and client name!).
$> openssl x509 -trustout -signkey ca/ca.key -days 365 -req \ -in ca/ca.csr -out ca/ca.pem
$> echo "02" > ca/ca.srl
$> openssl req -new -newkey rsa:1024 -nodes \ -out <service>.req \ -keyout <service>.key
$> openssl x509 -CA ca/ca.pem -CAkey ca/ca.key \ -CAserial ca/ca.srl -req \ -in <service>.req \ -out <service>.pem -days 365
$> openssl pkcs12 -export -in <service>.pem -inkey <service>.key \ -out <service>.p12 -name "<service>"
$> mkdir -p /etc/xos/xtreemfs/truststore/certs
$> cp <service>.p12 /etc/xos/xtreemfs/truststore/certs
$> keytool -import -alias ca -keystore trusted.jks \ -trustcacerts -file ca/ca.pem
$> cp trusted.jks /etc/xos/xtreemfs/truststore/certs
Use
$> mkfs.xtreemfs --pkcs12-file-path=\ /etc/xos/xtreemfs/truststore/certs/client.p12 pbrpcs://localhost/test
for SSL-enabled servers, or
$> mkfs.xtreemfs --pkcs12-file-path=\ /etc/xos/xtreemfs/truststore/certs/client.p12 pbrpcg://localhost/test
for Grid-SSL-enabled servers.
Use
$> mount.xtreemfs --pkcs12-file-path=\ /etc/xos/xtreemfs/truststore/certs/client.p12 pbrpcs://localhost/test /mnt
for SSL-enabled servers, or
$> mount.xtreemfs --pkcs12-file-path=\ /etc/xos/xtreemfs/truststore/certs/client.p12 pbrpcg://localhost/test /mnt
for Grid-SSL-enabled servers.
All configuration parameters that may be used to define the behavior of the different services are listed in this section. Unless marked as optional, a parameter has to occur (exactly once) in a configuration file. Parameters marked as experimental belong to the DIR and MRC replication feature, which is currently under development. It is not recommended to change these options if you want to use XtreemFS in production.
Services | DIR, MRC, OSD |
Values | String |
Default | |
Description | Defines the admin password that must be sent to authorize requests like volume creation, deletion or shutdown. The same password is also used to access the HTTP status page of the service (user name is admin). |
Services | MRC |
Values | Java class name |
Default | org.xtreemfs.common.auth.NullAuthProvider |
Description | Defines the Authentication Provider to use to retrieve the user identity (user ID and group IDs). See Sec. 3.2.3 for details. |
Services | DIR, MRC |
Values | absolute file system path to a directory |
Default | DIR: /var/lib/xtreemfs/dir/database |
MRC: /var/lib/xtreemfs/mrc/database | |
Description | The directory in which the Directory Service or MRC will store their databases. This directory should never be on the same partition as any OSD data, if both services reside on the same machine. Otherwise, deadlocks may occur if the partition runs out of free disk space. |
Services | DIR, MRC |
Values | a file name |
Default | DIR: config.db |
MRC: config.db | |
Description | Name for the database configuration file. |
Services | DIR, MRC |
Values | a positive integer value |
Default | DIR: 300 |
MRC: 300 | |
Description | The number of seconds between two checks of the disk log size for automatic checkpointing. Set this value to 0 to disable automatic checkpointing. |
Services | DIR, MRC |
Values | true or false |
Default | DIR: false |
MRC: false | |
Description | Flag that determines whether database content shall be compressed or not. |
Services | DIR, MRC |
Values | 0, 1, 2, 3, 4, 5, 6, 7 |
Default | DIR: 4 |
MRC: 4 | |
Description | This is the debug level for BabuDB only. The debug level determines the amount and detail of information written to logfiles. Any debug level includes log messages from lower debug levels. The following log levels exist:
|
Services | DIR, MRC |
Values | absolute file system path |
Default | DIR: /var/lib/xtreemfs/dir/db-log |
MRC: /var/lib/xtreemfs/mrc/db-log | |
Description | The directory the MRC uses to store database logs. This directory should never be on the same partition as any OSD data, if both services reside on the same machine. Otherwise, deadlocks may occur if the partition runs out of free disk space. |
Services | DIR, MRC |
Values | a positive integer value |
Default | DIR: 16777216 |
MRC: 16777216 | |
Description | If automatic checkpointing is enabled, a checkpoint is created when the disk logfile exceeds maxLogfileSize bytes. The value should be reasonably large to keep the checkpointing rate low. However, it should not be too large, as a large disk log increases the recovery time after a crash. |
Services | DIR, MRC |
Values | a positive integer value |
Default | DIR: 200 |
MRC: 0 | |
Description | The BabuDB disk logger can batch multiple operations into a single write+fsync to increase throughput. This only works if operations are executed in parallel by the worker threads; if you work on a single database, it becomes less efficient. To circumvent this problem, BabuDB offers a pseudo-sync mode which is similar to the PostgreSQL write-ahead log (WAL). If pseudoSyncWait is set to a value larger than 0, this pseudo-sync mode is enabled. In this mode, insert operations are acknowledged as soon as they have been executed on the in-memory database index. The disk logger will execute a batch write of up to 500 operations followed by a single sync (see syncMode) every pseudoSyncWait ms. This mode is considerably faster than synchronous writes, but you can lose data in case of a crash. In contrast to ASYNC mode, the data loss is limited to the operations executed in the last pseudoSyncWait ms. |
Services | DIR, MRC |
Values | ASYNC, SYNC_WRITE_METADATA, SYNC_WRITE, |
FDATASYNC or FSYNC | |
Default | DIR: FSYNC |
MRC: ASYNC | |
Description | The sync mode influences how operations are committed to the disk log before the operation is acknowledged to the caller.
|
Services | DIR, MRC |
Values | a positive integer value |
Default | DIR: 250 |
MRC: 250 | |
Description | If set to a value larger than 0, this is the maximum number of requests which can be in a worker's queue. This value should be used if you have pseudo-synchronous mode enabled to ensure that your queues don't grow until you get an out of memory exception. Can be set to 0 if pseudo-sync mode is disabled. |
Services | DIR, MRC |
Values | a positive integer value |
Default | DIR: 0 |
MRC: 0 | |
Description | The number of worker threads to be used for database operations. As BabuDB does not use locking, each database is handled by only one worker thread. If there are more databases than worker threads, the databases are distributed onto the available threads. The number of threads should be set to a value smaller than the number of available cores to reduce overhead through context switches. You can also set the number of worker threads to 0. This will considerably reduce latency, but may also decrease throughput on a multi-core system with more than one database. |
Services | MRC, OSD |
Values | String |
Default | |
Description | Defines a shared secret between the MRC and all OSDs. The secret is used by the MRC to sign capabilities, i.e. security tokens for data access at OSDs. In turn, an OSD uses the secret to verify that the capability has been issued by the MRC. |
Services | MRC |
Values | seconds |
Default | 600 |
Description | Defines the relative time span for which a capability is valid after having been issued. |
Services | OSD |
Values | true, false |
Default | false |
Description | If set to true, the OSD will calculate and store checksums for newly created objects. Each time a checksummed object is read, the checksum will be verified. |
Services | OSD |
Values | Adler32, CRC32 |
Default | Adler32 |
Description | Must be specified if checksums.enabled is enabled. This property defines the algorithm used to create OSD checksums. |
Services | DIR, MRC, OSD |
Values | 0, 1, 2, 3, 4, 5, 6, 7 |
Default | 6 |
Description | The debug level determines the amount and detail of information written to logfiles. Any debug level includes log messages from lower debug levels. The following log levels exist:
|
Services | DIR, MRC, OSD |
Values | all, lifecycle, net, auth, stage, proc, db, misc |
Default | all |
Description | Debug categories determine the domains for which log messages will be printed. By default, there are no domain restrictions, i.e. log messages form all domains will be included in the log. The following categories can be selected:
|
Services | MRC, OSD |
Values | hostname or IP address |
Default | localhost |
Description | Specifies the hostname or IP address of the directory service (DIR) at which the MRC or OSD should register. The MRC also uses this Directory Service to find OSDs. If set to .autodiscover the service will use the automatic DIR discovery mechanism (see Sec. 3.2.2). (Note that the initial '.' is used to avoid ambiguities with hosts called ``autodiscover''.) |
Services | MRC, OSD |
Values | 1 .. 65535 |
Default | 32638 |
Description | Specifies the port on which the remote directory service is listening. Must be identical to the listen_port in your directory service configuration. |
Services | DIR |
Values | true, false |
Default | true |
Description | If set to true, the DIR will receive UDP broadcasts and advertise itself in response to XtreemFS components using the DIR automatic discovery mechanism. If set to false, the DIR will ignore all UDP traffic. For details see Sec. 3.2.2. |
Services | MRC |
Values | true, false |
Default | false |
Description | Enables support for FIFOs (named pipes) on the local machine for compatibility reasons. If set to false, any attempt to open a FIFO will be rejected with EIO. Even if set to true, FIFOs will not work across multiple mounts. |
Services | OSD |
Values | milliseconds |
Default | 1000 |
Description | Maximum clock drift between any two clocks in the system. If the actual drift between two server clocks exceeds this value, read-write replication may lead to inconsistent replicas. Since servers automatically synchronize their clocks with the clock on the DIR, however, the default 1000ms should be enough in most cases. |
Services | OSD |
Values | milliseconds |
Default | 15000 |
Description | Duration of a lease in milliseconds. For read-write-replicated files, the lease timeout specifies the validity time span of a master lease. Shorter lease timeouts guarantee a shorter fail-over period in the event of a server crash, which however comes at the cost of an increased rate of lease negotiations for each open file. The lease timeout should be set to a value at least three times flease.message_to_ms. |
Services | OSD |
Values | milliseconds |
Default | 500 |
Description | Time to wait for responses from other OSDs when negotiating leases for replicated files. This value should be larger than the maximum message round-trip time via TCP between any pair of OSDs. |
Services | OSD |
Values | 1..1000 |
Default | 3 |
Description | Number of times to retry acquiring a lease for a replicated file before an IO error is sent to the client. |
Services | DIR, MRC, OSD |
Values | String |
Default | |
Description | Specifies the geographic coordinates which are registered with the directory service. Used e.g. by the web console. |
Services | MRC, OSD |
Values | String |
Default | |
Description | If specified, it defines the host name that is used to register the service at the directory service. Otherwise, the address defined in listen.address will be used, if set. If neither hostname nor listen.address is specified, the service itself will search for externally reachable network interfaces and advertise their addresses. |
Services | DIR, MRC, OSD |
Values | 1 .. 65535 |
Default | 30636 (MRC), 30638 (DIR), 30640 (OSD) |
Description | Specifies the listen port for the HTTP service that returns the status page. |
Services | OSD |
Values | true, false |
Default | false |
Description | When set to true, capability checks on the OSD are disabled. This property should only be set to true for debugging purposes, as it effectively overrides any security mechanisms on the system. |
Services | OSD |
Values | IP address |
Default | |
Description | If specified, it defines the interface to listen on. If not specified, the service will listen on all interfaces (any). |
Services | DIR, MRC, OSD |
Values | 1 .. 65535 |
Default | DIR: 32638, |
MRC: 32636, | |
OSD: 32640 | |
Description | The port to listen on for incoming connections (TCP). The OSD uses the specified port for both TCP and UDP. Please make sure to configure your firewall to allow incoming TCP traffic (plus UDP traffic, in case of an OSD) on the specified port. |
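For example, for an OSD using the default port, incoming traffic could be admitted with iptables roughly as follows (adapt this to the firewall tooling of your distribution):
$> iptables -A INPUT -p tcp --dport 32640 -j ACCEPT
$> iptables -A INPUT -p udp --dport 32640 -j ACCEPT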
Services | MRC, OSD |
Values | milliseconds |
Default | 50 |
Description | Reading the system clock is a slow operation on some systems (e.g. Linux) as it is a system call. To increase performance, XtreemFS services use a local variable which is only updated every local_clock_renewal milliseconds. |
Services | DIR |
Values | true, false |
Default | false |
Description | Enables the built-in monitoring tool in the directory service. If enabled, the DIR will send alerts via email if services have crashed (i.e. no longer send heartbeat messages). No alerts will be sent for services which have signed off at the DIR. To enable monitoring, you also need to configure monitoring.email.receiver and monitoring.email.program. In addition, you may want to change the values for monitoring.email.sender, monitoring.max_warnings, monitoring.service_timeout_s. |
Services | DIR |
Values | path |
Default | /usr/sbin/sendmail |
Description | Location of the sendmail binary to be used for sending alert mails. See monitoring parameters. |
Services | DIR |
Values | email address |
Default | - |
Description | Email address of recipient of alert emails. See monitoring parameters. |
Services | DIR |
Values | email address |
Default | ``XtreemFS DIR service <dir@localhost>'' |
Description | Email address and sender name to use for sending alert mails. See monitoring parameters. |
Services | DIR |
Values | 0..N |
Default | 1 |
Description | Number of alert mails to send for a single service which has crashed/disconnected. Each alert mail contains a summary of all crashed/disconnected services. See monitoring parameters. |
Services | DIR |
Values | 0..N seconds |
Default | 300 |
Description | Time to wait for a heartbeat message before sending an alert email. See monitoring parameters. |
Services | MRC |
Values | true, false |
Default | true |
Description | The POSIX standard defines that the atime (timestamp of last file access) is updated each time a file is opened, even for read. This means that there is a write to the database and hard disk on the MRC each time a file is read. To reduce the load, many file systems (e.g. ext3) including XtreemFS can be configured to skip those updates for performance. It is strongly suggested to disable atime updates by setting this parameter to true. |
Services | OSD |
Values | absolute file system path to a directory |
Default | /var/lib/xtreemfs/osd/ |
Description | The directory in which the OSD stores the objects. This directory should never be on the same partition as any DIR or MRC database, if both services reside on the same machine. Otherwise, deadlocks may occur if the partition runs out of free disk space! |
Services | MRC |
Values | seconds |
Default | 300 |
Description | The MRC regularly asks the directory service for suitable OSDs to store files on (see OSD Selection Policy, Sec. 7.3). This parameter defines the interval between two updates of the list of suitable OSDs. |
Services | MRC, OSD, DIR |
Values | absolute file system path to a directory |
Default | |
Description | Directory containing user-defined policies and modules. When starting a service, the policy directory will be searched for custom policies. For further details on pluggable policies, see chapter 7. |
Services | MRC, OSD |
Values | milliseconds |
Default | 30000 |
Description | MRCs and OSDs all synchronize their clocks with the directory service to ensure a loose clock synchronization of all services. This is required for leases to work correctly. This parameter defines the interval in milliseconds between time updates from the directory service. |
Services | MRC |
Values | true, false |
Default | false |
Description | If set to true, the MRC allows capabilities to be renewed after they timed out. This parameter should only be used for debugging purposes, as it effectively overrides the revocation of access rights on a file. |
Services | OSD |
Values | true, false |
Default | true |
Description | If set to true, the OSD will report its free space to the directory service. Otherwise, it will report zero, which will cause the OSD not to be used by the OSD Selection Policies (see Sec. 7.3). |
Services | OSD |
Values | size in bytes |
Default | -1 |
Description | The send buffer size in bytes for sockets. -1 indicates that the default value (typically 128k) is used. |
Services | OSD |
Values | size in bytes |
Default | -1 |
Description | The receive buffer size in bytes for sockets. -1 indicates that the default value (typically 128k) is used. |
Services | DIR, MRC, OSD |
Values | true, false |
Default | false |
Description | If set to true, the service will use SSL to authenticate and encrypt connections. The service will not accept non-SSL connections if ssl.enabled is set to true. |
Services | DIR, MRC, OSD |
Values | true, false |
Default | false |
Description | In this mode the services and client will only use SSL for mutual authentication with X.509 certificates (SSL handshake). After successful authentication the communication is via plain TCP. This means that there is no encryption and signing of records! This mode is comparable to HTTP connections with Digest authentication. It should be used when certificate based authentication is required but performance is more important than security, which is usually true in GRID installations. If this mode is enabled, all client tools must be used with the pbrpcg:// scheme prefix. |
Services | DIR, MRC, OSD |
Values | path to file |
Default | DIR: /etc/xos/xtreemfs/truststore/certs/ds.p12, |
MRC: /etc/xos/xtreemfs/truststore/certs/mrc.p12, | |
OSD: /etc/xos/xtreemfs/truststore/certs/osd.p12 | |
Description | Must be specified if ssl.enabled is enabled. Specifies the file containing the service credentials (X.509 certificate and private key). PKCS#12 or JKS format can be used; set ssl.service_creds.container accordingly. This file is used during the SSL handshake to authenticate the service. |
Services | DIR, MRC, OSD |
Values | pkcs12 or JKS |
Default | pkcs12 |
Description | Must be specified if ssl.enabled is enabled. Specifies the file format of the ssl.service_creds file. |
Services | DIR, MRC, OSD |
Values | String |
Default | |
Description | Must be specified if ssl.enabled is enabled. Specifies the password which protects the credentials file ssl.service_creds. |
Services | DIR, MRC, OSD |
Values | path to file |
Default | /etc/xos/xtreemfs/truststore/certs/xosrootca.jks |
Description | Must be specified if ssl.enabled is enabled. Specifies the file containing the trusted root certificates (e.g. CA certificates) used to authenticate clients. |
Services | DIR, MRC, OSD |
Values | pkcs12 or JKS |
Default | JKS |
Description | Must be specified if ssl.enabled is enabled. Specifies the file format of the ssl.trusted_certs file. |
Services | DIR, MRC, OSD |
Values | Java class name |
Default | |
Description | Sets a custom trust manager class for SSL connections. The trust manager is responsible for checking certificates when SSL connections are established. |
Services | DIR, MRC, OSD |
Values | String |
Default | |
Description | Must be specified if ssl.enabled is enabled. Specifies the password which protects the trusted certificates file ssl.trusted_certs. |
Services | MRC, OSD |
Values | 0..N seconds |
Default | 30 |
Description | Time to wait for the DIR to become available during start up of the MRC and OSD. If the DIR does not respond within this time the MRC or OSD will abort startup. |
Services | OSD |
Values | HashStorageLayout |
Default | HashStorageLayout |
Description | Adjusts the internally used storage layout on the OSD. The storage layout determines how an OSD stores its files and objects. Currently, only HashStorageLayout is supported. |
Services | MRC, OSD |
Values | String, but limited to alphanumeric characters, - and . |
Default | |
Description | Must be set to a unique identifier, preferably a UUID according to RFC 4122. UUIDs can be generated with uuidgen. Example: eacb6bab-f444-4ebf-a06a-3f72d7465e40. |
If you installed a pre-packaged release you can start, stop and restart the services with the init.d scripts:
$> /etc/init.d/xtreemfs-dir start
$> /etc/init.d/xtreemfs-mrc start
$> /etc/init.d/xtreemfs-osd start
or
$> /etc/init.d/xtreemfs-dir stop
$> /etc/init.d/xtreemfs-mrc stop
$> /etc/init.d/xtreemfs-osd stop
To run init.d scripts, root permissions are required. Note that MRC and OSD will wait for the Directory Service to become available before they start up. Once a Directory Service as well as at least one OSD and MRC are running, XtreemFS is operational.
Each XtreemFS service can generate an HTML status page, which displays runtime information about the service (Fig. 3.1). The HTTP server that generates the status page runs on the port defined by the configuration property http_port; default values are 30636 for MRCs, 30638 for Directory Services, and 30640 for OSDs.
The status page of an MRC can e.g. be shown by opening
http://my-mrc-host.com:30636/
with a common web browser. If you set an admin password in the service's configuration, you will be asked for authentication when accessing the status page. Use admin as user name.
The directory service has a built-in notification system that can send alert emails if a service fails to send heartbeat messages for some time. The monitoring can be enabled in the DIR configuration by setting monitoring = true.
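A minimal monitoring setup in the DIR configuration file might look like this (the receiver address is a placeholder):
monitoring = true
monitoring.email.program = /usr/sbin/sendmail
monitoring.email.receiver = admin@example.com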
Various issues may occur when attempting to set up an XtreemFS server component. If a service fails to start, the log file often reveals useful information. Server log files are located in /var/log/xtreemfs. Note that you can restrict granularity and categories of log messages via the configuration properties debug.level and debug.categories (see Sec. 3.2.5).
If an error occurs, please check if all of the following requirements are met:
The XtreemFS client is needed to access an XtreemFS installation from a local or remote machine. This chapter describes how to use the XtreemFS client in order to work with XtreemFS like a local file system.
There are two different installation sources for the XtreemFS Client: pre-packaged releases and source tarballs.
Note that the source tarball contains the complete distribution of XtreemFS, which also includes server and tools. Currently, binary distributions of the client are only available for Linux and Windows.
To install XtreemFS on Linux, please make sure that FUSE 2.6 or newer, boost 1.35 or newer, openSSL 0.9.8 or newer, libattr and a Linux 2.6 kernel are available on your system. For optimal performance, we suggest using FUSE 2.8 with a kernel version 2.6.26 or newer.
On RPM-based distributions (RedHat, Fedora, SuSE, Mandriva) you can install the package with
$> rpm -i xtreemfs-client-1.3.x.rpm
For Debian-based distributions, please use the .deb package provided and install it with
$> dpkg -i xtreemfs-client-1.3.x.deb
For Windows, please use the .msi installer that will guide you through the installation process. For Mac OS X, we provide a packaged client with an installer.
Extract the tarball with the sources. Change to the top level directory and execute
$> make client
This will build the XtreemFS client and non-Java-based tools. Note that the following third-party packages are required on Linux:
On RPM-based distributions: cmake >= 2.6, gcc-c++ >= 4.1, fuse >= 2.6, fuse-devel >= 2.6, boost-devel >= 1.35, openssl-devel >= 0.9.8, libattr-devel >= 2
On Debian-based distributions: cmake (>= 2.6), build-essential (>= 11), libfuse-dev (>= 2.6), libssl-dev (>= 0.9), libattr-dev (>= 2), libboost-system1.35-dev or later, libboost-thread1.35-dev or later, libboost-program-options1.35-dev or later, libboost-regex1.35-dev or later
When done, execute
$> sudo make install-client
to complete the installation of XtreemFS.
Like many other file systems, XtreemFS supports the concept of volumes. A volume can be seen as a container for files and directories with its own policy settings, e.g. for access control and replication. Before being able to access an XtreemFS installation, at least one volume needs to be set up. This section describes how to deal with volumes in XtreemFS.
Volumes can be created with the mkfs.xtreemfs command line utility. Please see man mkfs.xtreemfs for a full list of options and usage.
When creating a volume, it is recommended to specify the authorization policy (see Sec. 7.2). If not specified, POSIX permissions/ACLs will be chosen by default. Unlike most other policies, authorization policies cannot be changed afterwards.
In addition, it is recommended to set a default striping policy (see Sec. 7.4). If no per-file or per-directory default striping policy overrides the volume's default striping policy, the volume's policy is assigned to all newly created files. If no volume policy is explicitly defined when creating a volume, a RAID0 policy with a stripe size of 128kB and a width of 1 will be used as the default policy.
A volume with a POSIX permission model, a stripe size of 256kB and a stripe width of 1 (i.e. all stripes will reside on the same OSD) can be created as follows:
$> mkfs.xtreemfs -a POSIX -p RAID0 -s 256 -w 1 \ my-mrc-host.com:32636/myVolume
Creating a volume may require privileged access, which depends on whether an administrator password is required by the MRC. To pass an administrator password, add --admin_password <password> to the mkfs.xtreemfs command.
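For example, assuming the MRC is protected with the (hypothetical) administrator password secret, the volume from above would be created as follows:
$> mkfs.xtreemfs --admin_password secret -a POSIX -p RAID0 -s 256 -w 1 \ my-mrc-host.com:32636/myVolume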
For a complete list of parameters, please refer to the mkfs.xtreemfs man page.
Volumes can be deleted with the rmfs.xtreemfs tool. Deleting a volume implies that all data on the volume, i.e. all files and directories, is irrecoverably lost! Please see man rmfs.xtreemfs for a full list of options and usage. Please also note that rmfs.xtreemfs does not dispose of file contents on the OSD. To reclaim storage space occupied by the volume, it is therefore necessary to either remove all files from the volume before deleting it, or to run the cleanup tool (see Section 5.2.2).
The volume myVolume residing on the MRC my-mrc-host.com:32636 can e.g. be deleted as follows:
$> rmfs.xtreemfs my-mrc-host.com:32636/myVolume
Volume deletion is restricted to volume owners and privileged users. Similar to mkfs.xtreemfs, an administrator password can be specified if required.
A list of all volumes can be displayed with the lsfs.xtreemfs tool. All volumes hosted by the MRC my-mrc-host.com:32636 can be listed as follows:
$> lsfs.xtreemfs my-mrc-host.com:32636
Once a volume has been created, it needs to be mounted in order to be accessed.
Before mounting XtreemFS volumes on a Linux machine, please ensure that the FUSE kernel module is loaded. Please check your distribution's manual to see if users must be in a special group (e.g. trusted in openSuSE) to be allowed to mount FUSE file systems.
$> su
Password:
#> modprobe fuse
#> exit
Volumes are mounted with the mount.xtreemfs command:
$> mount.xtreemfs remote.dir.machine/myVolume /xtreemfs
remote.dir.machine describes the host with the Directory Service at which the volume is registered; myVolume is the name of the volume to be mounted. /xtreemfs is the directory on the local file system to which the XtreemFS volume will be mounted. For more options, please refer to man mount.xtreemfs.
Please be aware that the Directory Service URL needs to be provided when mounting a volume, while MRC URLs are used to create volumes.
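For example, the volume created in the previous section on the MRC my-mrc-host.com:32636 would be mounted via the Directory Service of that installation (default port 32638), here assumed to run on the hypothetical host my-dir-host.com:
$> mount.xtreemfs my-dir-host.com:32638/myVolume /xtreemfs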
When mounting a volume, the client will immediately go into the background and won't display any error messages. Use the -f option to prevent the mount process from going into the background and to get all error messages printed to the console. Alternatively, you can use xtfsutil to print the last 20 errors for a mounted volume.
To check that a volume is mounted, use the mount command. It outputs a list of all mounts in the system. XtreemFS volumes are listed as type fuse:
xtreemfs@localhost:32638/xtreemfs on /xtreemfs type fuse (...)
Volumes are unmounted with the umount.xtreemfs tool:
$> umount.xtreemfs /xtreemfs
On Mac OS X, volumes are unmounted with the regular umount command:
$> umount /xtreemfs
Access to a FUSE mount is usually restricted to the user who mounted the volume. To allow the root user or any other user on the system to access the mounted volume, the FUSE options -o allow_root and -o allow_other can be used with xtfs_mount. They are, however, mutually exclusive. In order to use these options, the system administrator must create a FUSE configuration file /etc/fuse.conf and add a line user_allow_other.
By default, the local system cache on the client machine will be used to speed up read access to XtreemFS. In particular, using the cache as a local buffer is necessary to support the mmap system call, which - amongst others - is required to execute applications on Linux. On the other hand, using buffered I/O may adversely affect throughput when writing large files, as FUSE <= 2.7 splits up large writes into multiple individual 4k (page size) writes. In addition, it limits the consistency model of client caches to ``close-to-open'', which is similar to the model provided by NFS. Buffered I/O can be switched off by adding the -o direct_io parameter, which causes all read and write operations to be directed to the OSDs instead of being served from local caches.
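For example, to bypass the client-side cache entirely:
$> mount.xtreemfs -o direct_io remote.dir.machine/myVolume /xtreemfs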
Different kinds of problems may occur when trying to create, mount or access files in a volume. If no log file was specified, the client will create a logfile called mount.xtreemfs.log in the current working directory. This logfile is only created in case of an error message. In case no useful error message is printed on the console or in the logfile, it may help to enable client-side log output. This can be done as follows:
$> mount.xtreemfs -f -d DEBUG remote.dir.machine/myVolume /xtreemfs
The following list contains the most common problems and their solutions.
Problem | A volume cannot be created or mounted. |
---|---|
Solution | Please check your firewall settings on the server side. Are all ports accessible? The default ports are 32636 (MRC), 32638 (DIR), and 32640 (OSD).
In case the XtreemFS installation has been set up behind a NAT, it is possible that services registered their NAT-internal network interfaces at the DIR. In this case, clients cannot properly resolve server addresses, even if port forwarding is enabled. Please check the Address Mappings section on the DIR status page to ensure that externally reachable network interfaces have been registered for your servers' UUIDs. If this is not the case, it is possible to explicitly specify the network interfaces to register via the hostname property (see Sec. 3.2.5 and the configuration sketch below). |
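For example, the MRC or OSD configuration file could contain a line like the following (the name is a placeholder for an externally resolvable host name):
hostname = my-public-name.example.com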
Problem | An error occurs when trying to access a mounted volume. |
---|---|
Solution | Please make sure that you have sufficient access rights to the volume root. Superusers and volume owners can change these rights via chmod <mode> <mountpoint>. If you try to access a mount point to which XtreemFS was mounted by a different user, please make sure that the volume is mounted with xtfs_mount -o allow_other .... |
Problem | An I/O error occurs when trying to create new files. |
---|---|
Solution | In general, you can check the contents of the client log file to see the error which caused the I/O error. A common reason for this problem is that no OSD could be assigned to the new file. Please check if suitable OSDs are available for the volume. There are two alternative ways to do this:
|
Problem | An I/O error occurs when trying to access an existing file. |
---|---|
Solution | Please check whether all OSDs assigned to the file are running and reachable. This can be done as follows:
|
To make use of most of the advanced XtreemFS features, XtreemFS offers a variety of tools. There are tools that support administrators with the maintenance of an XtreemFS installation, as well as tools for controlling features like replication and striping. An overview of the different tools with descriptions of how to use them are provided in the following.
The user tools are built, packaged and installed together with the XtreemFS client. For details on how to install the XtreemFS client, please refer to Section 4.1.
To install XtreemFS admin tools, you can choose from two different installation sources: you can download one of the pre-packaged releases that we create for most Linux distributions or you can install directly from the source tarball.
Note that the source tarball contains the complete distribution of XtreemFS, which also includes client and server. Currently, binary distributions of the admin tools are only available for Linux.
For the pre-packaged release, you will need Sun Java JRE 1.6.0 or newer to be installed on the system. Some tools also require the attr/libattr package to be installed.
When building XtreemFS directly from the source, you need a Sun Java JDK 1.6.0 or newer, Ant 1.6.5 or newer and gmake.
On RPM-based distributions (RedHat, Fedora, SuSE, Mandriva) you can install the package with
$> rpm -i xtreemfs-tools-1.3.x.rpm xtreemfs-backend-1.3.x.rpm
For Debian-based distributions, please use the .deb package provided and install it with
$> dpkg -i xtreemfs-tools-1.3.x.deb xtreemfs-backend-1.3.x.deb
To install the tools, the following package is required: jre >= 1.6.0 for RPM-based releases, java6-runtime for Debian-based releases. If you already have a different distribution of Java6 on your system, you can alternatively install the XtreemFS tools packages as follows:
$> rpm -i --nodeps xtreemfs-tools-1.3.x.rpm \ xtreemfs-backend-1.3.x.rpm
on RPM-based distributions,
$> dpkg -i --ignore-depends=java6-runtime \ xtreemfs-tools-1.3.x.deb xtreemfs-backend-1.3.x.deb
on Debian-based distributions.
To ensure that your local Java6 installation is used, it is necessary to set the JAVA_HOME environment variable to your Java6 installation directory, e.g.
$> export JAVA_HOME=/usr/java6
All XtreemFS tools will be installed to /usr/bin.
Extract the tarball with the sources. Change to the top level directory and execute
$> make server
When done, execute
$> sudo make install-tools
to complete the installation. Note that this will also install the XtreemFS client and servers.
This section describes the tools that support administrators in maintaining an XtreemFS installation.
The database format in which the MRC stores its file system metadata on disk may change with future XtreemFS versions, even though we attempt to keep it as stable as possible. To ensure that XtreemFS server components may be updated without having to create and restore a backup of the entire installation, it is possible to convert an MRC database to a newer version by means of a version-independent XML representation.
This is done as follows:
xtfs_mrcdbtool is a tool that is capable of doing this. It can create an XML dump of an MRC database as follows:
$> xtfs_mrcdbtool -mrc pbrpc://my-mrc-host.com:32636 \ dump /tmp/dump.xml
A file dump.xml containing the entire database content of the MRC running on my-mrc-host.com:32636 is written to /tmp/dump.xml. For security reasons, the dump file will be created locally on the MRC host. To make sure that sufficient write permissions are granted to create the dump file, we therefore recommend specifying an absolute dump file path like /tmp/dump.xml.
A database dump can be restored from a dump file as follows:
$> xtfs_mrcdbtool -mrc pbrpc://my-mrc-host.com:32636 \ restore /tmp/dump.xml
This will restore the database stored in /tmp/dump.xml at my-mrc-host.com. Note that for safety reasons, it is only possible to restore a database from a dump if the database of the running MRC does not have any content. To restore an MRC database, it is thus necessary to delete all MRC database files before starting the MRC.
Please be aware that dumping and restoring databases may both require privileged access rights if the MRC requires an administrator password. The password can be specified via --admin_password; for further details, check the xtfs_mrcdbtool man page.
In real-world environments, errors occur in the course of creating, modifying or deleting files. This can cause corruptions of file data or metadata. Such things happen e.g. if the client is suddenly terminated, or loses connection with a server component. There are several such scenarios: if a client writes to a file but does not report file sizes received from the OSD back to the MRC, inconsistencies between the file size stored in the MRC and the actual size of all objects in the OSD will occur. If a client deletes a file from the directory tree, but cannot reach the OSD, orphaned objects will remain on the OSD. If an OSD is terminated during an ongoing write operation, file content will become corrupted.
In order to detect and, if possible, resolve such inconsistencies, tools for scrubbing and OSD cleanup exist. To check the consistency of file sizes and checksums, the following command can be executed:
$> xtfs_scrub -dir pbrpc://my-dir-host.com:32638 myVolume
This will scrub each file in the volume myVolume, i.e. check file size consistency and set the correct file size on the MRC, if necessary, and check whether an invalid checksum in the OSD indicates a corrupted file content. The -dir argument specifies the directory service that will be used to resolve service UUIDs. Please see man xtfs_scrub for further details.
A second tool scans an OSD for orphaned objects, which can be used as follows:
$> xtfs_cleanup -dir pbrpc://localhost:32638 \ uuid:u2i3-28isu2-iwuv29-isjd83
The given UUID identifies the OSD to clean and will be resolved by the directory service defined by the -dir option (localhost:32638 in this example). The process will be started and can be stopped by setting the option -stop. To watch the cleanup progress use option -i for the interactive mode. For further information see man xtfs_cleanup.
The service's status field is shown in the service status page as static.status. The status can be 0 (online), 1 (marked for removal) and 2 (dead/removed). Status 0 (online) is the regular status for all services, even if they are temporarily offline. Status 2 (dead/removed) marks an OSD as permanently failed, and the scrubber will remove replicas and files from these OSDs. Status 1 (marked for removal) is for future use.
The status can be set with the xtfs_chstatus tool:
$> xtfs_chstatus -dir pbrpc://localhost:32638 \ u2i3-28isu2-iwuv29-isjd83 online
This command sets the status of the service with the UUID u2i3-28isu2-iwuv29-isjd83 to online.
XtreemFS is capable of taking file system snapshots. A snapshot captures an instantaneous image of all files and directories in a volume, which can later be accessed in a read-only manner.
Snapshots can be created, listed and deleted with the xtfs_snap tool. A mounted volume is necessary to run the tool; information on how to mount volumes can be found in Section 4.3.
As snapshots cause an additional storage and I/O overhead since they require copy-on-write versioning of files across the OSDs, it is first necessary to enable them on a volume. Snapshots can be enabled as follows:
$> xtfs_snap --enable -d /path/to/mounted/volume
Once snapshots have been enabled, a snapshot named mySnapshot can be taken as follows:
$> xtfs_snap -c -r -d /path/to/mounted/volume/subdirectory \ mySnapshot
The optional -r parameter enables a recursive capturing that includes all subdirectories beneath the XtreemFS directory subdirectory.
A list of all snapshots that exist on the volume can be displayed as follows:
$> xtfs_snap -l -d /path/to/mounted/volume
mySnapshot
Snapshots are exposed as read-only volumes. To access a snapshot, it is necessary to mount it. The volume name is composed of the original volume name and the snapshot name, separated by an @ character. Mounting a snapshot works as follows:
$> mount.xtreemfs localhost/volume@mySnapshot \ /path/to/mounted/volume2
A mounted volume snapshot can be browsed normally, and all files can be read as on the original volume. However, any attempt to write data on a snapshot will result in an EPERM error.
A snapshot mySnapshot that is no longer needed can be removed as follows:
$> xtfs_snap -x -d /path/to/mounted/volume mySnapshot
Please be aware that removing a snapshot does not automatically reclaim storage space from all prior versions. To dispose of obsolete and redundant versions on a specific OSD, it is necessary to perform a version cleanup run with the xtfs_cleanup tool:
$> xtfs_cleanup -dir localhost:32638 -v \ uuid:8bca70da-c963-43c7-b30b-d0d605d39fa7
Note: A snapshot only captures a file in its current state if it is closed. Files that are open when taking a snapshot are captured in the last state in which they were before they were opened. Since files are implicitly closed on an OSD through a timeout rather than an explicit close call, it may happen that files are not included in a snapshot despite having been closed at application level before the snapshot was taken. To make sure a change to a specific file is included in a subsequent snapshot, it is necessary to wait for the close timeout on the OSD before taking the snapshot, which by default is set to 60 seconds.
Since release 1.3, all user tools have been replaced by the xtfsutil tool. xtfsutil displays XtreemFS specific file and directory information, manages file replicas and volume policies.
When called without any option xtfsutil prints the XtreemFS specific information for a volume, directory, softlink or file.
$> cd /xtreemfs
$> echo 'Hello World' > test.txt
$> xtfsutil test.txt
will produce output similar to the following:
Path (on volume)     /test.txt
XtreemFS file Id     1089e4fb-9eb9-46ea-8acf-91d10c2170e3:2
XtreemFS URL         pbrpc://localhost:32638/xtreemfs/test.txt
Owner                user
Group                users
Type                 file
Replication policy   WqRq
XLoc version         0
Replicas:
  Replica 1
     Striping policy     STRIPING_POLICY_RAID0 / 1 / 128kB
     Replication Flags   partial
     OSD 1               test-osd1/127.0.0.1:32641
  Replica 2
     Striping policy     STRIPING_POLICY_RAID0 / 1 / 128kB
     Replication Flags   partial
     OSD 1               test-osd0/127.0.0.1:32640
  Replica 3
     Striping policy     STRIPING_POLICY_RAID0 / 1 / 128kB
     Replication Flags   partial
     OSD 1               test-osd2/127.0.0.1:32642
The fileID is the unique identifier within XtreemFS, which is used e.g. by the OSD to identify the file's objects. The owner/group fields are shown as reported by the MRC; you may see other names on your local system if there is no mapping (i.e. the file owner does not exist as a user on your local machine). The XtreemFS URL shows you on which MRC the volume is hosted and the name of the volume. This file has three replicas and is replicated with the WqRq policy (majority voting).
The replication policy defines how a file is replicated. The policy can only be changed for a file that has no replicas. If you wish to change the policy for a replicated file, you have to remove all replicas first.
To change the replication policy, execute xtfsutil with the following options:
$> xtfsutil --set-replication-policy ronly /xtreemfs/test.txt
The following values can be passed to --set-replication-policy:
Replicas can be added for files that have a replication policy defined, i.e. not none. When adding a replica, you need to specify on which OSD to create the new replica. Alternatively, you can use auto instead of an OSD UUID; in this case, xtfsutil will automatically select an OSD.
To add a replica execute:
$> xtfsutil --add-replica auto /xtreemfs/test.txt
For read-only replicated files, replicas are partial by default. To create a full replica, you can use the --full flag when adding a replica. For read-write replicated files, all replicas are equal and there are no further options.
In case you want to select an OSD for a new replica manually, you can retrieve a list of up to 10 OSDs for a file. The MRC automatically filters and sorts the list of OSDs depending on the policies set for a volume. In addition, the MRC also excludes all OSDs that already have a replica of that file. To retrieve this list execute:
$> xtfsutil --list-osds /xtreemfs/test.txt
OSDs suitable for new replicas:
  test-osd1
  test-osd2
To remove a replica, pass the OSD's UUID to xtfsutil:
$> xtfsutil --delete-replica test-osd1 /xtreemfs/test.txt
To display the volume policies and settings, execute xtfsutil on the mountpoint without any options.
$> xtfsutil /xtreemfs
will produce output similar to the following:
Path (on volume)     /
XtreemFS file Id     1089e4fb-9eb9-46ea-8acf-91d10c2170e3:1
XtreemFS URL         pbrpc://localhost:32638/replicated
Owner                user
Group                users
Type                 volume
Free/Used Space      24 GB / 6 bytes
Num. Files/Dirs      1 / 1
Access Control p.    2
OSD Selection p.     1000,3002
Replica Selection p. default
Default Striping p.  STRIPING_POLICY_RAID0 / 1 / 128kB
Default Repl. p.     WqRq with 3 replicas
Currently, it is not possible to change the striping policy of an existing file, as this would require rearrangements and transfers of data between OSDs. However, it is possible to define individual striping policies for files that will be created in the future. This can be done by changing the default striping policy of the parent directory or volume.
The striping policy can be changed with xtfsutil as follows:
$> xtfsutil --set-dsp -p RAID0 -w 4 -s 256 /xtreemfs
This will cause a RAID0 striping policy with 256kB stripe size and four OSDs to be assigned to all newly created files in /xtreemfs.
When creating a new file, XtreemFS will first check whether a default striping policy has been assigned to the file's parent directory. If this is not the case, the default striping policy for the volume will be used as the striping policy for the new file. Changing a volume's or directory's default striping policy requires superuser access rights, or ownership of the volume or directory.
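Since the directory default takes precedence over the volume default, you can, for instance, assign a directory its own striping policy while leaving the volume default unchanged. The following is only a sketch; the directory /xtreemfs/data is an example path and the options are the same as shown above:

$> xtfsutil --set-dsp -p RAID0 -w 1 -s 128 /xtreemfs
$> xtfsutil --set-dsp -p RAID0 -w 4 -s 256 /xtreemfs/data

With these settings, new files created in /xtreemfs/data will be striped over four OSDs with 256kB objects, whereas new files elsewhere on the volume will use a single OSD with 128kB objects.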
The Default Replication Policy defines how new files on a volume are replicated. This policy can be set on the volume and is valid for all sub-directories. It affects only new files and doesn't modify the replication settings for existing files.
The replication policy can be changed as follows. In this example, all files will have three replicas with WqRq mode.
$> xtfsutil --set-drp --replication-policy WqRq \
   --replication-factor 3 /xtreemfs
The following values can be passed to --replication-policy:
For read-only replication, newly created replicas are partial unless the --full flag is set (see above).
When creating a new file, OSDs have to be selected on which to store the file content. Likewise, OSDs have to be selected for a newly added replica, and the order in which replicas are contacted when accessing a file has to be determined. How these selections are done can be controlled by the user.
OSD and replica selection policies can only be set for the entire volume. Further details about the policies are described in Sec. 7.3.
The policies are set and modified with the xtfsutil tool on the volume (mount point). When called without any options, xtfsutil will also show the policies currently set for the volume. A policy that controls the selection of a replica is set as follows:
$> xtfsutil --set-rsp dcmap /xtreemfs
This will change the current replica selection policy to a policy based on a data center map.
Note that by default, there is no replica selection policy, which means that the client will attempt to access replicas in their natural order, i.e. the order in which the replicas have been created.
OSD selection policies are set and retrieved in the same way as replica selection policies:
$> xtfsutil --set-osp dcmap /xtreemfs
sets a data center map-based OSD selection policy, which is invoked each time a new file or replica is created. The following predefined policies exist (see Sec. 7.3 and man xtfsutil for details):
In addition, custom policies can be set by passing a list of basic policy IDs to be successively applied instead of a predefined policy name.
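As an illustration, such a chain can be assigned by passing the policy IDs directly; the IDs 1000,3002 below are the ones shown in the volume listing above, and the exact argument format is an assumption of this sketch (see man xtfsutil for the authoritative syntax):

$> xtfsutil --set-osp 1000,3002 /xtreemfs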
OSD and replica selection policy behavior can be further specified by means of policy attributes. For a list of predefined attributes, see Section 7.3. Policy attributes can be set as follows:
$> xtfsutil --set-pattr domains --value "*.xtreemfs.org bla.com" \
   /xtreemfs
A list of all policy attributes that have been set can be shown as follows:
$> xtfsutil --list-pattrs /xtreemfs
In some cases, it may be necessary to enforce access control on a file or directory at a finer granularity than expressible with simple "rwx"-like access rights. XtreemFS supports Access Control Lists (ACLs) to set individual access rights for users and groups.
An ACL entry for the user someone with the value rx ("read or execute") can be added as follows:
$> xtfsutil --set-acl u:someone:rx /xtreemfs
An existing entry can be removed as follows:
$> xtfsutil --del-acl u:someone /xtreemfs
Please be aware that when files or directories are accessed, the actual evaluation of ACL entries depends upon the effective authorization policy on the volume (see Section 7.2). With a POSIX authorization policy, ACL entries will be evaluated as described at http://www.suse.de/~agruen/acl/linux-acls/online.
$> xtfs_vivaldi remote.dir.machine \
   /var/lib/xtreemfs/vivaldi_coordinates
If started with the init.d script, the utility will get the DIR address from /etc/xos/xtreemfs/default_dir and will store the coordinates in /var/lib/xtreemfs/vivaldi_coordinates.
The coordinate file must be passed as an argument when mounting a volume:
$> mount.xtreemfs --vivaldi-coordinates-file-path \
   /var/lib/xtreemfs/vivaldi_coordinates \
   remote.dir.machine/myVolume /xtreemfs
Finally, the vivaldi replica and OSD selection policies must be set at the MRC for the volume(s). See Sec. 5.3.3 for details.
XtreemFS offers replication of all data. On the one hand, the Directory Service (DIR) and the Metadata Catalog (MRC) are replicated at database level. On the other hand, files are replicated on the OSDs with read/write or with read-only replication. In this chapter, we describe how these replication mechanisms work, their requirements and potential use-cases.
The replication of files adds significant communication overhead to keep replicas in sync. When a file is opened, the OSD contacted by the client requires at least three message round-trips to acquire the lease and to execute the replica reset. Once a primary has been elected, read operations can be executed locally without any communication. Truncate and write operations require a single round-trip between the primary and the backup OSDs.
Depending on the selected replication policy, the read/write replication can tolerate some replica failures. The WqRq policy employs majority voting and can tolerate replica failures as long as a majority of replicas is available. This is the most fault-tolerant strategy in XtreemFS. However, it guarantees only that data is stored on a majority of the replicas; if a majority of the replicas is lost permanently, data may be lost. The WaRa policy writes updates to all replicas, which yields higher data safety. However, this policy cannot tolerate replica failures.
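To make the trade-off concrete, here is a small worked example with illustrative replica counts: with WqRq, a write must reach a majority, i.e. at least floor(N/2) + 1 of N replicas. For N = 3, the majority is 2, so the file remains available if one replica fails, but not if two fail. For N = 5, the majority is 3, so up to two replica failures can be tolerated.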
A replication policy can either be specified for an existing file or as a default policy for the entire volume. In the former case, replicas need to be added manually. In the latter case, a default replication factor needs to be specified that defines the number of replicas that are initially created. Please be aware that a default replication policy only affects newly created files, i.e. does not automatically add replicas to existing files!
For details on how to define replication policies, please refer to Section 5.3.1 and 5.3.2.
Read-only replicas are either full or partial. Full replicas immediately copy the file data from other replicas when they are created. XtreemFS uses a rarest-first strategy (similar to BitTorrent) to increase the replication factor as quickly as possible. In contrast, partial replicas are initially empty and fetch the file data (objects) on demand when requested by a client. Partial replicas also pre-fetch a small number of objects to reduce latency for further client reads.
To configure multiple MRC instances as replicas of each other, it is necessary to enable and configure the replication plug-in across these instances. This is done by setting the property babudb.plugin.0 in the configuration file of each MRC instance, such that it points to the plug-in's configuration file. If the xtreemfs-server package has been installed, a default configuration file for the replication plug-in can be found at /etc/xos/xtreemfs/server-repl-plugin/mrc.properties. In order to activate the plug-in, open /etc/xos/xtreemfs/mrcconfig.properties with a text editor and enter (or un-comment) the following line:
babudb.plugin.0 = /etc/xos/xtreemfs/server-repl-plugin/mrc.properties
Now, it is necessary to configure the replication plug-in. For this purpose, open /etc/xos/xtreemfs/server-repl-plugin/mrc.properties with a text editor. The configuration file will look as follows:
# number of servers that at least have to be up to date
babudb.repl.sync.n = 2
...
# participants of the replication including the local address
# (may be missing, if localhost was defined explicitly)
babudb.repl.participant.0 = localhost
babudb.repl.participant.0.port = 35676
babudb.repl.participant.1 = somehost
babudb.repl.participant.1.port = 35676
...
babudb.repl.sync.n defines the number of servers that need to respond to an update before acknowledging the update to the client. To ensure data safety in the face of failures, it is necessary to set the property to a number that reflects at least a majority of all replicas. The list of replicas can be extended arbitrarily by adding new babudb.repl.participant.n as well as babudb.repl.participant.n.port properties, where n defines the replica number. Host names have to be resolvable, and hosts have to be able to reach each other on the respective ports. Please also make sure that the replica lists are identical on all replicated MRC instances and that each host can reach all other hosts in the replica set.
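As an illustration, a consistent three-way MRC replication could be configured as sketched below on all three instances; the host names mrc0.example.org, mrc1.example.org and mrc2.example.org are placeholders:

# acknowledge an update once a majority (2 of 3) of the replicas has received it
babudb.repl.sync.n = 2

# identical participant list on all three MRC instances
babudb.repl.participant.0 = mrc0.example.org
babudb.repl.participant.0.port = 35676
babudb.repl.participant.1 = mrc1.example.org
babudb.repl.participant.1.port = 35676
babudb.repl.participant.2 = mrc2.example.org
babudb.repl.participant.2.port = 35676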
Note that it is necessary to explicitly enable SSL if server-to-server authentication and encryption between replicas are required, regardless of whether an SSL-based XtreemFS installation was set up. This is because BabuDB establishes its own connection to exchange data with other replicated instances.
Please make sure that all replicated instances have consistent configurations before starting them up, which includes replica lists, babudb.repl.sync.n parameters as well as SSL settings if necessary.
XtreemFS supports a range of predefined policies for different tasks. Alternatively, administrators may define their own policies in order to adapt XtreemFS to customer demands. This chapter contains information about predefined policies, as well as mechanisms to implement and plug in custom policies.
The following predefined authentication providers exist:
The NullAuthProvider is the default Authentication Provider. It simply uses the user ID and group IDs sent by the XtreemFS client. This means that the client is trusted to send the correct user/group IDs.
The XtreemFS Client will send the user ID and group IDs of the process which executed the file system operation, not of the user who mounted the volume!
The superuser is identified by the user ID root and is allowed to do everything on the MRC. This behavior is similar to NFS with no_root_squash.
XtreemFS supports two kinds of X.509 certificates which can be used by the client. When mounted with a service/host certificate, the XtreemFS client is regarded as a trusted system component. The MRC will accept any user ID and groups sent by the client and use them for authorization as with the NullAuthProvider. This setup is useful for volumes that are used by multiple users.
The second kind are regular user certificates. The MRC will only accept the user name and group from the certificate and will ignore the user ID and groups sent by the client. Such a setup is useful if users are allowed to mount XtreemFS from untrusted machines.
Both certificates are regular X.509 certificates. Service and host certificates are identified by a Common Name (CN) starting with host/ or xtreemfs-service/, which can easily be used in existing security infrastructures. All other certificates are assumed to be user certificates.
If a user certificate is used, XtreemFS will take the Distinguished Name (DN) as the user ID and the Organizational Unit (OU) as the group ID.
Superusers must have xtreemfs-admin as part of their Organizational Unit (OU).
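If you are unsure which identity a certificate will be mapped to, you can inspect its subject, for example with OpenSSL (user-cert.pem is a placeholder file name):

$> openssl x509 -in user-cert.pem -noout -subject

For a user certificate, the DN printed here becomes the user ID, the OU becomes the group ID, and an OU containing xtreemfs-admin grants superuser rights as described above.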
Replica selection is a related problem. When a client opens a file with more than one replica, the MRC uses a replica selection policy to sort the list of replicas for the client. Initially, a client will always attempt to access the first replica in the list received from the MRC. If a replica is not available, it will automatically attempt to access the next replica from the list, and restart with the first replica if all attempts have failed. Replica selection policies can be used to sort the replica lists, e.g. to ensure that clients first try to access replicas that are close to them.
Both OSD and replica selection policies share a common mechanism, in that they consist of basic policies that can be arbitrarily combined. Input parameters of a basic policy are a set of OSDs, the list of the current replica locations of the file, and the IP address of the client on behalf of whom the policy was called. The output parameter is a filtered and potentially sorted subset of OSDs. Since OSD lists returned by one basic policy can be used as input parameters by another one, basic policies can be chained to define more complex composite policies.
OSD and replica selection policies are assigned at volume granularity. For further details on how to set such policies, please refer to Sec. 5.3.3.
The behavior of basic policies can be further refined by means of policy attributes. Policy attributes are extended attributes with a name starting with xtreemfs.policies., such as xtreemfs.policies.minFreeCapacity. Each time a policy attribute is set, all policies will be notified about the change. How an attribute change affects the policy behavior depends on the policy implementation.
Each basic policy can be assigned to one of three categories: filtering, grouping and sorting. Filtering policies generate a sub-list from a list of OSDs; the sub-list only contains those OSDs from the original list that have a certain property. Grouping policies are used to select a subgroup from a given list of OSDs. They work in a similar manner to filtering policies, but unlike filtering policies, they always return a list of a fixed size. Sorting policies generate and return a reordered list from the input OSD list, without removing any OSDs.
The following predefined policies exist:
This policy uses a statically configured datacenter map that describes the distance between datacenters. It works only with IPv4 addresses at the moment. Each datacenter has a list of matching IP addresses and networks which is used to assign clients and OSDs to datacenters. Machines in the same datacenter have a distance of 0.
This policy requires a datacenter map configuration file in /etc/xos/xtreemfs/datacentermap on the MRC machine, which is loaded at MRC start-up. This configuration file must contain the following parameters:
A sample datacenter map could look like this:
datacenters=BERLIN,LONDON,NEW_YORK
distance.BERLIN-LONDON=10
distance.BERLIN-NEW_YORK=140
distance.LONDON-NEW_YORK=110
addresses.BERLIN=192.168.1.0/24
addresses.LONDON=192.168.2.0/24
addresses.NEW_YORK=192.168.3.0/24,192.168.100.0/25
max_cache_size=100
This policy uses domain names of clients and OSDs to determine the distance between a client and an OSD, as well as if OSDs are in the same domain.
XtreemFS allows the content of a file, i.e. its objects, to be distributed among several storage devices (OSDs). This has the benefit that the file can be read or written in parallel on multiple OSDs in order to increase throughput. To configure how files are striped, XtreemFS supports striping policies.
A striping policy is a rule that defines how the objects are distributed across the available OSDs. Currently, XtreemFS implements only the RAID0 policy, which simply stores the objects in a round-robin fashion on the OSDs. The RAID0 policy has two parameters. The striping width defines how many OSDs the file is distributed across. If fewer OSDs are available when the file is created, the number of available OSDs is used instead; if no OSD is available, an I/O error is reported to the client. The stripe size defines the size of each object.
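To illustrate how RAID0 translates byte offsets into objects and OSDs (the formulas below merely restate the round-robin scheme described above):

object index = floor(byte offset / stripe size)
OSD index = object index mod striping width

For example, with a stripe size of 128kB and a striping width of 4, byte offset 300kB falls into object floor(300/128) = 2, which is stored on the third OSD of the replica (index 2).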
Striping over several OSDs enhances the read and write throughput to a file. The maximum throughput depends on the striping width. However, using RAID0 also increases the probability of data loss. If a single OSD fails, parts of the file are no longer accessible, which generally renders the entire file useless. Replication can mitigate the problem but has all the restrictions described in Sec. .
To further customize XtreemFS, the set of existing policies can be extended by defining plug-in policies. Such policies are Java classes that implement a predefined policy interface. Currently, the following policy interfaces exist:
Note that there may only be one authentication provider per MRC, while file access policies and OSD selection policies may differ for each volume. The former is identified by means of its class name (property authentication_provider, see Sec. 3.2.3, 3.2.5), while volume-related policies are identified by ID numbers. It is therefore necessary to add a member field
public static final long POLICY_ID = 4711;
to all such policy implementations, where 4711 represents the individual ID number. Administrators have to ensure that such ID numbers neither clash with ID numbers of built-in policies (1-9), nor with ID numbers of other plug-in policies. When creating a new volume, IDs of plug-in policies may be used just like built-in policy IDs.
Plug-in policies have to be deployed in the directory specified by the MRC configuration property policy_dir. The property is optional; it may be omitted if no plug-in policies are supposed to be used. An implementation of a plug-in policy can be deployed as a Java source or class file located in a directory that corresponds to the package of the class. Library dependencies may be added in the form of source, class or JAR files. JAR files have to be deployed in the top-level directory. All source files in all subdirectories are compiled at MRC start-up time and loaded on demand.
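As a sketch of a possible deployment layout (the directory name, package and class name are placeholders):

# in the MRC configuration:
policy_dir = /etc/xos/xtreemfs/policies

# resulting layout on disk:
/etc/xos/xtreemfs/policies/somelibrary.jar
/etc/xos/xtreemfs/policies/com/example/MyOSDSelectionPolicy.java

The JAR dependency resides in the top-level directory, while the policy source file is placed in a directory corresponding to its package (com.example in this example).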
XtreemFS is a distributed file system that can be used instead of HDFS, the distributed file system developed by the Hadoop project.
In a common Hadoop setup, it therefore replaces the NameNode and the DataNodes provided by HDFS. The DIR takes the role of the NameNode: it stores the information about where files and their metadata are located on the OSDs and the MRC, just as the NameNode does for the DataNodes. The DataNodes hold the files stored in HDFS; in XtreemFS, these files are split into metadata and raw file data, which are stored separately on the MRC and the OSDs.
The three master services JobTracker, DIR and MRC are required in a Hadoop configuration. They can run alone or in arbitrary combinations on the same machine. Hadoop can be used with an arbitrary number of Slaves. It is recommended to run a TaskTracker together with an OSD on each Slave machine to improve performance, but it is not mandatory.
This section will help you to set up a simple Hadoop configuration with all necessary services running on the same host.
Required software:
Setup:
<configuration>
  <property>
    <name>fs.xtreemfs.impl</name>
    <value>org.xtreemfs.common.clients.hadoop.XtreemFSFileSystem</value>
    <description>The FileSystem for xtreemfs: uris.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>xtreemfs://localhost:32638</value>
    <description>Address for the DIR.</description>
  </property>
  <property>
    <name>xtreemfs.volumeName</name>
    <value>volumeName</value>
    <description>Name of the volume to use within XtreemFS.</description>
  </property>
</configuration>
<property>
  <name>xtreemfs.client.userid</name>
  <value>hadoopUserID</value>
  <description>UserID to be used by Hadoop while accessing XtreemFS.</description>
</property>
<property>
  <name>xtreemfs.client.groupid</name>
  <value>hadoopGroupID</value>
  <description>GroupID to be used by Hadoop while accessing XtreemFS.</description>
</property>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>Listening address for the JobTracker.</description>
  </property>
</configuration>
This specifies the address at which the JobTracker will be running.
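Once all services are running, the connection between Hadoop and XtreemFS can be verified with a simple listing; this sketch assumes the configuration above and that the hadoop binary is on your PATH:

$> hadoop fs -ls /

If the setup is correct, this command lists the contents of the configured XtreemFS volume rather than an HDFS namespace.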