PVFSv2 Documentation
PVFS version 2 is currently under test on the u2 cluster, and is only available to specific users who have access to the "pvfs" queue.
There is a PVFSv2 filesystem spanning the 16 nodes c29n17 through c29n32.
Each of these nodes is both a data server and a metadata server.
There is no redundancy in the configuration for either the data or the metadata.
If any of these nodes is unavailable, some or (in practice) all of the filesystem is unavailable.
The client nodes c29n01 through c29n16 mount this PVFS filesystem on boot
on the directory:
/pvfs2/scratch
You can set distribution parameters on a directory so that all files
subsequently created in that directory use that data distribution, e.g.:
c29n01$ mkdir /pvfs2/scratch/somedir
c29n01$ setfattr -n "user.pvfs2.dist_name" -v "twod_stripe" \
            /pvfs2/scratch/somedir
c29n01$ setfattr -n "user.pvfs2.dist_params" -v "strip_size:8192" \
            /pvfs2/scratch/somedir
c29n01$
You can read the parameters set on a directory as follows:
c29n01$ getfattr /pvfs2/scratch/somedir/
getfattr: Removing leading '/' from absolute path names
# file: pvfs2/scratch/somedir
user.pvfs2.dist_name
user.pvfs2.dist_params
c29n01$
i.e. user.pvfs2.dist_name and user.pvfs2.dist_params have been set on
this directory. To read the values of those parameters:
c29n01$ getfattr -n "user.pvfs2.dist_name" /pvfs2/scratch/somedir/
getfattr: Removing leading '/' from absolute path names
# file: pvfs2/scratch/somedir
user.pvfs2.dist_name="twod_stripe"
c29n01$
c29n01$ getfattr -n "user.pvfs2.dist_params" /pvfs2/scratch/somedir/
getfattr: Removing leading '/' from absolute path names
# file: pvfs2/scratch/somedir
user.pvfs2.dist_params="strip_size:8192"
c29n01$
We are running PVFS v2 version 2.7.1 with patches to fix problems with
the basic, 2D stripe, and variable strip distributions.
Summary of the Distribution Settings:
"basic_dist" == one datafile per server
$ setfattr -n "user.pvfs2.dist_name" -v "basic_dist" dir
[no settable parameters]
Note: If you create a directory and set the distribution for the
directory to "basic_dist" all files created in that directory will
be stored on one server
"simple_stripe" == stripe over servers (all servers by default)
Note: This is the default distribution
$ setfattr -n "user.pvfs2.dist_name" -v "simple_stripe" dir
set strip size in bytes with (e.g.):
$ setfattr -n "user.pvfs2.dist_params" -v "strip_size:4096" dir
Note: The strip size is the amount of data written to each server in a stripe,
it is NOT the stripe size i.e.
stripe size == strip_size * number of servers in the stripe
set the number of servers for the stripe (stripe = strip * servers) with (e.g.):
$ setfattr -n "user.pvfs2.num_dfiles" -v "8" dir
default strip size = 64K
...with 16 servers the stripe size is 16 * 64K = 1M stripe
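The strip/stripe arithmetic above can be checked with a couple of lines of shell. This is only a sketch: stripe_size is a hypothetical helper name (not a PVFS command), and it does nothing more than multiply the two numbers described in the note.

```shell
#!/bin/sh
# stripe_size STRIP_SIZE_BYTES NUM_SERVERS
# Hypothetical helper: prints the full stripe size in bytes,
# i.e. strip size multiplied by the number of servers in the stripe.
stripe_size() {
    echo $(( $1 * $2 ))
}

# Default 64K strip over all 16 servers:
stripe_size 65536 16     # prints 1048576 (1M)

# 4096 byte strip with "user.pvfs2.num_dfiles" set to 8:
stripe_size 4096 8       # prints 32768 (32K)
```

Going the other way, pick the stripe size you want and divide by the number of servers to get the "strip_size" value to set.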
"twod_stripe" == Two dimensional striping pattern over servers
This distribution is designed to combat incast
As per Kyle Schochenmaler:
The twod-stripe will take all of the servers in the filesystem and partition
them into num_groups groups. Data will then be striped to each group before we
move onto the next group. The strip_factor will determine how many chunks of
strip_size are written to each server in each group before we transition to the
next group. The striping on the group level is done round-robin in the same
fashion as simple-stripe
$ setfattr -n "user.pvfs2.dist_name" -v "twod_stripe" dir
set strip [NOT stripe] size in bytes with (e.g.):
$ setfattr -n "user.pvfs2.dist_params" -v "strip_size:32768" dir
set number of groups (e.g.):
$ setfattr -n "user.pvfs2.dist_params" -v "num_groups:4" dir
set the group strip factor (e.g.)
$ setfattr -n "user.pvfs2.dist_params" -v "group_strip_factor:8" dir
Note: These settings are NOT additive, so to set 32KB strip, 4 groups with strip factor 8:
$ setfattr -n "user.pvfs2.dist_params" \
      -v "strip_size:32768,num_groups:4,group_strip_factor:8" dir
Note: These settings are separated by commas ','
If you use the above setting on a directory in /pvfs2/scratch which has
16 servers, you set "num_groups" to 4 so you will create:
16/4 ==> 4 groups of 4 servers each.
Each server in the first group will have "strip_size" bytes, that is 32KB,
written to it; creating a stripe across these 4 servers.
stripe size == strip_size * number of servers in the stripe
... in this case:
32KB * 4 == 128KB stripe
So a 128K stripe will be written across the 4 servers in the first group.
With "group_strip_factor" set to 8, 8 of these 128KB stripes will be written
to the first group, that is:
8 * 128KB = 1024KB == 1MB of data
Then 1MB will be written to the next group of 4 servers in the same way, and so on.
default strip size = 64K
default number of groups = 2
default strip factor = 256
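The worked example above can be reproduced in shell. This is a sketch of the layout as described in this document, not of the PVFS source code: twod_layout and twod_group are hypothetical helper names, the servers are assumed to divide evenly among the groups, and the wrap-around back to the first group after the last one is my assumption.

```shell
#!/bin/sh
# twod_layout STRIP_SIZE NUM_SERVERS NUM_GROUPS GROUP_STRIP_FACTOR
# Hypothetical helper: prints servers per group, the stripe size across
# one group, and the bytes written to a group before moving to the next.
twod_layout() {
    strip=$1; servers=$2; groups=$3; factor=$4
    per_group=$(( servers / groups ))    # assumes an even split
    stripe=$(( strip * per_group ))      # one stripe across the group
    turn=$(( stripe * factor ))          # data per group per turn
    echo "servers_per_group=$per_group stripe_bytes=$stripe bytes_per_group_turn=$turn"
}

# twod_group OFFSET STRIP_SIZE NUM_SERVERS NUM_GROUPS GROUP_STRIP_FACTOR
# Hypothetical helper: prints the group number (counting from 0) that a
# byte offset falls in, assuming groups are revisited round-robin.
twod_group() {
    turn=$(( $2 * ($3 / $4) * $5 ))
    echo $(( ($1 / turn) % $4 ))
}

# 32KB strip, 16 servers, 4 groups, strip factor 8:
twod_layout 32768 16 4 8
# prints servers_per_group=4 stripe_bytes=131072 bytes_per_group_turn=1048576

twod_group 0 32768 16 4 8          # prints 0
twod_group 1048576 32768 16 4 8    # prints 1 (second MB goes to the second group)
```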
"varstrip_dist" == control which servers to stripe across and the strip size on each server
$ setfattr -n "user.pvfs2.dist_name" -v "varstrip_dist" dir
The format of "user.pvfs2.dist_params" (based on online examples
and pvfs2-dist-varstrip.h):
"strips:node_number:size;node_number:size[;node_number:size]..."
example: "strips:0:512;1:512"
results in: node number 0: bytes 0-511
node number 1: bytes 512-1023
For example the following should give a 256K stripe over the
16 servers (the same as "simple_stripe" with "strip_size:16384")
$ setfattr -n "user.pvfs2.dist_params" -v "strips:0:16384;\
1:16384;2:16384;3:16384;4:16384;5:16384;6:16384;7:16384;\
8:16384;9:16384;10:16384;11:16384;12:16384;13:16384;14:16384;\
15:16384" dir
$
This would be more useful for creating (e.g.) two directories
with non-overlapping 256K stripes, each over 8 servers, like this:
$ setfattr -n "user.pvfs2.dist_name" -v "varstrip_dist" dira
$ setfattr -n "user.pvfs2.dist_params" -v "strips:0:32768;\
1:32768;2:32768;3:32768;4:32768;5:32768;6:32768;7:32768" dira
$ setfattr -n "user.pvfs2.dist_name" -v "varstrip_dist" dirb
$ setfattr -n "user.pvfs2.dist_params" -v "strips:8:32768;\
9:32768;10:32768;11:32768;12:32768;13:32768;14:32768;\
15:32768" dirb
$
Note: The "node_number:strip_size" pairs are separated by semicolons ';'
There are no default parameters per se, but if you're going to use
this distribution you will want to set the distribution parameters
anyhow...
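Typing the long "strips:" strings by hand is error prone, so a small shell function can build them. This is a sketch only: make_strips is a hypothetical helper name, it gives every server the same strip size, and it assumes the servers you want are a contiguous range of node numbers.

```shell
#!/bin/sh
# make_strips FIRST_NODE LAST_NODE STRIP_SIZE_BYTES
# Hypothetical helper: prints a "strips:" parameter string with one
# "node_number:size" pair per node, pairs separated by semicolons.
make_strips() {
    n=$1
    out="strips:"
    while [ "$n" -le "$2" ]; do
        out="${out}${n}:$3"
        # add a separator after every pair except the last
        [ "$n" -lt "$2" ] && out="${out};"
        n=$(( n + 1 ))
    done
    echo "$out"
}

make_strips 0 1 512      # prints strips:0:512;1:512

# On a client node the result could be passed straight to setfattr, e.g.:
# setfattr -n "user.pvfs2.dist_params" -v "$(make_strips 8 15 32768)" dirb
```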
To use MPICH2 with ROMIO for PVFSv2 with the GNU gcc compiler you will
need to load the mpich2 module as follows:
$ module load mpich2/gcc-3.4.6/1.0.7-pvfs2
Note: Remember to add this command to your PBS job scripts to set
the runtime library path
Summary of Performance Tests against the /pvfs2/scratch filesystem
The following is a summary of the performance tests against the PVFSv2 filesystem
served from 16 nodes, each node running as a data & metadata server.
The filesystem was mounted on 16 client (only) nodes on the /pvfs2/scratch directory.
---------------------------------------------------------------------------------
Test data written to a subdirectory of /pvfs2/scratch using the "simple_stripe"
distribution over a 256K stripe with iozone writing 4GB of data from each client
in 256K blocks.
Mean Average of test values:
| Initial write | 321,058.83 KB/sec |
| Rewrite       | 212,626.53 KB/sec |
| Read          | 161,611.95 KB/sec |
| Re-read       | 154,799.58 KB/sec |
[Detailed test results for /pvfs2/scratch simple_stripe tests]
Test data written to a subdirectory of /pvfs2/scratch using the "twod_stripe"
distribution using 4 groups of 4 servers, striping 4 x 64KB = 256KB stripe over
each group, writing 8 * 256KB = 2MB to each group in turn.
iozone writing 4GB of data from each client in 256K blocks.
Mean Average of test values:
| Initial write | 365,058.37 KB/sec |
| Rewrite       | 243,681.57 KB/sec |
| Read          | 175,666.28 KB/sec |
| Re-read       | 171,005.77 KB/sec |
[Detailed test results for /pvfs2/scratch twod_stripe tests]
Test data written to a subdirectory of /pvfs2/scratch using the "varstrip_dist"
distribution over a 256K stripe with iozone writing 4GB of data from each client
in 256K blocks.
The distribution was configured to write 16K to each node, thereby creating a
256K stripe over the 16 server nodes.
Mean Average of test values:
| Initial write | 298,962.00 KB/sec |
| Rewrite       | 197,318.79 KB/sec |
| Read          | 203,210.17 KB/sec |
| Re-read       | 186,866.10 KB/sec |
[Detailed test results for /pvfs2/scratch varstrip_dist tests]
[Full Test Results for /pvfs2/scratch tests]
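To compare two of the mean figures above it can help to express the difference as a percentage. The following is a minimal sketch using awk for the floating point arithmetic; pct_change is a hypothetical helper name, and the numbers fed to it are the initial write and read means quoted above for "simple_stripe" and "twod_stripe".

```shell
#!/bin/sh
# pct_change BASE NEW
# Hypothetical helper: prints the percentage change from BASE to NEW,
# rounded to one decimal place.
pct_change() {
    awk -v b="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (n - b) / b * 100 }'
}

# Initial write, simple_stripe -> twod_stripe:
pct_change 321058.83 365058.37    # prints 13.7

# Read, simple_stripe -> twod_stripe:
pct_change 161611.95 175666.28    # prints 8.7
```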
PVFS v2 filesystem on-the-fly
I've been working on an option for creating a PVFSv2 filesystem on
the fly for a PBS job. The filesystem is created across all the
nodes in the job using space in the local /scratch directory.
This works in my tests when 2 processors per node are requested
for the PBS job (i.e. the nodes are dedicated to the PBS job), but I
have seen problems when running multiple jobs using one processor
per node, i.e. when two PVFS filesystems are created with two servers
running on each node. With one processor per node I have seen problems
mounting the on-the-fly filesystem. I'm working on this now...
If you add the tag "PVFS" to a PBS job script a PVFS filesystem
is created across all the (compute) nodes in the job.
This tag should not be used for jobs that use the PVFSv2 filesystem
/pvfs2/scratch (served from nodes c29n17 through c29n32).
This tag is only available on nodes in the pvfs queue
(nodes c29n01 through c29n16) for now because the pvfs and
pvfs-kernelmodule RPMs are required to create & mount the
filesystem.
The filesystem is mounted on the directory:
${PBSTMPDIR}/mnt
I source a Bourne shell script which sets several PVFS environment variables:
source /var/spool/pbs/mom_priv/pvfs_globals ${PBS_JOBID}
...though this is not required to access the filesystem
Having sourced this script I can use ${PVFS2_CLIENT_MOUNT_POINT}
for the client mount point.
The script also sets ${PVFS2TAB_FILE} which has to be set if
you want to use commands like pvfs2-ping for debugging.
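In a job script it can be worth checking that these two variables really are set before relying on them. The following is a sketch only: pvfs_env_ok is a hypothetical helper name, and the only things it knows about are the two variable names mentioned above.

```shell
#!/bin/sh
# pvfs_env_ok
# Hypothetical helper: succeeds only if both PVFS variables named in the
# text are set and non-empty in the current shell.
pvfs_env_ok() {
    [ -n "${PVFS2_CLIENT_MOUNT_POINT:-}" ] && [ -n "${PVFS2TAB_FILE:-}" ]
}

if pvfs_env_ok; then
    echo "PVFS mount point: ${PVFS2_CLIENT_MOUNT_POINT}"
else
    echo "PVFS variables not set; source the pvfs_globals script first" >&2
fi
```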
#!/bin/sh
#
# PBS iozone job
#
# for date and time specific filenames submit this job with
#
#   qsub -v date_and_time="`date +%b_%d_%Y__%T`" _this_script_
#
#
# written by:
#   Tony Kew
#   SAN Administrator
#   The Center for Computational Research
#   New York State Center of Excellence
#   in Bioinformatics & Life Sciences
#   701 Ellicott Street, Buffalo, NY 14203
#
#   Office: (716) 881-8930   Fax: (716) 849-665
#
#PBS -q pvfs
# 1 hour walltime, 16 nodes, 2 processors per node
#PBS -l walltime=01:00:00,nodes=16:ppn=2
# output filename:
#PBS -o /san/user/../../${HOME}/PVFSv2/test/output/iozone_16_node_4gig_PVFSv2_scratch_simple_stripe_256K_stripe_${date_and_time}.out
# merge job standard error into standard out
#PBS -j oe
# send email when the job terminates:
#PBS -m e
# job name:
#PBS -N PVFSv2_iozone_test

TEST_DIR=/pvfs2/scratch/iozone_test_simple_stripe_256K_stripe
#TEST_SIZE=1g
TEST_SIZE=4g
NUMBER_OF_NODES=16
OUTPUT_EXCEL=iozone_${NUMBER_OF_NODES}_node_${TEST_SIZE}ig_PVFSv2_scratch_simple_stripe_256K_stripe_${date_and_time}.xls

# iozone local binary builds:
#IOZONE_BINARY=/san/fs1/util64/benchmark/iozone3_279/src/current/iozone
#IOZONE_BINARY=/san/fs1/util64/benchmark/iozone-3_263/iozone3_263/src/current/iozone
IOZONE_BINARY=/san/fs1/util64/benchmark/iozone3_283/src/current/iozone
# iozone RPM location:
#IOZONE_BINARY=/opt/iozone/bin/iozone

rm -rf ${TEST_DIR}
mkdir -p ${TEST_DIR}

# PVFS tuning for the Test directory
# - all files in the directory will be created with these parameters
setfattr -n "user.pvfs2.dist_name" -v "simple_stripe" ${TEST_DIR}
# note: "simple_stripe" is the default so the above is not really necessary...
setfattr -n "user.pvfs2.dist_params" -v "strip_size:16384" ${TEST_DIR}
#
mkdir -p ${HOME}/PVFSv2/test/output

# unlimited stack size
ulimit -s unlimited
# unlimited coredump size
#ulimit -c unlimited
# no coredumps
ulimit -c 0

cd ${PBSTMPDIR}
IOZONE_LOCAL_BINARY="${PBSTMPDIR}/`basename ${IOZONE_BINARY}`"

> iozone_nodelist_$$
cat ${PBS_NODEFILE} | sort -r | uniq | while read node
do
    rcp ${IOZONE_BINARY} ${node}:${PBSTMPDIR}
    #scp ${IOZONE_BINARY} ${node}:${PBSTMPDIR}
    echo "${node} ${TEST_DIR} ${IOZONE_LOCAL_BINARY}" >> iozone_nodelist_$$
done

# in case the head node is not in the nodelist (it should be)
cp ${IOZONE_BINARY} ${PBSTMPDIR}

${IOZONE_LOCAL_BINARY} -RM -r 256k -s ${TEST_SIZE} -l 1 -u ${NUMBER_OF_NODES} \
    -i 0 -i 1 -c -t ${NUMBER_OF_NODES} -+m iozone_nodelist_$$ -b ${OUTPUT_EXCEL}

if [ -f ${OUTPUT_EXCEL} ]
then
    #mv ${OUTPUT_EXCEL} ${PBS_O_WORKDIR}
    mv ${OUTPUT_EXCEL} ${HOME}/PVFSv2/test/output/
fi
Summary of Performance Tests against PVFSv2 on the fly filesystems
The following is a summary of the performance tests using a PVFSv2 on the fly filesystem
over 16 nodes, where all nodes run data server, metadata server & client.
The filesystem was mounted on the ${PBSTMPDIR}/mnt directory.
---------------------------------------------------------------------------------
Test data written to a subdirectory of the PVFSv2 on the fly filesystem mounted on
${PBSTMPDIR}/mnt using the "simple_stripe" distribution over a 256K stripe with
iozone writing 4GB of data from each client in 256K blocks.
Mean Average numbers:
| Initial write | 219,306.19 KB/sec |
| Rewrite       | 130,799.13 KB/sec |
| Read          | 183,249.66 KB/sec |
| Re-read       | 191,565.02 KB/sec |
[Detailed test results for PVFSv2 on the fly filesystem simple_stripe tests]
Test data written to a subdirectory of the PVFSv2 on the fly filesystem mounted on
${PBSTMPDIR}/mnt using the "twod_stripe" distribution using 4 groups of 4 servers,
striping 4 x 64KB = 256KB stripe over each group, writing 8 * 256KB = 2MB to each
group in turn. iozone writing 4GB of data from each client in 256K blocks.
Mean Average numbers:
| Initial write | 343,876.68 KB/sec |
| Rewrite       | 229,740.04 KB/sec |
| Read          | 167,045.91 KB/sec |
| Re-read       | 166,417.03 KB/sec |
[Detailed test results for PVFSv2 on the fly filesystem twod_stripe tests]
Test data written to a subdirectory of the PVFSv2 on the fly filesystem mounted on
${PBSTMPDIR}/mnt using the "varstrip_dist" distribution with iozone writing 4GB
of data from each client in 256K blocks.
The distribution was configured to write 16K to each node, thereby creating a
256K stripe over the 16 nodes:
Mean Average numbers:
| Initial write | 69,381.52 KB/sec |
| Rewrite       | 69,611.84 KB/sec |
| Read          | 64,910.00 KB/sec |
| Re-read       | 65,248.22 KB/sec |
[Detailed test results for PVFSv2 on the fly filesystem varstrip_dist tests]
