PVFS version 2 scratch filesystem on U2

PVFSv2 Documentation

PVFS version 2 is currently under test on the u2 cluster, and is only available to specific users who have access to the "pvfs" queue.

There is a PVFSv2 filesystem spanning the 16 nodes c29n17 through c29n32.
Each of these nodes is both a data server and a metadata server.
There is no redundancy in the configuration for either the data or the metadata.
If any of these nodes is unavailable then some, or (in practice) all, of the filesystem is unavailable.

The client nodes c29n01 through c29n16 mount this PVFS filesystem on boot
on the directory:

/pvfs2/scratch

You can set distribution parameters on a directory so that all files
subsequently created in that directory use that data distribution, e.g.:

c29n01$ mkdir /pvfs2/scratch/somedir
c29n01$ setfattr -n "user.pvfs2.dist_name" -v "twod_stripe" \
/pvfs2/scratch/somedir
c29n01$ setfattr -n "user.pvfs2.dist_params" -v "strip_size:8192" \
/pvfs2/scratch/somedir
c29n01$
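
The two setfattr calls above can be wrapped in a small helper. This is only a sketch: the function name set_pvfs_dist and the DRY_RUN switch are made up for illustration and are not part of PVFS. With DRY_RUN=1 (the default here) it only prints the commands, so it can be tried on a machine without a PVFS mount.

```shell
#!/bin/sh
# Hypothetical helper (not part of PVFS): apply a distribution name and an
# optional parameter string to a directory using the extended attributes
# described above. DRY_RUN=1 prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

set_pvfs_dist() {
  dir=$1; name=$2; params=$3
  for pair in "user.pvfs2.dist_name=${name}" "user.pvfs2.dist_params=${params}"; do
    attr=${pair%%=*}      # text before the first '='
    value=${pair#*=}      # text after the first '='
    [ -n "${value}" ] || continue   # skip dist_params when none were given
    if [ "${DRY_RUN}" = "1" ]; then
      echo "setfattr -n \"${attr}\" -v \"${value}\" ${dir}"
    else
      setfattr -n "${attr}" -v "${value}" "${dir}"
    fi
  done
}

set_pvfs_dist /pvfs2/scratch/somedir twod_stripe strip_size:8192
```

Set DRY_RUN=0 on a client node to actually apply the attributes; the directory must already exist.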

You can read the parameters set on a directory as follows:

c29n01$ getfattr /pvfs2/scratch/somedir/
getfattr: Removing leading '/' from absolute path names
# file: pvfs2/scratch/somedir
user.pvfs2.dist_name
user.pvfs2.dist_params
c29n01$

i.e. user.pvfs2.dist_name and user.pvfs2.dist_params have been set on
this directory. To read the values of those parameters:

c29n01$ getfattr -n "user.pvfs2.dist_name" /pvfs2/scratch/somedir/
getfattr: Removing leading '/' from absolute path names
# file: pvfs2/scratch/somedir
user.pvfs2.dist_name="twod_stripe"
c29n01$
c29n01$ getfattr -n "user.pvfs2.dist_params" /pvfs2/scratch/somedir/
getfattr: Removing leading '/' from absolute path names
# file: pvfs2/scratch/somedir
user.pvfs2.dist_params="strip_size:8192"
c29n01$
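
If you want a single value (say the strip size) in a script, the getfattr output can be picked apart with sed. The following is a sketch only: the function name dist_param is hypothetical, and the sample input is copied from the output shown above rather than read from a live filesystem.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key (e.g. strip_size) from
# getfattr-style output supplied on stdin.
dist_param() {
  key=$1
  sed -n 's/^user\.pvfs2\.dist_params="\(.*\)"$/\1/p' |   # keep only the quoted value
    tr ',' '\n' |                                          # one key:value per line
    sed -n "s/^${key}://p"                                 # print the value for this key
}

# Sample text copied from the getfattr output above. On a client node the
# input would instead come from:
#   getfattr -n "user.pvfs2.dist_params" /pvfs2/scratch/somedir
printf '%s\n' '# file: pvfs2/scratch/somedir' \
  'user.pvfs2.dist_params="strip_size:8192"' | dist_param strip_size
```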

We are running PVFS v2 version 2.7.1 with patches to fix problems with
the basic, 2D stripe, and variable strip distributions.

Summary of the Distribution Settings:

"basic_dist" == one datafile per server
$ setfattr -n "user.pvfs2.dist_name" -v "basic_dist" dir

[no settable parameters]

Note: If you create a directory and set the distribution for the
directory to "basic_dist", all files created in that directory will
be stored on one server.

"simple_stripe" == stripe over servers (all servers by default)

Note: This is the default distribution

$ setfattr -n "user.pvfs2.dist_name" -v "simple_stripe" dir

set strip size in bytes with (e.g.):

$ setfattr -n "user.pvfs2.dist_params" -v "strip_size:4096" dir

Note: The strip size is the amount of data written to each server in a stripe;
it is NOT the stripe size, i.e.
stripe size == strip_size * number of servers in the stripe

set the number of servers for the stripe (stripe = strip * servers) with (e.g.):

$ setfattr -n "user.pvfs2.num_dfiles" -v "8" dir

default strip size = 64K
...with 16 servers the stripe size is 16 * 64K = 1M stripe
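
The strip/stripe arithmetic is easy to check in the shell. A minimal sketch (the function name stripe_size is just for illustration):

```shell
#!/bin/sh
# stripe size in bytes = strip_size (bytes per server) * number of servers
stripe_size() {
  echo $(( $1 * $2 ))
}

stripe_size 65536 16   # default 64K strip over 16 servers
stripe_size 4096 8     # 4K strip with num_dfiles set to 8
```

The first call prints 1048576 (1M) and the second prints 32768 (32K).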

"twod_stripe" == Two dimensional striping pattern over servers

This distribution is designed to combat incast

As per Kyle Schochenmaler:

The twod-stripe will take all of the servers in the filesystem and partition
them into num_groups groups. Data will then be striped to each group before we
move onto the next group. The strip_factor will determine how many chunks of
strip_size are written to each server in each group before we transition to the
next group. The striping on the group level is done round-robin in the same
fashion as simple-stripe

$ setfattr -n "user.pvfs2.dist_name" -v "twod_stripe" dir

set strip [NOT stripe] size in bytes with (e.g.):

$ setfattr -n "user.pvfs2.dist_params" -v "strip_size:32768" dir

set number of groups (e.g.):

$ setfattr -n "user.pvfs2.dist_params" -v "num_groups:4" dir

set the group strip factor (e.g.)

$ setfattr -n "user.pvfs2.dist_params" -v "group_strip_factor:8" dir

Note: These settings are NOT additive, so to set a 32KB strip and 4 groups with a group strip factor of 8:

$ setfattr -n "user.pvfs2.dist_params" \
 -v "strip_size:32768,num_groups:4,group_strip_factor:8" dir

Note: These settings are separated by commas ','

If you use the above settings on a directory in /pvfs2/scratch, which has
16 servers, then with "num_groups" set to 4 you will create:

16/4 = 4 servers per group ==> 4 groups of 4 servers each.

Each server in the first group will have "strip_size" bytes, that is 32KB,
written to it; creating a stripe across these 4 servers.

stripe size == strip_size * number of servers in the stripe

... in this case:
32KB * 4 == 128KB stripe
So a 128K stripe will be written across the 4 servers in the first group.

With "group_strip_factor" set to 8, 8 of these 128KB stripes will be written
to the first group, that is:

8 * 128KB = 1024KB == 1MB of data

Then 1MB will be written to the next group of 4 servers in the same way, and so on.
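
To make the walk-through above concrete, here is a sketch that maps a file offset to a group and server. It follows only the description given on this page, not the PVFS source, and it assumes the servers divide evenly into groups and are numbered from 0; the function name twod_locate is hypothetical.

```shell
#!/bin/sh
# Sketch of the twod_stripe layout as described in the text above
# (an illustration, not taken from the PVFS implementation).
# Arguments: offset strip_size num_servers num_groups group_strip_factor
twod_locate() {
  offset=$1; strip_size=$2; num_servers=$3; num_groups=$4; factor=$5
  per_group=$(( num_servers / num_groups ))            # servers in each group
  group_bytes=$(( strip_size * per_group * factor ))   # bytes written before moving on
  group=$(( (offset / group_bytes) % num_groups ))
  in_group=$(( (offset % group_bytes) / strip_size % per_group ))
  echo "group=${group} server=$(( group * per_group + in_group ))"
}

# 32KB strip, 16 servers, 4 groups, group strip factor 8 (the example above)
for off in 0 32768 131072 1048576 4194304; do
  printf '%8d -> ' "${off}"
  twod_locate "${off}" 32768 16 4 8
done
```

With these settings the first 1MB stays within group 0 (servers 0 to 3), offset 1048576 starts group 1 on server 4, and offset 4194304 wraps back to group 0.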

default strip size = 64K
default number of groups = 2
default strip factor = 256

"varstrip_dist" == control which servers to stripe across and the strip size on each server
$ setfattr -n "user.pvfs2.dist_name" -v "varstrip_dist" dir

The format of "user.pvfs2.dist_params" (based on online examples
and pvfs2-dist-varstrip.h):
"strips:node_number:size;node_number:size[;node_number:size]..."
example: "strips:0:512;1:512"
results in: node number 0: bytes 0-511
            node number 1: bytes 512-1023

For example the following should give a 256K stripe over the
16 servers (the same as "simple_stripe" with "strip_size:16384")

$ setfattr -n "user.pvfs2.dist_params" -v "strips:0:16384;\
1:16384;2:16384;3:16384;4:16384;5:16384;6:16384;7:16384;\
8:16384;9:16384;10:16384;11:16384;12:16384;13:16384;14:16384;\
15:16384" dir
$

This would be more useful for creating (e.g.) two directories
with non overlapping 256K stripes, each over 8 servers like this:

$ setfattr -n "user.pvfs2.dist_name" -v "varstrip_dist" dira
$ setfattr -n "user.pvfs2.dist_params" -v "strips:0:32768;\
1:32768;2:32768;3:32768;4:32768;5:32768;6:32768;7:32768" dira
$ setfattr -n "user.pvfs2.dist_name" -v "varstrip_dist" dirb
$ setfattr -n "user.pvfs2.dist_params" -v "strips:8:32768;\
9:32768;10:32768;11:32768;12:32768;13:32768;14:32768;\
15:32768" dirb
$

Note: The "node_number:strip_size" pairs are separated by semicolons ';'
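
Typing out sixteen "node_number:size" pairs by hand is error prone, so the parameter string can be generated instead. A sketch, assuming a contiguous range of server numbers with the same strip size on each; the function name varstrip_params is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build a varstrip_dist parameter string for servers
# first..last (inclusive), each with the same strip size in bytes.
varstrip_params() {
  first=$1; last=$2; size=$3
  params="strips:"
  n=${first}
  while [ "${n}" -le "${last}" ]; do
    if [ "${n}" -gt "${first}" ]; then
      params="${params};"          # pairs are separated by semicolons
    fi
    params="${params}${n}:${size}"
    n=$(( n + 1 ))
  done
  echo "${params}"
}

varstrip_params 0 1 512       # the two-server example above
varstrip_params 8 15 32768    # the "dirb" example above
```

The result can be passed straight to setfattr, e.g. setfattr -n "user.pvfs2.dist_params" -v "$(varstrip_params 8 15 32768)" dirb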

There are no default parameters per se, but if you're going to use
this distribution you will want to set the distribution parameters
anyhow...

To use MPICH2 with ROMIO for PVFSv2 with the GNU gcc compiler you will
need to load the mpich2 module as follows:

$ module load mpich2/gcc-3.4.6/1.0.7-pvfs2

Note: Remember to add this command to your PBS job scripts to set
the runtime library path

Summary of Performance Tests against the /pvfs2/scratch filesystem


The following is a summary of the performance tests against the PVFSv2 filesystem
served from 16 nodes, each node running as both a data and a metadata server.
The filesystem was mounted on 16 client-only nodes on the /pvfs2/scratch directory.
             ---------------------------------------------------------------------------------

Test data written to a subdirectory of /pvfs2/scratch using the "simple_stripe"
distribution over a 256K stripe with iozone writing 4GB of data from each client
in 256K blocks.

Mean average of test values:

Initial write    321,058.83 KB/sec
Rewrite          212,626.53 KB/sec
Read             161,611.95 KB/sec
Re-read          154,799.58 KB/sec

[Detailed test results for /pvfs2/scratch simple_stripe tests]

Test data written to a subdirectory of /pvfs2/scratch using the "twod_stripe"
distribution with 4 groups of 4 servers, striping a 4 x 64KB = 256KB stripe over
each group and writing 8 x 256KB = 2MB to each group in turn, with
iozone writing 4GB of data from each client in 256K blocks.

Mean average of test values:

Initial write    365,058.37 KB/sec
Rewrite          243,681.57 KB/sec
Read             175,666.28 KB/sec
Re-read          171,005.77 KB/sec

[Detailed test results for /pvfs2/scratch twod_stripe tests]

Test data written to a subdirectory of /pvfs2/scratch using the "varstrip_dist"
distribution over a 256K stripe with iozone writing 4GB of data from each client
in 256K blocks.
The distribution was configured to write 16K to each node, thereby creating a
256K stripe over the 16 server nodes.

Mean average of test values:

Initial write    298,962.00 KB/sec
Rewrite          197,318.79 KB/sec
Read             203,210.17 KB/sec
Re-read          186,866.10 KB/sec

[Detailed test results for /pvfs2/scratch varstrip_dist tests]

[Full Test Results for /pvfs2/scratch tests]


PVFS v2 filesystem on-the-fly

I've been working on an option for creating a PVFSv2 filesystem on
the fly for a PBS job. The filesystem is created across all the
nodes in the job using space in the local /scratch directory.
This is working in my tests when 2 processors per node are requested
for the PBS job (i.e. the nodes are dedicated to the PBS job), but I
have seen problems when running multiple jobs using one processor
per node, i.e. when there are two PVFS filesystems created with two servers
running on each node. With one processor per node I have seen problems
mounting the on-the-fly filesystem - I'm working on this now...

If you add the tag "PVFS" to a PBS job script a PVFS filesystem
is created across all the (compute) nodes in the job.

This tag should not be used for jobs that use the PVFSv2 filesystem
/pvfs2/scratch (served from nodes c29n17 through c29n32).

This tag is only available on nodes in the pvfs queue
(nodes c29n01 through c29n16) for now because the pvfs and
pvfs-kernelmodule RPMs are required to create & mount the
filesystem

The filesystem is mounted on the directory:

${PBSTMPDIR}/mnt

I source a Bourne shell script which sets several PVFS environment variables:

source /var/spool/pbs/mom_priv/pvfs_globals ${PBS_JOBID}

...though this is not required to access the filesystem

Having sourced this script I can use ${PVFS2_CLIENT_MOUNT_POINT}
for the client mount point.
The script also sets ${PVFS2TAB_FILE} which has to be set if
you want to use commands like pvfs2-ping for debugging.
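
A job script can check that these variables are actually set before relying on them. This is a sketch only: the function name check_pvfs_env is made up, and the pvfs2-ping call is guarded so the snippet does nothing harmful on a machine without the PVFS tools (the -m option is the usual way to point pvfs2-ping at a mount point; check pvfs2-ping -h on your node if in doubt).

```shell
#!/bin/sh
# Hypothetical check: confirm the variables set by pvfs_globals are present.
check_pvfs_env() {
  if [ -z "${PVFS2TAB_FILE:-}" ] || [ -z "${PVFS2_CLIENT_MOUNT_POINT:-}" ]; then
    echo "PVFS environment not set; source pvfs_globals first" >&2
    return 1
  fi
  echo "mount point: ${PVFS2_CLIENT_MOUNT_POINT}"
  echo "tab file: ${PVFS2TAB_FILE}"
}

# Only try pvfs2-ping when the environment is set and the tool is installed.
if check_pvfs_env && command -v pvfs2-ping >/dev/null 2>&1; then
  pvfs2-ping -m "${PVFS2_CLIENT_MOUNT_POINT}"
fi
```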

Sample PBS job script to run iozone. As listed it tests /pvfs2/scratch (see TEST_DIR); to test a PVFSv2 "on the fly" filesystem, point TEST_DIR at a directory under ${PVFS2_CLIENT_MOUNT_POINT} instead:
#!/bin/sh
#
# PBS iozone job
#
# for date and time specific filenames submit this job with
#
#    qsub -v date_and_time="`date +%b_%d_%Y__%T`" _this_script_
#
#
# written by:
#               Tony Kew
#               SAN Administrator
#               The Center for Computational Research
#               New York State Center of Excellence
#                in Bioinformatics & Life Sciences
#               701 Ellicott Street, Buffalo, NY 14203
#               
#               Office: (716) 881-8930           Fax: (716) 849-665
#
#PBS -q pvfs
# 1 hour walltime, 16 nodes, 2 processors per node
#PBS -l walltime=01:00:00,nodes=16:ppn=2
# output filename:
#PBS -o /san/user/../../${HOME}/PVFSv2/test/output/iozone_16_node_4gig_PVFSv2_scratch_simple_stripe_256K_stripe_${date_and_time}.out
# merge job standard error into standard out
#PBS -j oe
# send email when the job terminates:
#PBS -m e
# job name:
#PBS -N PVFSv2_iozone_test

TEST_DIR=/pvfs2/scratch/iozone_test_simple_stripe_256K_stripe
#TEST_SIZE=1g
TEST_SIZE=4g
NUMBER_OF_NODES=16

OUTPUT_EXCEL=iozone_${NUMBER_OF_NODES}_node_${TEST_SIZE}ig_PVFSv2_scratch_simple_stripe_256K_stripe_${date_and_time}.xls

# iozone local binary builds:
#IOZONE_BINARY=/san/fs1/util64/benchmark/iozone3_279/src/current/iozone
#IOZONE_BINARY=/san/fs1/util64/benchmark/iozone-3_263/iozone3_263/src/current/iozone
IOZONE_BINARY=/san/fs1/util64/benchmark/iozone3_283/src/current/iozone
# iozone RPM location:
#IOZONE_BINARY=/opt/iozone/bin/iozone

rm -rf ${TEST_DIR}
mkdir -p ${TEST_DIR}
# PVFS tuning for the Test directory
# - all files in the directory will be created with these parameters
setfattr -n "user.pvfs2.dist_name" -v "simple_stripe" ${TEST_DIR}
# note: "simple_stripe" is the default so the above is not really necessary...
setfattr -n "user.pvfs2.dist_params" -v "strip_size:16384" ${TEST_DIR}
#

mkdir -p ${HOME}/PVFSv2/test/output

# unlimited stack size
ulimit -s unlimited
# unlimited coredump size
#ulimit -c unlimited
# no coredumps
ulimit -c 0

cd ${PBSTMPDIR}

IOZONE_LOCAL_BINARY="${PBSTMPDIR}/`basename ${IOZONE_BINARY}`"
> iozone_nodelist_$$
cat ${PBS_NODEFILE} | sort -r | uniq | while read node
do
  rcp ${IOZONE_BINARY} ${node}:${PBSTMPDIR}
  #scp ${IOZONE_BINARY} ${node}:${PBSTMPDIR}
  echo "${node} ${TEST_DIR} ${IOZONE_LOCAL_BINARY}" >> iozone_nodelist_$$
done

# in case the head node is not in the nodelist (it should be)
cp ${IOZONE_BINARY} ${PBSTMPDIR}

${IOZONE_LOCAL_BINARY} -RM -r 256k -s ${TEST_SIZE} -l 1 -u ${NUMBER_OF_NODES} \
 -i 0 -i 1 -c -t ${NUMBER_OF_NODES} -+m iozone_nodelist_$$ -b ${OUTPUT_EXCEL}

if [ -f ${OUTPUT_EXCEL} ]
then
  #mv ${OUTPUT_EXCEL} ${PBS_O_WORKDIR}
  mv ${OUTPUT_EXCEL} ${HOME}/PVFSv2/test/output/
fi

Summary of Performance Tests against PVFSv2 on the fly filesystems


The following is a summary of the performance tests using a PVFSv2 on the fly filesystem
over 16 nodes, where all nodes run data server, metadata server & client.
The filesystem was mounted on the ${PBSTMPDIR}/mnt directory.
             ---------------------------------------------------------------------------------

Test data written to a subdirectory of the PVFSv2 on the fly filesystem mounted on
${PBSTMPDIR}/mnt using the "simple_stripe" distribution over a 256K stripe with
iozone writing 4GB of data from each client in 256K blocks.

Mean Average numbers:

Initial write    219,306.19 KB/sec
Rewrite          130,799.13 KB/sec
Read             183,249.66 KB/sec
Re-read          191,565.02 KB/sec

[Detailed test results for PVFSv2 on the fly filesystem simple_stripe tests]

Test data written to a subdirectory of the PVFSv2 on the fly filesystem mounted on
${PBSTMPDIR}/mnt using the "twod_stripe" distribution using 4 groups of 4 servers,
striping 4 x 64KB = 256KB stripe over each group, writing 8 * 256KB = 2MB to each
group in turn. iozone writing 4GB of data from each client in 256K blocks.

Mean Average numbers:

Initial write    343,876.68 KB/sec
Rewrite          229,740.04 KB/sec
Read             167,045.91 KB/sec
Re-read          166,417.03 KB/sec

[Detailed test results for PVFSv2 on the fly filesystem twod_stripe tests]

Test data written to a subdirectory of the PVFSv2 on the fly filesystem mounted on
${PBSTMPDIR}/mnt using the "varstrip_dist" distribution with iozone writing 4GB
of data from each client in 256K blocks.
The distribution was configured to write 16K to each node, thereby creating a
256K stripe over the 16 nodes:

Mean Average numbers:

Initial write    69,381.52 KB/sec
Rewrite          69,611.84 KB/sec
Read             64,910.00 KB/sec
Re-read          65,248.22 KB/sec

[Detailed test results for PVFSv2 on the fly filesystem varstrip_dist tests]

[Full Test Results for PVFSv2 on the fly filesystem tests]

Center for Computational Research - University at Buffalo - State University of New York