Solaris Express ZFS root install

Posted by Dick on June 06, 2008

Finally!

 

About frigging time

 

ZFS root has been in Indiana (aka OpenSolaris 2008.05) for a while, but I prefer Solaris Express.
As of build 90, it’s supported by the installer.

I installed it on my crappy P4 test box : 1Gb Ram, twin 40Gb disks. Burn the DVD ISO and boot it if you want to play along.

the secret handshake

Choose ‘Solaris Express’ (not ‘Solaris Express Developer Edition’), then ‘3. Solaris Interactive Text (Desktop Session)’.

Only the ‘Interactive Text’ options have the ZFS root option.
Running a ‘Desktop’ rather than a ‘Console’ session lets you start a Terminal
to enable compression on the pool when it’s created (40Gb disks, remember?).

Enabling ZFS compression won’t convert blocks that have already been written (good explanation here), so you want to do it before you populate the filesystem.

you know the drill

  • Choose ‘English’
  • [X starts up]
  • right-click the desktop and choose ‘programs -> terminal’
  • system id
    • networked: yes
    • DHCP: yes
    • IPv6: no
    • [it'll do a DHCP request]
    • Kerberos: no
    • Name service : None (no need if you’re on DHCP)
    • NFS domain : Use NFSv4 domain derived by the system
    • Time Zone : Europe -> Britain (UK)
    • root password : ‘secret’ (heh)
  • F2. Standard install
  • Manually eject DVD
  • Manually reboot
  • Choose Media : CD/DVD
  • Accept license
  • Geographic regions – leave all blank
  • System Locale : POSIX C (C)
  • Web Start : None
  • Choose Filesystem Type : ZFS
  • Select Software : Entire Distribution
  • choose both disks (this makes a ZFS mirrored pool)
  • select ‘put /var on a separate dataset’ (personal choice, but stops / filling up)
  • Skip ‘Mount Remote File Systems’

fingers on buzzers

In the Terminal you opened, create a script called ‘readysetgo.sh’


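#! /bin/sh
# spin until the installer creates the pool, then switch compression on
until zpool list rpool > /dev/null 2>&1
do
:
done
zfs set compression=on rpool
# same again for the dataset the boot environment lives under
until zfs list rpool/ROOT > /dev/null 2>&1
do
:
done
zfs set compression=on rpool/ROOT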

Then just run

sh readysetgo.sh

in the Terminal you opened earlier.
You can now start the install. Once the pool is created, it’ll have compression
enabled automatically.

scrooged

Let’s see how much benefit we got. The ‘Entire Distribution’ took about
5Gb of disk without compression, and looks to be about 3Gb with it.

vera:~ $ zfs get -r compressratio rpool
NAME                    PROPERTY       VALUE                   SOURCE
rpool                   compressratio  1.62x                   -
rpool/ROOT              compressratio  1.74x                   -
rpool/ROOT/snv_90       compressratio  1.74x                   -
rpool/ROOT/snv_90/var   compressratio  2.60x                   -
rpool/dump              compressratio  1.00x                   -
rpool/swap              compressratio  1.37x                   -
vera:~ $
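
As a rough sanity check on those figures (back-of-envelope only, using the eyeballed 5Gb and 3Gb rather than anything measured), the overall ratio can be worked out in the same format zfs prints. The `ratio` function is just a name made up for this example.

```shell
#!/bin/sh
# ratio RAW PACKED: uncompressed size divided by compressed size,
# printed the way compressratio is shown
ratio() {
    awk -v raw="$1" -v packed="$2" 'BEGIN { printf "%.2fx\n", raw / packed }'
}

# roughly 5Gb uncompressed against roughly 3Gb compressed
ratio 5 3    # prints 1.67x
```

That lands between the 1.62x reported for rpool and the 1.74x for rpool/ROOT, so the rough sizes look plausible.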

(UPDATED - thanks to Glenn’s blog for the neater script, and Andrew in the comments for tidying it up further)

ZFS, Leopard and baseless speculation

Posted by Dick on November 09, 2007

Just upgraded my work mac to Leopard. Took an hour, worked flawlessly.

Of course, the first thing I did was stick in a USB stick which holds half a zpool from my Solaris Express box at home:

  planb:~ $ zpool import
    pool: sticky
      id: 4692054964394431575
   state: FAULTED
  status: The pool is formatted using an incompatible version.
  action: The pool cannot be imported.
  Access the pool on a system running newer
    software, or recreate the pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-A5
config:

    sticky          UNAVAIL   newer version
      mirror        DEGRADED
        dsk/c6t0d0  UNAVAIL   cannot open
        disk3       ONLINE
planb:~ $ zpool import sticky
cannot import 'sticky': pool is formatted using a newer ZFS version

i.e. “I know this is half of a mirror, but it’s a newer ZFS version than Apple’s”.

what’s new pussycat?

So I reformatted the disk on my Solaris 10 update 4 box:

vera / # rmformat
Looking for devices...
...
...
     3. Volmgt Node: /vol/dev/aliases/rmdisk0
        Logical Node: /dev/rdsk/c4t0d0p0
        Physical Node: /pci@0,0/pci8086,4c43@1d,7/storage@7/disk@0,0
        Connected Device: Kingston DataTraveler 2.0 PMAP
        Device Type: Removable
vera / # zpool create sticky c4t0d0p0

This is how it looks on Solaris 10:

vera / # zpool status sticky
  pool: sticky
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        sticky      ONLINE       0     0     0
          c4t0d0p0  ONLINE       0     0     0

errors: No known data errors
vera / # zfs list /sticky
NAME     USED  AVAIL  REFER  MOUNTPOINT
sticky    85K  1.87G  24.5K  /sticky

So I copied a bit of data on, zpool exported and stuck it back in the Mac.

planb:~ $ zpool import sticky
planb:~ $ zpool status
  pool: sticky
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
    still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
    pool will no longer be accessible on older software versions.
 scrub: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    sticky      ONLINE       0     0     0
      disk3     ONLINE       0     0     0

errors: No known data errors
planb:~ $ zfs list
NAME     USED  AVAIL  REFER  MOUNTPOINT
sticky  30.8M  1.84G  30.7M  /Volumes/sticky
planb:~ $ ls -l /Volumes/sticky/
total 3
dr-xr-xr-x+ 4 root  sys  6  7 Nov 00:10 local
planb:~ $ touch /Volumes/sticky/bummer
touch: /Volumes/sticky/bummer: Read-only file system

Much better.

pussy galore

Now for the crazy conspiracy theory bit:

planb:~ $ zpool get all sticky
cannot get property 'name': pool must be upgraded to support pool properties
cannot get property 'bootfs': pool must be upgraded to support pool properties
cannot get property 'bootfs': pool must be upgraded to support pool properties

i.e. “this zpool format doesn’t support bootable volumes, but I do.”

ZFS rolling snapshots

Posted by Dick on October 12, 2007

I’ve got more zones than I know what to do with now, but they’re all interdependent.

I need to snapshot them all at the same time if I want to be able to rollback consistently.

Since it’s a backup it needs as few moving parts as possible.

I’m really more of a perl guy

As you saw yesterday, my shell scripting is pretty bloody awful.
But here’s a dead easy way to get daily snapshots across an entire storage pool.

   vera bin # cat /opt/local/bin/citizensnaps
   #! /bin/sh

   ZPOOL=$1

   if [ -z "$ZPOOL" ] ; then
           # double quotes so $0 expands to the script name
           echo "Usage: $0 poolname" ; exit 1
   fi

   # check we have a pool with that name
   /usr/sbin/zpool status "$ZPOOL" > /dev/null || exit 1

   TODAY=`date +%A`

   # get rid of last weeks snapshot, and create a fresh one
   pfexec zfs destroy -r ${ZPOOL}@${TODAY}
   pfexec zfs snapshot -r ${ZPOOL}@${TODAY}

Then run it out of root’s crontab (or that of another user with the correct rights profile) every night:

    # daily snapshot of everything
    59 23 * * * /opt/local/bin/citizensnaps tank
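
Because the snapshot name is just the weekday, a week of nightly runs only ever cycles through seven names per dataset, and each run replaces the snapshot taken seven days earlier. Here is a dry-run sketch of that rotation: `snap_cmds` is a made-up helper that only prints the commands, and it assumes `date +%A` gives English day names (it follows your locale).

```shell
#!/bin/sh
# snap_cmds POOL [DAY]: print the commands one nightly run would issue.
# Nothing is executed; DAY defaults to today's weekday name.
snap_cmds() {
    pool=$1
    day=${2:-`date +%A`}
    echo "pfexec zfs destroy -r ${pool}@${day}"
    echo "pfexec zfs snapshot -r ${pool}@${day}"
}

# a full week only ever touches these seven snapshot names
for d in Monday Tuesday Wednesday Thursday Friday Saturday Sunday
do
    snap_cmds tank $d
done
```

So you get seven days of rollback points and never more than seven snapshots per dataset.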

Bear in mind this is a dirt-simple, ‘works for me’ solution. If you want something more polished, look at Tim Foster’s excellent work (which is making its way into OpenSolaris).

fast zone cloning on Solaris 10

Posted by Dick on October 11, 2007

Glassfish seems like a natural successor to Tomcat.
The clustering features look interesting, but I only have the one machine.

Hmm. I’m going to need a shitload of zones.

send in the clones

The ‘zoneadm clone’ command creates a zone by copying an existing zonepath (to avoid going through the install twice).
On Solaris Express, zones on ZFS can be cloned in about a second.
Solaris 10 (update 4) has to actually copy the files, so we’ll use a trick to avoid that.

the master plan

  • build 1 ‘template’ zone on ZFS
  • configure it to a ‘standard build’
  • take a ZFS snapshot of the zonepath
  • ZFS clone the snapshot N times to make N zonepaths
  • run zonecfg and hook up each zonepath
  • boot them
  • ssh in and install whatever you like

build your template zone

We’ll quickly make a bog-standard ‘whole root’ zone.

This takes more disk (and longer to install) than a sparse zone,
but gives you maximum flexibility (you can write to /usr, etc.).

zfs create -o mountpoint=/zones vera/zones
zfs create -o compression=on vera/zones/template
zonecfg -z template "create -b; \
    set zonepath=/zones/template ;\
    commit ; exit"
chmod 700 /zones/template/
time zoneadm -z template install

As I said, that takes a while (a sparse zone installs in about 5 minutes):

real    21m30.749s
user    1m18.566s
sys     3m35.917s

Good job we only have to do it once.

tweak it like you mean it

You could clone the zonepath now (skip ahead to ’say cheese’), but
since I tend to setup my machines the same way, I’ll customize things first.

First thing to do is boot the zone, and complete the system identification.

zoneadm -z template boot
zlogin -C -e. template

The zlogin command means :

  • get me a console (-C) login to do system setup
    • sysconfig runs on the zone console, so a straight zlogin isn’t enough
  • type ’..’ (-e.) to be dropped back to the main zone
    • the default sequence is ~., which will kill your ssh session to the global zone

You’ll see a counter as the SMF database is generated on first boot
(which takes a few minutes; again, we only need to do this in the template):

[Connected to zone 'template' console]
 37/138

Then go through the standard Solaris sysconfig
(doesn’t matter what you enter – this is overridden on a per-zone basis).

When that’s done, the zone will reboot itself (hit ’..’ to exit zlogin).

Now do your ‘standard build’. My list :

  • change root’s shell and prompt
  • copy my public SSH keys so I can ssh in as root
  • setup sendmail
  • turn off some daemons

Since that’s what I did for my original Solaris install
I can just copy files to do most of this.

zlogin template usermod -s /usr/bin/bash root
cp /.bash_profile /zones/template/root/
cp /etc/ssh/sshd_config /zones/template/root/etc/ssh/sshd_config
cp -Rp /.ssh/ /zones/template/root/.ssh/
cp /etc/mail/sendmail.cf /zones/template/root/etc/mail/sendmail.cf
cp /etc/mail/aliases /zones/template/root/etc/mail/aliases
cp /etc/mail/aliases.db /zones/template/root/etc/mail/aliases.db
for i in webconsole sendmail autofs
do
zlogin template svcadm disable $i
done

say cheese

     zlogin template
     # sys-unconfig # this also halts the 'template' zone
     zoneadm -z template detach
     zfs snapshot vera/zones/template@clean
     zoneadm -z template attach

(the last ‘attach’ command makes patching the zone slightly easier).

going around the houses

Now we can use that to create a new zonepath for our DB zone, ganesh:

zfs clone vera/zones/template@clean vera/zones/ganesh

Life is a LOT easier if you separate your OS from your data, so I also give the zone its own ZFS filesystem (what we call ‘delegating a dataset’) to install its apps etc. on.
(Note that although the zonepath is on ZFS, the zone is not ‘aware’ of that, so you can’t create ZFS filesystems on it.)
This also lets zone admins run their own snapshots etc. (snapping from the global zone works too, so choose your preference).

zfs create -o mountpoint=none vera/delegated/ganesh
zfs set quota=5G vera/delegated/ganesh

zonecfg supports ‘create -a’ to attach a pre-built zoneroot and generate a
config for it. We also

  • set it to boot at system startup (’autoboot’)
  • add a network address (’add net’)
  • apply some simple resource controls (’add cpu-shares/max-lwps/capped-memory’)
    zonecfg -z ganesh "create -a /zones/ganesh;set autoboot=true; \
    add net; set physical=iprb0; set address=10.1.0.1/24; end; \
    set cpu-shares=20; set max-lwps=400; \
    add capped-memory; set physical=400m; set swap=500m; end; \
    add dataset ; set name=vera/delegated/ganesh; end; \
    commit; exit"
    zoneadm -z ganesh attach

feed some prepared answers to sysconfig:

sed s/ZONENAME/ganesh/ \
/zones/scripts/sysidcfg.template > /zones/ganesh/root/etc/sysidcfg
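
The template itself isn’t shown here, so this is only a guess at the idea: a sysidcfg file with ZONENAME as a placeholder, run through the same sed substitution. The keywords below are standard sysidcfg ones, but the values (and the `render_sysidcfg` name) are assumptions for illustration, not the contents of the real sysidcfg.template.

```shell
#!/bin/sh
# render_sysidcfg ZONE: fill a stand-in template with the zone name.
# The template text is hypothetical; root_password is left out, so
# sysconfig would still prompt for that one.
render_sysidcfg() {
    zone=$1
    sed "s/ZONENAME/${zone}/" <<'EOF'
system_locale=C
terminal=xterm
network_interface=PRIMARY { hostname=ZONENAME }
security_policy=NONE
name_service=NONE
timezone=Europe/London
nfs4_domain=dynamic
EOF
}

render_sysidcfg ganesh
```

Each zone gets its own hostname in the answers file, while everything else stays identical across clones.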

and finally boot it

zoneadm -z ganesh boot

attack of the clones

That’s the database taken care of.
We now have 3 more to do, and this is pretty easy to script.
I threw something together to do the job for me.
It’s pretty stinky (I don’t really speak shell) but should be easy for you to roll your own.
You’ll need the script and the template for sysidcfg.

cd /zones/scripts
wget http://files.hellooperator.net/solaris/zones/s10/scripts/bang_one_out.s10u4.sh
wget http://files.hellooperator.net/solaris/zones/s10/scripts/sysidcfg.template
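
If you’d rather roll your own, here is a dry-run sketch of what a per-zone script might do. To be clear, this is NOT the downloaded bang_one_out script: it just prints the manual steps used for ganesh with the zone name swapped in. The `clone_zone_cmds` name, the address argument and the trimmed-down zonecfg line are all assumptions for illustration.

```shell
#!/bin/sh
# clone_zone_cmds ZONE ADDRESS: print (not run) the steps for one clone.
# Pool, template snapshot and NIC names are taken from the ganesh example;
# the address is whatever you pass in.
clone_zone_cmds() {
    zone=$1
    addr=$2
    echo "zfs clone vera/zones/template@clean vera/zones/${zone}"
    echo "zonecfg -z ${zone} \"create -a /zones/${zone}; set autoboot=true; add net; set physical=iprb0; set address=${addr}; end; commit; exit\""
    echo "zoneadm -z ${zone} attach"
    echo "sed s/ZONENAME/${zone}/ /zones/scripts/sysidcfg.template > /zones/${zone}/root/etc/sysidcfg"
    echo "zoneadm -z ${zone} boot"
}

# 10.1.0.2/24 is a made-up address for the example
clone_zone_cmds kingfish 10.1.0.2/24
```

Printing first and piping to sh once it looks right is a cheap way to avoid cloning the wrong thing.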

Now the payoff:

time for i in kingfish rippyfish turnipfish
 do
   /zones/scripts/bang_one_out.s10u4.sh $i
 done
real    0m14.409s
user    0m2.459s
sys     0m1.097s
zoneadm list -iv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   6 ganesh           running    /zones/ganesh                  native   shared
  25 kingfish         running    /zones/kingfish                native   shared
  27 rippyfish        running    /zones/rippyfish               native   shared
  29 turnipfish       running    /zones/turnipfish              native   shared

did you see that?

That’s 15 SECONDS to do what took 20 minutes the first time. Except these zones are configured and booted ready to ssh into.

Oh, and there are 3 of them.

I use zone cloning like Jumpstart – a way to
get a known, reproducible base OS as a building block for other things.

You can clone zones whatever FS they’re on, but it will take
longer to copy files than to snapshot+clone (especially for whole root zones).

The great thing about ZFS snapshots and clones is that a clone only uses disk space for the changes from its parent snapshot. It’s not obvious at the filesystem level:

du -hs  /zones/template /zones/ganesh
 2.1G   /zones/template
 2.3G   /zones/ganesh

But you can see it in the dataset (the ‘USED’ field below):

zfs list  vera/zones/template vera/zones/ganesh
NAME                  USED  AVAIL  REFER  MOUNTPOINT
vera/zones/ganesh    35.1M  28.6G  2.11G  /zones/ganesh
vera/zones/template  2.13G  28.6G  2.10G  /zones/template
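
To put a number on it, here is a small sketch that compares the USED and REFER columns. It runs against a copy of the listing above rather than live `zfs list` output, `overhead` is a made-up name, and it only understands the M and G suffixes.

```shell
#!/bin/sh
# overhead: read 'zfs list' style output on stdin and show USED as a
# percentage of REFER for each dataset (M and G suffixes only)
overhead() {
    awk '
    function mb(s) { return (s ~ /G$/) ? (s + 0) * 1024 : s + 0 }
    NR > 1 { printf "%s: USED is %.1f%% of REFER\n", $1, 100 * mb($2) / mb($4) }
    '
}

# sample input copied from the listing above
overhead <<'EOF'
NAME                  USED  AVAIL  REFER  MOUNTPOINT
vera/zones/ganesh    35.1M  28.6G  2.11G  /zones/ganesh
vera/zones/template  2.13G  28.6G  2.10G  /zones/template
EOF
```

The clone comes out at about 1.6% of what it references, while the template is paying for all of its own blocks.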

Finally, remember you can clone any zone.
A common problem we have is our test and dev systems getting out of step with our production boxes.
If they’re zones (and they will be if I have a say in it), you can easily clone the live box (and its database zone) to get a testbed for upgrades, config changes, etc. that is as close to reality as you can get.

Solaris 10 on mirrored disks

Posted by Dick on September 27, 2007

Solaris 10 update 4 is out, and so is Glassfish v2. First we need to get our
OS on.

My test x86 machine is a 3Ghz P4 with 1Gb RAM and twin 40Gb disks.
Disks are a bit pokey, but having 2 makes playing around with RAID and ZFS more fun.

Since ZFS root isn’t here yet, I’ll use Solaris Volume Manager (SVM) to mirror the root
filesystem. Applications, /export/home , etc. will live on a ZFS mirrored pool.

(NB: the procedure to install Solaris Express is almost identical, except you can skip the PCA step)

sunshine in a bag

I got the Solaris 10 Update 4 DVD ISO and burnt it off.
The install is straightforward, with a couple of caveats:

  • SVM can only mirror slices on solaris fdisk partitions, so make 1 big solaris primary partition.
  • only install onto the first disk (c1d0) – we’ll add the second one later.
  • choose ‘custom install’ to choose your disk layout
    slice  file system  size     notes
    0      /            6000Mb
    1      swap         1100Mb   (must be bigger than RAM to save crashdumps)
    3      /metadb      10Mb     (this is just to reserve the space for SVM bookkeeping)
    7      /zfs         32035Mb  (the rest of the disk will be a ZFS storage pool)
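
A quick bit of arithmetic on that layout, as a sketch: the variable names are made up, and the 1024Mb figure assumes the 1Gb of RAM mentioned earlier.

```shell
#!/bin/sh
# slice sizes in Mb, copied from the layout above
ROOT=6000
SWAP=1100
METADB=10
ZFS=32035
RAM=1024    # assumed: 1Gb of RAM

total=`expr $ROOT + $SWAP + $METADB + $ZFS`
echo "total allocated: ${total}Mb"

# the swap slice has to be bigger than RAM to hold a crashdump
if [ $SWAP -gt $RAM ]; then
    echo "swap (${SWAP}Mb) is bigger than RAM (${RAM}Mb)"
fi
```

The four slices add up to 39145Mb, which is about what you would expect to be usable on a nominal 40Gb disk.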

Note I haven’t set up a slice for Live Upgrade. I’ll detach one submirror before an upgrade, then I can rollback or keep the upgrade by choosing which way to resync them afterwards.

I chose ‘Entire Distribution’, then went off to find a sandwich and play a bit of Hotel Dusk.

After the reboot, you can login as root,
unmount /metadb (c1d0s3) and /zfs (c1d0s7) , remove them from /etc/vfstab,
and delete the mountpoints (you could just set them up later, but the installer is a bit easier to explain than the ‘format’ command).

slice up the second disk

We’ll set the second disk to have 1 Solaris fdisk partition.
Pipe the disklabel from c1d0 onto c2d0 so the slice sizes on both are identical:

  fdisk -B /dev/rdsk/c2d0p0
  prtvtoc /dev/rdsk/c1d0s2 | fmthard -s - /dev/rdsk/c2d0s2

We also need to install grub, so it’s bootable if the first disk dies:

/sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c2d0s0
  stage1 written to partition 0 sector 0 (abs 16065)
  stage2 written to partition 0, 260 sectors starting at 50 (abs 16115)

Add an entry for c2d0 in /boot/grub/menu.lst:

  # second half of SVM root mirror
  title alternate root
  root (hd1,0,a)
  kernel /platform/i86pc/multiboot
  module /platform/i86pc/boot_archive

  title alternate root failsafe
  root (hd1,0,a)
  kernel /boot/multiboot kernel/unix -s
  module /boot/x86.miniroot-safe

setting up the state databases

SVM stores its config on-disk, in state database replicas.
You need half of them to be online at any given time, which means I
need 2 copies on each disk (each is about 4Mb, hence the 10Mb /metadb slice I set aside):

metadb -a -f -c 2 c1d0s3 c2d0s3

which says :

  • add some state database replicas (-a)
  • it’s ok that there aren’t any existing replicas (-f)
  • there’ll be 2 database replicas on each device (-c 2)
  • and use the slices we set aside earlier (c1d0s3 c2d0s3)

Check they got created OK:

metadb
      flags           first blk       block count
   a        u         16              8192            /dev/dsk/c1d0s3
   a        u         8208            8192            /dev/dsk/c1d0s3
   a        u         16              8192            /dev/dsk/c2d0s3
   a        u         8208            8192            /dev/dsk/c2d0s3

The ‘u’ flag means the replica is up to date (’metadb -i’ gives a legend).
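
The numbers in that listing can be checked with a little arithmetic. This is a sketch that assumes the block counts are in 512-byte disk blocks; `replica_mb` is a made-up helper name.

```shell
#!/bin/sh
# replica_mb BLOCKS: size of one replica in Mb, assuming 512-byte blocks
replica_mb() {
    expr $1 \* 512 / 1024 / 1024
}

PER_SLICE=2                          # from 'metadb -c 2'
each=`replica_mb 8192`               # block count from the listing above
slice=`expr $each \* $PER_SLICE`     # space needed on each /metadb slice
total=`expr 2 \* $PER_SLICE`         # two disks
left=`expr $total - $PER_SLICE`      # what survives if one disk dies

echo "each replica: ${each}Mb, per slice: ${slice}Mb"
echo "replicas online after losing one disk: ${left} of ${total}"
```

So each slice holds 8Mb of replicas, which fits in the 10Mb set aside, and losing either disk still leaves half of the replicas online.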

setting up the root RAID-1 mirror

I’ll use my existing root fs as one submirror, then hook up the second disk.

First we tell SVM about the (existing) root slice:

metainit -f d1 1 1 c1d0s0
  d1: Concat/Stripe is setup

which says:

  • make a volume called d1 (d1)
  • with one stripe (1)
  • with one component per stripe (1)
  • out of my existing root slice (c1d0s0)
  • oh, and yes, I know it contains a filesystem (-f)

We do the same thing for the second disk’s root slice (this is empty, so we don’t need ’-f’):

metainit d2 1 1 c2d0s0
   d2: Concat/Stripe is setup

Now we create a mirror volume made up of the populated submirror, d1:

metainit d0 -m d1
   d0: mirror is setup

which says:

  • make a volume called d0 (d0)
  • which is a mirror made up of volume d1 ( -m d1)

I’ll start using this volume as the root fs
before I attach the other submirror (if you’re going to fail, fail early).
The ‘metaroot’ command edits /etc/vfstab and /etc/system for you:

metaroot d0
reboot

And when it comes back up, we’re running on the logical device:

df -h /
  Filesystem             size   used  avail capacity  Mounted on
  /dev/md/dsk/d0         5.8G   3.1G   2.6G    56%    /

Last thing to do is attach the other half of the mirror:

metattach d0 d2

You can watch the mirror syncing up:

metastat -c
  d0               m  5.9GB d1 d2 (resync-15%)
      d1           s  5.9GB c1d0s0
      d2           s  5.9GB c2d0s0
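
If you’d rather script the waiting than keep re-running metastat, the percentage is easy to pull out of that output. A minimal sketch, tested only against the sample line above: `resync_pct` is a made-up name, and the pattern assumes a whole-number percentage in exactly that `(resync-NN%)` form.

```shell
#!/bin/sh
# resync_pct: read 'metastat -c' style output on stdin and print the
# resync percentage; prints nothing when no resync is reported
resync_pct() {
    sed -n 's/.*(resync-\([0-9][0-9]*\)%).*/\1/p'
}

# sample line copied from the output above
echo '  d0               m  5.9GB d1 d2 (resync-15%)' | resync_pct    # prints 15
```

On the real box you could feed it `metastat -c d0` in a loop and sleep until it stops printing anything.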

Takes about 5 minutes, and that’s pretty much it.

multi-mirror swap shop

Up to you whether to do this – you can use the second swap device for more VM,
but mirroring should help if a disk dies while you’re running.
The process is very similar to the root slice:

metainit -f d51 1 1 c1d0s1
metainit d52 1 1 c2d0s1
metainit d50 -m d51
metattach d50 d52
swap -d /dev/dsk/c1d0s1
swap -a /dev/md/dsk/d50

Update /etc/vfstab to use /dev/md/dsk/d50 instead of /dev/dsk/c1d0s1
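
One way to make that edit, sketched against a sample line rather than the live file. It assumes the swap slice is c1d0s1, as in the swap -d command above; the `swap_to_md` name and the sample vfstab entries are made up for illustration.

```shell
#!/bin/sh
# swap_to_md: rewrite the device column of the swap entry from the raw
# slice to the mirror volume; every other line passes through untouched
swap_to_md() {
    awk 'BEGIN { OFS = "\t" }
         $1 == "/dev/dsk/c1d0s1" { $1 = "/dev/md/dsk/d50" }
         { print }'
}

# sample vfstab entries (not read from the real /etc/vfstab)
swap_to_md <<'EOF'
/dev/md/dsk/d0 /dev/md/rdsk/d0 / ufs 1 no -
/dev/dsk/c1d0s1 - - swap - no -
EOF
```

Write the result to a scratch file and eyeball it before copying it over /etc/vfstab; a broken vfstab is a miserable way to find out you made a typo.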

Setup the ZFS mirror

I want a ZFS mirror for home directories, apps, etc.
It’s not that I don’t trust SVM (although I don’t know it yet),
but it’s just a volume manager – you still have all the hassles of filesystems
on top of it, and if I wanted that I’d still be on Linux LVM.

zpool create tank mirror c1d0s7 c2d0s7
zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     0
            c2d0s7  ONLINE       0     0     0

errors: No known data errors

And that’s it.

Well, actually, no.

The next thing to do is pull out disks and check you can still boot
the machine. But this is getting a bit long-winded now, so that’ll be another
post.

no, honestly, you can stop reading now

My post-install checklist includes:

  • hooking the machine up for outbound email
      echo DSsmarthost.whatever.com >> /etc/mail/sendmail.cf
      echo 'root: me@whatever.com' >> /etc/mail/aliases
      svcadm restart sendmail
      newaliases
  • hardcode duplex settings
  • hook up pca
  • setup a firewall
  • setup NTP
      echo 'server time.apple.com' > /etc/inet/ntp.conf
      ntpdate -b time.apple.com
      svcadm enable ntp
  • create a user
      zfs create -o mountpoint=/export/home tank/home
      zfs create tank/home/dick
      useradd -c 'Dick Davies' -d /export/home/dick -s /usr/bin/zsh dick
      projadd -c 'Dick Davies' user.dick
      chown -R dick /export/home/dick
      passwd dick
  • switch on the (zone-friendly) Fair Share Scheduler
      dispadmin -d FSS
      reboot

It would be nice to Jumpstart this, and once we get a decent PXE solution that’ll be exactly what I’ll do. This will help no end.