Automatically Deduplicating Data on Debian Linux

Deduplication is the process by which a filesystem (or application) compares incoming data blocks against the blocks it has already stored and keeps only one copy of each matching block. Because identical blocks are stored only once, the files require significantly less space on the storage medium.

A good example (just for the sake of understanding) is WordPress. Let’s say you have a web server hosting 10 WordPress sites, and the WordPress source code on its own is 30 MB. Your server will be storing 300 MB (10 × 30 MB). Store the same sites on a deduplicating filesystem and it’ll be the original 30 MB plus a little overhead for the deduplication data… so for the sake of easy math, your server will store just 31 MB for exactly the same data.

These are small numbers, but I recently launched an off-site backup service for NEMS Linux, and I need to be able to store daily backups for its users. Guess what? From day to day, a significant portion of those backups are very, very similar; config files generally don’t change much between one day and the next. So why store them in a way that takes up 30x the space? Deduplication is going to save me a ton of storage space.

I’ve been reading up on some deduplication options. My first go-to was btrfs, but it’s not quite ready yet: inline deduplication still lives only out of tree. I suspect that once the feature lands in a stable release, btrfs will be my go-to… but for now, I need to find an alternate solution.

Lessfs is another one I peeked at, but once I noticed its “official” web site was offline and distribution is done through SourceForge, I moved on pretty quickly; it seems fairly obvious the project is either dead or, at best, not well supported.

Then I got looking at OpenDedup’s SDFS, a volume-based deduplicating filesystem, which sounds ideal for my use case for now. I won’t hold its being Java-based against it just yet, because the functionality sounds perfect. Plus, SDFS appears well supported and professional in its presentation, which gives me hope for its future.

I’m going to add some more memory to my little server to accommodate the RAM requirements. Make sure your system has adequate RAM… SDFS likes to eat memory for breakfast. “The SDFS Filesystem itself uses about 3GB of RAM for internal processing and caching. For hash table caching and chunk storage, kernel memory is used. It is advisable to have enough memory to store the entire hashtable so that SDFS does not have to scan swap space or the file system to lookup hashes. To calculate memory requirements keep in mind that each stored chunk takes up approximately 256 MB of RAM per 1 TB of unique storage.” [Admin Guide]
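
As a rough worked example under that guidance (my numbers, purely for illustration): a volume expected to hold 4 TB of unique data would want roughly 3 GB + (4 × 256 MB) ≈ 4 GB of RAM available to SDFS.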

If you’re not using Debian, check out their Quickstart Guide.

Here’s how I installed SDFS and its dependencies on my Debian system (this would also work on any other Debian-based system, such as Ubuntu, provided you run the commands as the root user):

apt -y install libxml2-utils
wget -O /tmp/sdfs.deb http://www.opendedup.org/downloads/sdfs-latest.deb
dpkg -i /tmp/sdfs.deb
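
To sanity-check that the package installed and the tools landed on your PATH (an optional check of my own, not from the official docs):

which mkfs.sdfs sdfscli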

Next up, we need to increase the limit of how many files can be opened at once… again, as the root user:

echo "* hard nofile 65535" >> /etc/security/limits.conf
echo "* soft nofile 65535" >> /etc/security/limits.conf

Next up, I need to create the volume itself, but I want to be specific about where it is stored. In this example I will call the volume “myvolume” and I will store it in a folder called raw_volume in my home folder… this way I know not to touch it (as it is raw):

mkfs.sdfs --hash-type=VARIABLE_MURMUR3 --volume-name=myvolume --volume-capacity=100GB --base-path=/home/robbie/raw_volume

Once created, if you’d like to see the status, type:

sdfscli --volume-info

…and you can view/edit the configuration in the file /etc/sdfs/myvolume-volume-cfg.xml, where myvolume is whatever you named yours with --volume-name above.

The reason I’m specifying to store in my home folder is because it will then be part of my backup set (without having to manually add it) and also because my home folder is on a different, bigger drive than the /opt folder, which is where SDFS would default to.

You’ll also notice in the above command I’ve set the capacity to 100GB. It won’t actually take this much space on my drive right now; that is simply the maximum I’m allowing the volume to grow to, and you can change it to suit your needs. On the disk itself (in /home/robbie/raw_volume in my example) the SDFS volume will only take up the amount of space used by the deduplicated data. If you ever need to make the volume bigger, you can do so by typing the following with the volume unmounted: sdfscli --expandvolume 512GB

Also, since this is a local filesystem, I’ve specified to use a variable block size, which could reduce the amount of space and improve the deduplication.

Now I need to create the mountpoint (and make it immutable so nothing can accidentally write to it while the volume isn’t mounted; more on that trick later in this post) and mount the SDFS volume so I can start writing data to it:

mkdir /home/robbie/backup
chattr +i /home/robbie/backup

Now let’s prepare the mount.sdfs command:

nano /sbin/mount.sdfs

Scroll to the end of the file and remove “-Xmx$MEMORY$MU”, and edit “-Xms$MEMORY$MU” to instead read “-Xms1M”.

So my final command looks like this:

LD_PRELOAD="${BASEPATH}/bin/libfuse.so.2" $EXEC -server -outfile '&1' -errfile '&2' -Djava.library.path=${BASEPATH}/bin/ -home ${BASEPATH}/bin/jre -Dorg.apache.commons.logging.Log=fuse.logging.FuseLog -Xss2m \
 -wait 99999999999 -Dfuse.logging.level=INFO -Dfile.encoding=UTF-8 -Xms1M \
-XX:+DisableExplicitGC -pidfile /var/run/$PF -XX:+UseG1GC -Djava.awt.headless=true \
 -cp ${BASEPATH}/lib/* fuse.SDFS.MountSDFS "$@"

Then, mount it to test:

mount.sdfs myvolume /home/robbie/backup

Try writing some data to the mountpoint. If all went well, everything I write to /home/robbie/backup is automatically deduplicated to save space!
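
One quick way to see it in action (a sketch of my own; these file names are made up): write a file, copy it, and compare the on-disk usage, just as I do in the Results section below.

dd if=/dev/urandom of=/home/robbie/backup/test1.bin bs=1M count=100
cp /home/robbie/backup/test1.bin /home/robbie/backup/test2.bin
ls -lskh /home/robbie/backup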

Next up, adding it to fstab!

If all went well, unmount it and make it so it automounts.

umount /home/robbie/backup

Despite what some people are saying online, yes, you can indeed mount sdfs filesystems using fstab! It’s a fuse-based filesystem! #facepalm

Here’s how I added it to my fstab:

myvolume /home/robbie/backup sdfs defaults,noatime,rw,x-systemd.device-timeout=5 0 0
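
To confirm the entry works without rebooting (assuming the volume is currently unmounted), you can ask mount to process fstab and then check the result:

mount -a
df -h /home/robbie/backup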

All is working great, but it’ll be most interesting to see what begins happening once I exceed ~1 GB storage and deduplication starts doing its thing.

The Results

To see the difference in usage, I like simply using this command:

ls -lskh

This will output something along the lines of this:
47K -rw-r--r-- 1 root root 1.6M Feb 2 10:02 test2.txt
1.5M -rw-r--r-- 1 root root 1.6M Feb 2 09:55 test1.txt

You’ll notice each file shows two sizes. The first (the leftmost column) is the actual usage on disk thanks to deduplication. The second (just before the date) is the real file size.

I even noted that when copying multiple copies of the same file, the “extra” copies showed an on-disk size of 0B! Yes, the impact is so small it didn’t even register. Brilliant.
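
If you prefer du, the same comparison looks like this (--apparent-size reports the logical size, while the plain call reports what is actually consumed on disk; the paths assume the test files above live in the mounted volume):

du -h /home/robbie/backup/test1.txt
du -h --apparent-size /home/robbie/backup/test1.txt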

Backup a Linux machine with LVM Snapshots and rdiff-backup

Here is the completed script I wrote on Episode 461. Make sure you check out the full episode for details on how to make this work for you.

#!/bin/sh
/sbin/lvcreate -L10G -s -n lvm_snapshot /dev/ubuntu-mate-vg/root
/bin/mount /dev/ubuntu-mate-vg/lvm_snapshot /mnt/snapshot

/usr/bin/rdiff-backup -v5 --print-statistics \
  --exclude /mnt/backup/ \
  --include /mnt/snapshot/home/ \
  --include /mnt/snapshot/etc/fstab \
  --include /mnt/snapshot/var/log/ \
  --exclude '**' \
  / \
  /mnt/backup/

/bin/umount /mnt/snapshot
/sbin/lvremove -f /dev/ubuntu-mate-vg/lvm_snapshot
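
If you ever need to pull something back out of that backup set, rdiff-backup can list the available increments and restore a file as of a given age. A hypothetical example (the paths mirror the script above; the user, filename and the 3-day age are placeholders):

/usr/bin/rdiff-backup --list-increments /mnt/backup
/usr/bin/rdiff-backup -r 3D /mnt/backup/mnt/snapshot/home/someuser/somefile /tmp/somefile-restored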


Plex Media Server on a Raspberry Pi 3

I wanted to document the instructions shared on Episode 459 to supplement the episode.

On the show, Jeff and I demonstrated how to turn a Raspberry Pi 3 with Raspbian Jessie into a Plex Media Server, giving you the chance to stream your entire video and music library to all your devices.

I won’t get into the full details here, since this is only a supplement to give you some copy-and-paste instructions, but I’d encourage you to watch the video.

What You Need

  1. A Raspberry Pi 3 Micro Computer. Please consider purchasing it through our store to support what we do: https://cat5.tv/pi
  2. Raspbian Jessie – A free download from raspberrypi.org
  3. Obvious stuff like a good MicroSD card, Ethernet cable (preferred as opposed to wifi), keyboard and mouse… etc.

How to Do The Do
Updated February 7, 2018, due to some evolution of the process. These steps are more current than those used in the video (a new video will be coming soon).

  1. In terminal, upgrade your distro to the latest and greatest.
    sudo apt update
    sudo apt upgrade
    sudo apt dist-upgrade
  2. Reboot the Pi.
    sudo reboot
  3. Add the ability for apt to use https repositories. If you already have this, it’ll report as “already the current version” and you can move on.
    sudo apt install apt-transport-https
  4. Add the Plex Media Server repository provided by Universität Leipzig.
    echo "deb https://dev2day.de/pms/ jessie main" | sudo tee /etc/apt/sources.list.d/pms.list
  5. Add the GPG key for the repository.
    This is the “easy” method (which didn’t work for us because my keyboard was in some weird mode with no pipe character):

    wget -O - https://dev2day.de/pms/dev2day-pms.gpg.key | sudo apt-key add -

    Alternate method (which I had to use on the show since I didn’t have a pipe character; I’ve tidied it up since the live show, where the unexpected twist made it seem more confusing than it needed to be):

    wget -O /tmp/pms.key https://dev2day.de/pms/dev2day-pms.gpg.key
    sudo apt-key add /tmp/pms.key
  6. Update apt.
    sudo apt update
  7. Install Plex Media Server.
    sudo apt install plexmediaserver-installer
  8. Create the default config file so Plex knows what user to operate under.
    echo "PLEX_MEDIA_SERVER_USER=pi" | sudo tee -a /etc/default/plexmediaserver
    sudo chown -R pi:pi /var/lib/plexmediaserver
    sudo service plexmediaserver restart

    (Thanks to Steve for submitting this additional step)

  9. Reboot one final time.
    sudo reboot

And there you have it! All the commands we used to get Plex Media Server installed on a Raspberry Pi 3 in a nice clean blog post  🙂
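
If you want to double-check that the server came back up after the reboot, you can query the same service we restarted in step 8:

sudo service plexmediaserver status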

Optional: Use External Storage for Media

From there, we plugged in the USB flash drive (don’t do it! Use a proper external hard drive–this was only a demonstration) and after it mounted we used the following command to see its /dev assignment:

sudo mount

Since our drive was /dev/sda1, and of the filesystem type “fat32” this is what I did to make it work as the media library for Plex Media Server:

sudo nano /etc/fstab

and add the following line:

/dev/sda1 /mnt/library vfat defaults 0 0

I then created the mountpoint:

sudo mkdir /mnt/library

and made it so it can only be written to if mounted:

sudo chattr +i /mnt/library

and finally, mounted the drive:

sudo mount -a
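
If everything went to plan, the drive should now show up at the mountpoint:

df -h /mnt/library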

From there, I could easily add folders on my external drive to Plex using the web interface, which you’ll find on Port 32400 in the /web subfolder on your Pi.

To get my IP address, I brought up the terminal on the Pi and typed:

sudo ifconfig

That showed the IP address of my Pi under “Ethernet”… 192.168.0.105

So to open Plex in my browser, from my computer I entered:

192.168.0.105:32400/web

The IP address will most likely be different for yours, and you might even want to set it up as a static IP. Easiest way to do that would be to use your router’s DHCP reservations to hard-set the Pi to something outside your DHCP pool. For me, that’d be 192.168.0.5 or something like that, since the pool seemingly starts at 100.

Good luck, and if you have any questions or comments, please leave them below. Don’t forget, if this has helped you out, or if you just love supporting nice guys who wanna keep giving knowledge for free, please head over to our Patreon page, or throw a bit in the tip jar. Thanks!

Make it so a mountpoint can’t be written to if nothing is mounted.

Have you ever accidentally saved files to a Linux mountpoint when the drive wasn’t mounted, and then couldn’t mount the drive thereafter? Or worse, had a backup run when the backup drive wasn’t mounted, only to fill your filesystem and crash the server?

These problems can be avoided by simply making your mountpoint immutable! What this means is, your mountpoint (the folder itself) cannot be written to. However, even as an immutable folder, it can be mounted to, and the filesystem of the mounted drive then controls the permissions of the folders therein.

It’s a simple Linux command. We’ll pretend our mountpoint is simply /mountpoint. Here’s all you have to do:

chattr +i /mountpoint

Brilliant! And oh, so simple.

Here’s a sample of what happens when I do this as root. Note that ‘mymountpoint’ is setup for me in my /etc/fstab file so it normally auto-mounts.

root@server:/# umount mymountpoint
root@server:/# chattr +i mymountpoint
root@server:/# cd mymountpoint
root@server:/mymountpoint# touch test
touch: cannot touch `test': Permission denied
root@server:/mymountpoint# mount -a
root@server:/mymountpoint# touch test
root@server:/mymountpoint#

Enjoy that little tidbit!

As a side note, you might want to also get a notification if your drive isn’t mounted… so you could use the mountpoint command to send you an email if there’s a problem. Just add something like this to your backup script:

mountpoint -q /mymountpoint || mail -s "/mymountpoint is not mounted for the backup" [email protected]

That simply checks whether /mymountpoint is a mounted mountpoint. If it is, it does nothing. If it isn’t, it will send you an email.
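
If you’d rather bake the check right into the top of a backup script, a minimal sketch might look like this (the address is a placeholder, and the mail command assumes a working local mailer):

#!/bin/sh
# Abort the backup if the destination isn't actually mounted
if ! mountpoint -q /mymountpoint; then
  echo "Backup aborted: destination not mounted." | mail -s "/mymountpoint is not mounted for the backup" admin@example.com
  exit 1
fi
# ... backup commands go here ...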

-Robbie

Convert video to several JPG images on Linux without ffmpeg.

These days I just use this command and hit CTRL-C when the video frames (V:) stop moving:

mplayer -vo jpeg:outdir=screenshots -sstep 10 filename.mp4

But, this post remains for the sake of historical record – lol!


I admit… I do love PHP in the command line. Does that make me a bad person? 😉

Here’s a tiny little script that I wrote to create many JPG screenshots of a video file. I use this each week to create a bunch of stills from our broadcast so I can use them as thumbnails and so-on. I didn’t want it to depend on ffmpeg since I don’t have that on any of my modern systems.

It requires just three packages: mplayer, mediainfo and php5.

Save it as whatever.php and run it like this: php whatever.php file.wmv

It will create a folder called file-Screenshots/ and will save one picture per 10 seconds for any video source. Just change “file.wmv” to the name of your video. Include the path if it’s not in the current folder.

<?php
  // Depends: mplayer mediainfo
  // Does not need ffmpeg (deprecated)

  // Accept the filename from the query string (if run via a web server) or from the first CLI argument
  if ($_GET) {
    $file = $_GET['file'];
  } else {
    $file = $argv[1];
  }
  
  if (strlen($file) < 3) exit('Need a proper filename for input.' . PHP_EOL);
  // Name the output folder after everything before the first dot in the filename
  $parts = explode('.', $file);
  $dir = array_shift($parts) . '-Screenshots';

  $duration = duration($file);
  echo 'Duration in Seconds: ' . $duration . PHP_EOL;
  echo 'Saving to folder:    ' . $dir . PHP_EOL;
  echo 'Creating ' . ($duration/10) . ' JPG images from source...';
  exec('mplayer -vo jpeg:outdir=' . $dir . ' -sstep 10 -endpos ' . ($duration-2) . ' ' . $file . ' > /dev/null 2>&1');
  echo ' Done.' . PHP_EOL; 

  function duration($file) {
    if (file_exists($file)) {
      // Ask mediainfo for the Duration line and keep everything after the first colon (e.g. " 1h 23mn")
      exec('mediainfo -Inform="Audio;%ID%:%Format%:%Language/String%\n" ' . $file . ' | grep -m1 Duration | cut -d\':\' -f2',$result);
      // Convert the hours/minutes value into seconds: ((hours * 60) + minutes) * 60
      $tmp = explode('h',$result[0]);
      $seconds = ((intval($tmp[0]*60)+intval($tmp[1]))*60);
      return intval(trim($seconds));
    } else {
      exit('File ' . $file . ' not found.' . PHP_EOL);
    }
  }
?>

Hope it helps you out.

-Robbie

Find the version number of all WordPress installations on your Linux server.

I have a lot of customers running WordPress on our shared hosting servers, and sometimes they neglect to update their WordPress installs. [Rolls Eyes]

I need to know which of these sites are using an obsolete version of WordPress so I may contact the customer and warn them that they need to update their software.

So here’s a helpful little Linux command I whipped up and ran as root to go through my /home folder searching for all WordPress versions. I only had to run it as root because I am checking through all users’ folders, not just my own. If you only want to check your own user, you don’t need root access.

I ran this command from my /home folder on the Linux server:

find . -name 'version.php' -exec grep '$wp_version =' {} /dev/null \; > /tmp/wordpress-versions.log

Breakdown:

  • find . -name 'version.php'
    Search through the current folder, recursively, for any file named version.php. This is where WordPress stores the WordPress version number.
  • -exec
    Execute a command for each found item.
  • grep '$wp_version =' {}
    Look within each found version.php file for the term $wp_version = and output the matching line.
  • /dev/null
    Trick grep into thinking there is a second file, forcing it to precede its output with the filename provided by find.
  • \;
    Terminate the -exec command.
  • > /tmp/wordpress-versions.log
    Save the results to a log file in /tmp. You can tail -f this file while scanning, or simply open or cat it when you’re done. Leave this portion out of the command if you’d rather have the output go directly to your screen.
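
If you just want a quick tally of how many sites run each version, you can post-process that log. This is a sketch that assumes the default grep output format of path:$wp_version = 'x.y.z';:

cut -d"'" -f2 /tmp/wordpress-versions.log | sort | uniq -c | sort -rn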

Preventing rsync from doubling–or even tripling–your S3 fees.

Using rsync to upload files to Amazon S3 over s3fs?  You might be paying double–or even triple–the S3 fees.

I was observing the file upload progress on the transcoder server this morning, curious how it was moving along, and I noticed something: the currently uploading file had an odd name.

My file, CAT5TV-265-Writing-Without-Distractions-With-Free-Software-HD.m4v was being uploaded as .CAT5TV-265-Writing-Without-Distractions-With-Free-Software-HD.m4v.f100q3.

I use rsync to upload the files to the S3 folder over S3FS on Debian, because it offers good bandwidth control.  I can restrict how much of our upstream bandwidth is dedicated to the upload and prevent it from slowing down our other services.

Noticing the filename this morning, and understanding the way rsync works, I knew the temporary filename would be renamed the instant the upload completed.

In a normal disk-to-disk operation, or when rsync’ing over something such as SSH, that’s fine, because a simple mv doesn’t use any resources and certainly doesn’t cost anything: it’s just a rename operation. So why did my antennae go up this morning? Because I also know how S3FS works.

A rename operation over S3FS means the file is first downloaded to a file in /tmp, renamed, and then re-uploaded.  So what rsync is effectively doing is:

  1. Uploading the file to S3 with a random filename, with bandwidth restrictions.
  2. Downloading the file to /tmp with no bandwidth restrictions.
  3. Renaming the /tmp file.
  4. Re-uploading the file to S3 with no bandwidth restrictions.
  5. Deleting the temp files.

Fortunately, this is 2013 and not 2002.  The developers of rsync realized at some point that direct uploading may be desired in some cases.  I don’t think they had S3FS in mind, but it certainly fits the bill.

The option is --inplace.

Here is what the manpage says about --inplace:

This option changes how rsync transfers a file when its data needs to be updated: instead of the default method of creating a new copy of the file and moving it into place when it is complete, rsync instead writes the updated data directly to the destination file.

It’s that simple!  Adding --inplace to your rsync command will cut your Amazon S3 transfer fees by as much as 2/3 for future rsync transactions!
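
For what it’s worth, the sort of invocation I’m describing would look roughly like this (the paths and the bandwidth cap are placeholders, not my actual values):

rsync -av --progress --inplace --bwlimit=2000 /path/to/episodes/ /mnt/s3/episodes/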

I’m glad I caught this before the transcoders transferred all 314 episodes of Category5 Technology TV to S3.  I just saved us a boatload of cash.

Happy coding!

– Robbie