Thursday, November 9, 2017

Complex systems, root cause analysis and failure

I just read http://www.michaelnygard.com/blog/2017/11/root-cause-analysis-as-storytelling/ and it reminded me of the classic "How Complex Systems Fail" (http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf).

We are building complex systems all the time, and it's actually scary how many defenses against failure are built into them. These defenses can be as simple as checking the return value of a function, or more complex, with fallbacks and alternative implementations. They aren't scary because they are there; they are scary when you realize that if even one of those defenses is missing, things go bad pretty quickly.

Currently humans are still superior at defending these systems. They create workarounds and processes that avoid potential failures. It might be really interesting to apply machine learning in these situations, trying to find the sets of actions that lead to failures.

But meanwhile, we have to learn from our systems by ourselves, so try to avoid hunting for that one root cause.

Wednesday, September 20, 2017

Amazon CloudFormation and tagging

AWS CloudFormation has multiple different commands in the aws cli, like "create-stack", "update-stack" and "deploy". Each of these has its good and bad sides. For multiple reasons, we've decided to use "deploy". But the problem then becomes tagging. "create-stack" and "update-stack" both support giving tags, which are then propagated to all supported resources, but "deploy" does not. To make things worse, some CloudFormation resource types do not support tags as properties, but they seem to get tags from the CloudFormation stack if the stack is tagged.

Now, after deploy, we run "aws cloudformation update-stack --stack-name <some> --tags ...". This becomes quite easy with some scripting when you have jq!

As update-stack wants to have all parameters given with "UsePreviousValue=true", we use some jq to generate the necessary parameters. Then we take the existing parameters we've used for tagging and generate tags from them.
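Roughly, the after-deploy step looks like the sketch below. It's a minimal sketch rather than our actual script: the stack name my-stack and the Project and Environment parameters used for tagging are made-up placeholders.

STACK=my-stack

# Keep every existing parameter value untouched by switching it to UsePreviousValue.
PARAMS=$(aws cloudformation describe-stacks --stack-name "$STACK" \
  | jq '[.Stacks[0].Parameters[] | {ParameterKey: .ParameterKey, UsePreviousValue: true}]')

# Build stack tags from the parameters we use for tagging.
TAGS=$(aws cloudformation describe-stacks --stack-name "$STACK" \
  | jq '[.Stacks[0].Parameters[]
         | select(.ParameterKey == "Project" or .ParameterKey == "Environment")
         | {Key: .ParameterKey, Value: .ParameterValue}]')

aws cloudformation update-stack --stack-name "$STACK" \
  --use-previous-template \
  --parameters "$PARAMS" \
  --tags "$TAGS"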

Well, actually "quite easy" is a lie, as I had some problems understanding the right syntax for replacing a key in a JSON array with jq.
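For reference, one way to replace a key in every object of a JSON array is to map over the array, delete the old key and merge in the new one (the example data here is made up):

echo '[{"ParameterKey":"Env","ParameterValue":"test"}]' \
  | jq -c 'map(del(.ParameterValue) + {UsePreviousValue: true})'
# prints [{"ParameterKey":"Env","UsePreviousValue":true}]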

Tuesday, September 19, 2017

Docker, Alpine and dillon's cron: "setpgid: Operation not permitted"

For a while, I've been struggling to get dillon's cron working properly in a Docker container. The problem has been that when the ENTRYPOINT was anything other than the shell form, I got 'setpgid: Operation not permitted'.

So, this worked:
ENTRYPOINT /usr/sbin/crond -f
None of these seemed to work:
ENTRYPOINT ["/usr/sbin/crond", "-f"]
Or
ENTRYPOINT ["./entrypoint.sh"]
CMD ["/usr/sbin/crond", "-f"]
Both would give
setpgid: Operation not permitted
Using the shell form has been enough so far. But now, as I finally needed an entrypoint for doing some preparation work, something had to be done.

"su -c" to the rescue.

ENTRYPOINT ["./entrypoint.sh"]
CMD ["su", "-c", "/usr/sbin/crond -f"]
Seems to be working perfectly.
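For completeness, a minimal sketch of what the entrypoint.sh could look like in this setup; the preparation step itself is just a placeholder.

#!/bin/sh
# Run the preparation work first, then hand control over to whatever CMD was given.
set -e

# ... preparation steps go here ...

# exec the CMD, i.e. su -c "/usr/sbin/crond -f"
exec "$@"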

Friday, September 30, 2016

Reinstalling NOOBS and Raspbian on Raspberry Pi

I've got a small gluster of six Raspberry Pis, and I wanted to update them all. I'm just too lazy to update every SD card one by one, so I wondered if it would be possible to do the update without removing the SD cards from the Raspberries. And it is!

Installation of NOOBS creates a partition on the SD card, /dev/mmcblk0p1. This partition contains the files needed for the install (details are available at https://github.com/raspberrypi/noobs/wiki/NOOBS-partitioning-explained). So the only thing you need to do is download the new NOOBS, mount /dev/mmcblk0p1 on the device and replace the files.

So you need to do something like the following to update NOOBS:
curl https://downloads.raspberrypi.org/NOOBS_latest -L -o noobs.zip
sudo mount -t vfat /dev/mmcblk0p1 /mnt
sudo rm -rf /mnt/*
sudo unzip noobs.zip -d /mnt/
Then you can boot up the Raspberry and start the NOOBS recovery by pressing the Shift key during startup. But I'm too lazy to do even that. Luckily it is possible to make NOOBS start the recovery and install the new OS automatically.

The behaviour of NOOBS can be controlled with command-line options. These options are defined in a file called "recovery.cmdline" in the root of /dev/mmcblk0p1. The default contents of the file are the following:

quiet ramdisk_size=32768 root=/dev/ram0 init=/init vt.cur_default=1 elevator=deadline

To make the installer start by default, you have to add the "runinstaller" option. This only starts the installer, though, and it will still need user input to continue. Another option, "silentinstall", tells the installer to go forth and install the OS. Just make sure that there is only one OS in the os/ directory, and if it has more than one flavour, edit its flavours.json file (details in https://github.com/raspberrypi/noobs#how-to-automatically-install-an-os).

So the recovery.cmdline should have the following contents:
runinstaller silentinstall quiet ramdisk_size=32768 root=/dev/ram0 init=/init vt.cur_default=1 elevator=deadline
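If you are editing the file by hand over SSH, one way to prepend the options (assuming the partition is still mounted at /mnt and the file has its default single-line contents) is:

sudo sed -i 's/^/runinstaller silentinstall /' /mnt/recovery.cmdline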
After the installation, the installer removes the "runinstaller" option from recovery.cmdline so that it does not reinstall on every boot. The "silentinstall" option remains, though.

So when everything is in place, at the next reboot there's a new version of NOOBS, and it will install the OS automatically. Just remember that everything on the Raspberry will be wiped!

Here's an Ansible playbook that does everything. It will take quite a while to complete, as the NOOBS image file is pretty big and takes a while to download and transfer to the hosts. The reason I'm downloading NOOBS to the local machine first is that I'm running this playbook against six Raspberries, and it should be faster to download NOOBS once and then transfer it to the Raspberries instead of downloading it on every Raspberry.
- hosts: all
  vars:
    - noobs_file: noobs.zip
    - recovery_directory: /mnt/recovery
  tasks:
    - name: download noobs if not present
      local_action: get_url url=https://downloads.raspberrypi.org/NOOBS_latest dest={{playbook_dir}}/{{noobs_file}}
      become: no 
    - name: mount device
      mount: name=/mnt/recovery src=/dev/mmcblk0p1 fstype=vfat state=mounted 
    - name: remove old noobs
      # the file module does not expand wildcards, so clear the directory with a shell command
      shell: rm -rf {{recovery_directory}}/*
    - name: unzip noobs
      unarchive: src={{noobs_file}} dest={{recovery_directory}} owner=root group=root 
    - name: set reinstall
      lineinfile: dest={{recovery_directory}}/recovery.cmdline regexp='^(runinstaller)?\s?(silentinstall)?\s?(.*)$' line='runinstaller silentinstall \3' backrefs=yes 
    - name: unmount device
      mount: name=/mnt/recovery src=/dev/mmcblk0p1 fstype=vfat state=unmounted 
    - name: reboot
      command: shutdown -r now
      ignore_errors: True
  become: yes
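Running it is an ordinary playbook run; the inventory and playbook file names here are made-up placeholders.

ansible-playbook -i raspberries.ini reinstall-noobs.yml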

Thursday, September 8, 2016

Getting the full error message from "docker service ps"


I was trying out Docker swarm, networks and services, and for some reason my nginx containers failed to start. Unfortunately, "docker service ps my-web" truncated the error, giving something like this:
e5qw27qr4qbc9vrm68g3i9tl0   my-web.1  nginx  node3  Shutdown       Failed 2 seconds ago          "starting container failed: ca…"
There will be a "--no-trunc" option in version 1.13, which should resolve this. Meanwhile, using "docker inspect e5qw27qr4qbc9vrm68g3i9tl0" (the ID from docker service ps) gave the full error message.
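To print just the error instead of digging through the whole JSON, something like the line below should work; note that the .Status.Err path is an assumption about where the task object keeps the message, so fall back to plain docker inspect if it differs.

# task ID copied from "docker service ps my-web"
docker inspect e5qw27qr4qbc9vrm68g3i9tl0 --format '{{ .Status.Err }}'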

In this case, the VM created with docker-machine did not have the necessary pieces to connect to the secured network.

Sunday, August 7, 2016

Ubuntu Xenial64 on VirtualBox and Vagrant


There were a lot of strange problems with ubuntu/xenial64, and in https://github.com/mitchellh/vagrant/issues/6616 there is a comment by Seth Vargo (a HashiCorp employee):
The ubuntu/xenial64 box is built wrong and horribly broken. Please note that "ubuntu" is the name of a user, not a representation of a canonical source for ubuntu images. Please try bento/ubuntu-16.04 instead. Thanks.
https://github.com/mitchellh/vagrant/issues/6616#issuecomment-227776489

These errors included the following:

rejecting i/o to offline device
This happened almost every time after heavier I/O operations, for example after loading Docker images.

stderr: Inappropriate ioctl for device
I think that this happened when Vagrant tried to set up network interfaces, mainly "enp0s8".

So just use bento/ubuntu-16.04.
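Switching an existing project over is just a matter of pointing the Vagrantfile at the other box, roughly like this (assuming the box name appears verbatim in the Vagrantfile):

vagrant destroy -f        # throw away the broken ubuntu/xenial64 VM
sed -i 's|ubuntu/xenial64|bento/ubuntu-16.04|' Vagrantfile
vagrant up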

Tuesday, July 5, 2016

Jenkins Workflow: Executing build step for every change in commit

At work, we wanted to send an email for every change that was made in a project. By default, Jenkins likes to collate changes into as few builds as possible, and normally sends one email per build.

The solution seemed to be to use Jenkins Pipeline, which enables creating and executing jobs "on the fly" as needed.

The first problem was getting access to the ChangeLogSet. There are some preset variables in a Jenkinsfile, but I could not find documentation for them. After some googling, Stack Overflow came to the rescue.

def changes = currentBuild.rawBuild.changeSets

But when this was executed, Jenkins complained
org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: Scripts not permitted to use method org.jenkinsci.plugins.workflow.support.steps.build.RunWrapper getRawBuild


There's an "In-process Script Approval" tool in Jenkins where you can allow the usage of these methods.

After that was solved, the next problem was serialization. As the actual job execution is transferred to a different node, every non-serializable object caused an exception. To prevent this, I had to null the objects in the proper places. This in turn prevented running the jobs inside the loop, as the variables used in the loops needed to be nulled before job execution. So I had to collect the jobs into a map, and after every job was defined, null everything and use the "parallel" task to execute the jobs.

So the whole thing is here:

//changes is http://javadoc.jenkins-ci.org/hudson/scm/ChangeLogSet.html
def changes = currentBuild.rawBuild.changeSets
//We need to create branches for later execution, as otherwise there would be serialization exceptions
branches = [:]
for (int j = 0; j < changes.size(); j++) {
    def change = changes.get(j)
    for (int i = 0; i < change.getItems().size(); i++) {
        def entry = change.getItems()[i]
        def commitTitleWithCaseNumber = entry.getMsg()
        def commitMessage = entry.getComment()
        //split from first non digit
        def caseNumber = (commitTitleWithCaseNumber =~ /^[0-9]*/)
        // check that caseNumber was in case place
        if( !caseNumber[0].isEmpty() && commitTitleWithCaseNumber.startsWith(caseNumber[0])) {
          // Remove number from title, just for nicer subject line
          def commitTitle = commitTitleWithCaseNumber.substring(caseNumber[0].length()).trim()
          def number = caseNumber[0]
          branches["mail-${j}-${i}"] = {
              node {
                  emailext body: commitMessage, subject: "[Sysart ${number}] ${commitTitle}", to: 'redacted@example.com'
              }
          }
        }
        // Need to forcibly null all non serializable classes
        caseNumber = null
        entry = null
    }
    change = null
}
changes = null
stage 'Mail'
parallel branches
This was a little more difficult than I had expected, mainly because of the serialization complications. But in the end it works, so it cannot be completely stupid.