| field | value | timestamp |
|---|---|---|
| author | Franck Cuny <franck@fcuny.net> | 2024-12-06 17:37:28 -0800 |
| committer | Franck Cuny <franck@fcuny.net> | 2024-12-06 17:37:41 -0800 |
| commit | 0a350777002ba638bcd44eb23db323b12f7c5d9e | |
| tree | f512c7c320e34d301c4f3715f16a098552c2b530 /content/blog | |
| parent | some style changes for the default template | |
| download | fcuny.net-0a350777002ba638bcd44eb23db323b12f7c5d9e.tar.gz | |
get rid of sections
I use tags to organize things.
Diffstat (limited to 'content/blog')
| mode | file | lines deleted |
|---|---|---|
| -rw-r--r-- | content/blog/1password-ssh-agent.md | 208 |
| -rw-r--r-- | content/blog/_index.md | 6 |
| -rw-r--r-- | content/blog/git-link-and-sourcegraph.md | 52 |
| -rw-r--r-- | content/blog/google-doc-failure.md | 69 |
| -rw-r--r-- | content/blog/leaving-twitter.md | 14 |
| -rw-r--r-- | content/blog/nix-raid-systemd-boot.md | 53 |
| -rw-r--r-- | content/blog/no-ssh-to-prod.md | 29 |
| -rw-r--r-- | content/blog/tailscale-docker-https.md | 127 |
8 files changed, 0 insertions, 558 deletions
diff --git a/content/blog/1password-ssh-agent.md b/content/blog/1password-ssh-agent.md
deleted file mode 100644
index 5d5d436..0000000
--- a/content/blog/1password-ssh-agent.md
+++ /dev/null
@@ -1,208 +0,0 @@
+++
title = "1password's ssh agent and nix"
date = 2023-12-02
[taxonomies]
tags = ["nix"]
+++

[A while ago](https://blog.1password.com/1password-ssh-agent/), 1password introduced an SSH agent, and I've been using it for a while now. The following describes how I've configured it with `nix`. All my ssh keys are in 1password, and it's the only ssh agent I'm using at this point.

## Personal configuration

I have a personal 1password account, and I've created a new SSH key in it that I use both to authenticate to github and to sign commits. I use [nix-darwin](http://daiderd.com/nix-darwin/) and [home-manager](https://github.com/nix-community/home-manager) to configure my personal machine.

This is how I configure ssh:

```nix
programs.ssh = {
  enable = true;
  forwardAgent = true;
  serverAliveInterval = 60;
  controlMaster = "auto";
  controlPersist = "30m";
  extraConfig = ''
    IdentityAgent "~/Library/Group Containers/2BUA8C4S2C.com.1password/t/agent.sock"
  '';
  matchBlocks = {
    "github.com" = {
      hostname = "github.com";
      user = "git";
      forwardAgent = false;
      extraOptions = { preferredAuthentications = "publickey"; };
    };
  };
};
```

The configuration for git:

```nix
{ lib, pkgs, config, ... }:
let
  sshPub = builtins.fromTOML (
    builtins.readFile ../../configs/ssh-pubkeys.toml
  );
in
{
  home.file.".ssh/allowed_signers".text = lib.concatMapStrings (x: "franck@fcuny.net ${x}\n") (with sshPub; [ ykey-laptop ykey-backup op ]);

  programs.git = {
    enable = true;
    userName = "Franck Cuny";
    userEmail = "franck@fcuny.net";

    signing = {
      key = "key::${sshPub.op}";
      signByDefault = true;
    };

    extraConfig = {
      gpg.format = "ssh";
      gpg.ssh.allowedSignersFile = "~/.ssh/allowed_signers";
      gpg.ssh.program = "/Applications/1Password.app/Contents/MacOS/op-ssh-sign";
    };
  };
}
```

In the repository with my nix configuration, I have a file `ssh-pubkeys.toml` that contains all the public ssh keys I keep track of (mine and those of a few other developers). Keys from that file are used to create the file `~/.ssh/allowed_signers`, which is then used by `git` (for example with `git log --show-signature`) when I want to ensure commits are signed with a valid key.

`ssh-pubkeys.toml` looks like this:

```toml
# yubikey connected to the laptop
ykey-laptop="ssh-ed25519 ..."
# backup yubikey
ykey-backup="ssh-ed25519 ..."
# 1password key
op="ssh-ed25519 ..."
```

And the following is for `zsh`, so that I can use the agent for other commands I run in the shell:

```nix
programs.zsh.envExtra = ''
  # use 1password ssh agent
  # see https://developer.1password.com/docs/ssh/get-started#step-4-configure-your-ssh-or-git-client
  export SSH_AUTH_SOCK=~/Library/Group\ Containers/2BUA8C4S2C.com.1password/t/agent.sock
'';
```

And that's it: this is enough to use the agent for all my personal use cases.

## Work configuration

The work configuration is slightly different. Here I want to use both my work and personal keys, so that I can clone some of my personal repositories on the work machine (for example my emacs configuration). We also use both github.com and a github enterprise instance, and I need to authenticate against both.
I've imported my existing keys into 1password, and I keep the public keys on disk: `$HOME/.ssh/work_gh.pub` and `$HOME/.ssh/personal_gh.pub`. I've removed the private keys from the disk.

This is the configuration I use for work:

```nix
programs.ssh = {
  enable = true;
  forwardAgent = true;
  serverAliveInterval = 60;
  controlMaster = "auto";
  controlPersist = "30m";
  extraConfig = ''
    IdentityAgent "~/Library/Group Containers/2BUA8C4S2C.com.1password/t/agent.sock"
  '';
  matchBlocks = {
    "personal" = {
      hostname = "github.com";
      user = "git";
      forwardAgent = false;
      identityFile = "~/.ssh/personal_gh.pub";
      identitiesOnly = true;
      extraOptions = { preferredAuthentications = "publickey"; };
    };
    "work" = {
      hostname = "github.com";
      user = "git";
      forwardAgent = false;
      identityFile = "~/.ssh/work_gh.pub";
      identitiesOnly = true;
      extraOptions = { preferredAuthentications = "publickey"; };
    };
    "github.enterprise" = {
      hostname = "github.enterprise";
      user = "git";
      forwardAgent = false;
      identityFile = "~/.ssh/work_gh.pub";
      identitiesOnly = true;
      extraOptions = { preferredAuthentications = "publickey"; };
    };
  };
};
```

I also create a configuration file for the 1password agent, to make sure I can use the keys from both accounts:

```nix
# Generate ssh agent config for 1Password - I want both my personal and work keys
home.file.".config/1Password/ssh/agent.toml".text = ''
  [[ssh-keys]]
  account = "my.1password.com"
  [[ssh-keys]]
  account = "$work.1password.com"
'';
```

Then the git configuration:

```nix
{ config, lib, pkgs, ... }:
let
  sshPub = builtins.fromTOML (
    builtins.readFile ../etc/ssh-pubkeys.toml
  );
in
{
  home.file.".ssh/allowed_signers".text = lib.concatMapStrings (x: "franck@fcuny.net ${x}\n") (with sshPub; [ work_laptop op ]);

  programs.git = {
    enable = true;

    signing = {
      key = "key::${sshPub.op}";
      signByDefault = true;
    };

    extraConfig = {
      gpg.format = "ssh";
      gpg.ssh.allowedSignersFile = "~/.ssh/allowed_signers";
      gpg.ssh.program = "/Applications/1Password.app/Contents/MacOS/op-ssh-sign";

      url = {
        "ssh://git@github.enterprise/" = {
          insteadOf = "https://github.enterprise/";
        };
      };
    };
  };
}
```

Now, when I clone a repository, instead of doing `git clone git@github.com:$WORK/repo` I do `git clone work:$WORK/repo`.

## Conclusion

I've used a yubikey to sign my commits for a while, but I find the 1password ssh agent a bit more convenient. The initial setup for the yubikey was not as straightforward (granted, it's a one-time thing per key).

On my personal machine, my `$HOME/.ssh` looks as follows:

```sh
➜ ~ ls -l ~/.ssh
total 16
lrwxr-xr-x@ 1 fcuny staff  83 Nov  6 17:03 allowed_signers -> /nix/store/v9qhbr2vb7w6bd24ypbjjz59xis3g8y2-home-manager-files/.ssh/allowed_signers
lrwxr-xr-x@ 1 fcuny staff  74 Nov  6 17:03 config -> /nix/store/v9qhbr2vb7w6bd24ypbjjz59xis3g8y2-home-manager-files/.ssh/config
-rw-------@ 1 fcuny staff 828 Nov 13 17:53 known_hosts
```

When I create a new commit, 1password asks me to authorize git to use the agent and sign the commit. Same when I want to ssh to a host.

When I'm working on the macbook, I use touch ID to confirm; when the laptop is connected to a dock, I need to type my 1password password to unlock it and authorize the command.

There's a cache in the agent, so I'm not prompted too often. I find this convenient: I will never have to copy my ssh key when I get a new laptop, since it's already in 1password.
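To double-check which keys the agent actually serves, it can be queried like any other ssh agent; a minimal check, assuming the same socket path as in the configuration above:

```sh
# point SSH_AUTH_SOCK at the 1password agent socket and list the keys it exposes
export SSH_AUTH_SOCK=~/Library/Group\ Containers/2BUA8C4S2C.com.1password/t/agent.sock
ssh-add -l
```

The keys stored in 1password should show up in the output; if one is missing, the corresponding account is probably not listed in `agent.toml`.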
The agent has worked flawlessly so far, and I'm happy with this setup.

diff --git a/content/blog/_index.md b/content/blog/_index.md
deleted file mode 100644
index d44a9f7..0000000
--- a/content/blog/_index.md
+++ /dev/null
@@ -1,6 +0,0 @@
---
title: Blog
sort_by: date
render: true
template: blog.html
---

diff --git a/content/blog/git-link-and-sourcegraph.md b/content/blog/git-link-and-sourcegraph.md
deleted file mode 100644
index c86b465..0000000
--- a/content/blog/git-link-and-sourcegraph.md
+++ /dev/null
@@ -1,52 +0,0 @@
+++
title = "emacs' git-link and sourcegraph"
date = 2021-08-24
[taxonomies]
tags = ["emacs"]
+++

I use [sourcegraph](https://sourcegraph.com/) for searching code, and I sometimes need to share a link to the source code I'm looking at in a buffer. For this, the package [`git-link`](https://github.com/sshaw/git-link) is great.

To integrate sourcegraph and `git-link`, the [documentation](https://github.com/sshaw/git-link#sourcegraph) recommends adding a remote entry named `sourcegraph` in the repository, like this:

```bash
git remote add sourcegraph https://sourcegraph.com/github.com/sshaw/copy-as-format
```

The next time you run `M-x git-link` in a buffer, it will use the URL associated with that remote. That works great, except that you now need to add this remote to every repository. Instead, for my usage, I came up with the following solution:

```lisp
(use-package git-link
  :ensure t
  :after magit
  :bind (("C-c g l" . git-link)
         ("C-c g a" . git-link-commit))
  :config
  (defun fcuny/get-sg-remote-from-hostname (hostname)
    (format "sourcegraph.<$domain>.<$tld>/%s" hostname))

  (defun fcuny/git-link-work-sourcegraph (hostname dirname filename _branch commit start end)
    ;;; For a given repository, build the proper link for sourcegraph.
    ;;; Use the default branch of the repository instead of the
    ;;; current one (we might be on a feature branch that is not
    ;;; available on the remote).
    (require 'magit-branch)
    (let ((sg-base-url (fcuny/get-sg-remote-from-hostname hostname))
          (main-branch (magit-main-branch)))
      (git-link-sourcegraph sg-base-url dirname filename main-branch commit start end)))

  (defun fcuny/git-link-commit-work-sourcegraph (hostname dirname commit)
    (let ((sg-base-url (fcuny/get-sg-remote-from-hostname hostname)))
      (git-link-commit-sourcegraph sg-base-url dirname commit)))

  (add-to-list 'git-link-remote-alist '("twitter" fcuny/git-link-work-sourcegraph))
  (add-to-list 'git-link-commit-remote-alist '("twitter" fcuny/git-link-commit-work-sourcegraph))

  (setq git-link-open-in-browser 't))
```

We use different domains to host various git repositories at work (e.g. `git.$work`, `gitfoo.$work`, etc.). Each of them maps to a different URI for sourcegraph (e.g. `sourcegraph.$work/gitfoo`).

`git-link-remote-alist` and `git-link-commit-remote-alist` are [association lists](https://www.gnu.org/software/emacs/manual/html_node/elisp/Association-Lists.html) that map a regular expression to a function. The custom functions receive the hostname of the remote repository, which is then used to generate the URI of our sourcegraph instance. I then call `git-link-sourcegraph`, replacing the hostname with the URI for sourcegraph.

Now I can run `M-x git-link` in any repository where the host of the origin git repository matches `twitter`, without having to set up the custom remote first.
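As a quick sanity check, the helper that builds the sourcegraph base URL can be evaluated on its own; a sketch, where `git.$work` stands in for one of the internal git hosts (the hostname is hypothetical):

```lisp
;; evaluate with M-: or in the *scratch* buffer; "git.$work" is a placeholder
(fcuny/get-sg-remote-from-hostname "git.$work")
;; => "sourcegraph.<$domain>.<$tld>/git.$work"
```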
diff --git a/content/blog/google-doc-failure.md b/content/blog/google-doc-failure.md
deleted file mode 100644
index b4a65b9..0000000
--- a/content/blog/google-doc-failure.md
+++ /dev/null
@@ -1,69 +0,0 @@
+++
title = "Google Doc Failures"
date = 2021-04-11
[taxonomies]
tags = ["practices"]
+++

In most use cases, Google Doc is an effective tool to create "write once, read never" documents.

## Convenience

Google Doc (GDoc from now on) is the most common way of writing and sharing documents at my current job. It's very easy to start a new document, even more so since we can now point our browser to <https://doc.new> and start typing right away.

Like most of my co-workers, I use it frequently during the day. Some of these documents are drafts of communications that I want others to review before I share them with a broader audience; it can be a [Request For Comments](https://en.wikipedia.org/wiki/Request_for_Comments) for a project; meeting notes for others to read; information that I need to capture during an incident or a debugging session; interview notes; etc.

I would not be surprised if the teams I work closely with generate 50 new documents each week.

## ETOOMANYTABS

I have a tendency to keep hundreds of tabs open in my browser during the week. A majority of these tabs are GDocs, and I think this is one of the true failures of the product. Why do I have so many tabs? There are mainly two reasons.

The first reason is a problem with Chrome's UX itself: it happily lets me open the same URL as many times as I want, in as many tabs, instead of sending me to the already opened tab if the document is loaded. It's not uncommon that I find the same document opened in 5 different tabs.

The second reason, and it's the most important one: I know that if I need to read or comment on a doc and I close the tab, I'll likely never find that document again, or will completely forget about it.

## Discoverability

In 'the old days', you'd start a new document in Word or LibreOffice, and as you hit "save" for the first time, you had two decisions to make: how am I going to name that file, and where am I going to save it on disk.

With GDoc these questions don't have to be answered: you don't have to name the file, and it does not matter where it lives. I likely have hundreds of docs named 'untitled' in my "drive". I also don't have to think about where they will live, because they are saved automatically for me. I'm sure there are hundreds of studies showing that these two simple steps are actually complex for many users and create useless friction (in which folder do I store it; should I organize the documents by team, year, or project; do I name it with the date and the current project; etc.).

GDoc being a Google product, it seems pretty obvious that they would come up with a better solution: let's not organize these files in a strict hierarchy, and let's instead search for them.

Unfortunately, GDoc's search is really poor (and I'm being kind). By default most of us start by looking for some words we know are in the doc, maybe even in the title. But when working on multiple projects related to the same technology, you suddenly get hundreds of documents matching your query. It's unclear how the returned set is ordered (by date? by author? by some scoring that is invisible to me?).

You can also search by owner, but here is another annoying bit: I think of the owner as the author, so I usually type `author:foo` before realizing it does not work.
And that implies you already know who the owner of the document is. In the case of TDDs (Technical Design Documents), I might know which team is behind it, but rarely who the actual author is.

I could search for the title, but I rarely remember or know the name of the document I'm looking for. I could also look by keywords, but when working on a project with tens of related documents, you have to open all the returned docs to see which one is the correct one.

And then what about new members joining the team? They don't know which docs exist, who wrote them, or how they are named. They end up searching and hoping that something good will be returned.

## Workflows

More and more, we create workflows around these documents: some of the docs are TDDs that are going through reviews; others are decision documents that require input from multiple teams and are pending approval; others are road map documents that also go through some review process.

As a result we create templates for all kinds of documents, usually with something like "draft → reviews → approved/rejected" at the top. We expect the owner of the doc to mark the status in bold, to help the reader understand what state the document is in. It's difficult to keep track of open actions and comments. Yes, there's a way to get a list of all of them, but it's not in an obvious place.

As a result, some engineers in my team built an external dashboard with swim lanes that captures the state of each document. We add new documents with their URLs, note who the reviewers are, and move the docs between the lanes. Now we have to operate a service and a database to keep track of the status of documents in GDoc.

## Alternatives

When it comes to technical documents, I find this [approach](https://caitiem.com/2020/03/29/design-docs-markdown-and-git/) much more interesting. Some open source projects have adopted a similar workflow ([Kubernetes](https://github.com/kubernetes/enhancements/tree/master/keps), [Go](https://github.com/golang/proposal)).

A new document starts its life as a text file (using whatever markup language your team/company prefers). The document is submitted for review, and the people who need to be consulted are added as reviewers. They can comment on the document, and the author can address the comments and mark them as resolved. It's clear which state the document is in: it's either in review, committed, or rejected. With this approach you also end up with a clear history: as time moves on, you can amend the document by submitting a change, and the change goes through the same process.

Newcomers will find the document in the repository, and if they want to see the conversation they can open the review associated with the original change. They can also see how the document evolved over time. It's also easy to publish these documents on an internal website, using a static site generator for example.

One thing I think is critical: all of this is done with the tools engineers already use for their day-to-day job: a text editor, a version control system, a code review tool.

There are obviously challenges with this approach too:

- **it's more heavy-handed**: not everyone likes to write in a text editor using a markup language;
  it can require some time to learn or get used to the syntax
- **it's harder to integrate schemas / visuals**: but having them checked in to the repository also improves discoverability

It's also true that not all documents suffer the same discoverability challenges:

- meeting notes are usually linked to meeting invites (however, if you were not part of the meeting, you end up with the same challenges to discover them)
- drafts for communications are usually not relevant once the communication has been sent
- interview notes are usually transferred to some HR tool when the feedback is submitted

diff --git a/content/blog/leaving-twitter.md b/content/blog/leaving-twitter.md
deleted file mode 100644
index f7d98f5..0000000
--- a/content/blog/leaving-twitter.md
+++ /dev/null
@@ -1,14 +0,0 @@
+++
title = "Leaving Twitter"
date = 2022-01-15
[taxonomies]
tags = ["work"]
+++

January 7th 2022 was my last day at Twitter, after more than 7 years at the company.

The first few years I worked as an SRE in the core-storage team, with the PUB/SUB and key-value store teams.

I spent the last four years working with the Compute team, both maintaining and operating our (very large) Aurora/Mesos clusters, and working on the adoption of kubernetes, both for our data centers and for the cloud. Working with Compute was extremely fulfilling to me, as I worked closely with our hardware engineering and kernel/operating system teams.

During these 7 years, I was constantly pushed by my coworkers to grow and to step up to new challenges, and I learned a tremendous amount about running large scale distributed systems. I'm extremely grateful for that experience; it was by far the most interesting and challenging job I've had so far.

diff --git a/content/blog/nix-raid-systemd-boot.md b/content/blog/nix-raid-systemd-boot.md
deleted file mode 100644
index de68695..0000000
--- a/content/blog/nix-raid-systemd-boot.md
+++ /dev/null
@@ -1,53 +0,0 @@
+++
title = "Workaround md raid boot issue in NixOS 22.11"
date = 2023-01-10
[taxonomies]
tags = ["nix"]
+++

For about a year now I've been running [NixOS](https://nixos.org/ "NixOS") on my personal machines. Yesterday I decided to go ahead and upgrade my NAS from NixOS 22.05 to [22.11](https://nixos.org/blog/announcements.html#nixos-22.11). On that machine, all the disks are encrypted, and there are two RAID0 devices. To unlock the drives, I log into the [SSH daemon running in `initrd`](https://nixos.wiki/wiki/Remote_LUKS_Unlocking), where I can type my passphrase. This time, however, instead of a prompt to unlock the disk, I see the following message:

```
waiting for device /dev/disk/by-uuid/66c58a92-45fe-4b03-9be0-214ff67c177c to appear...
```

followed by a timeout, and then I'm asked if I want to reboot the machine. I reboot, and the same thing happens.

Now, and this is something really great about NixOS, I can boot into the previous generation (on 22.05), and this time I'm prompted for my passphrase, the disks are unlocked, and I can log into my machine. This eliminates the possibility of a hardware failure! It also gives me a working machine to do more builds if needed. Knowing that I can easily switch from a broken generation to a working one gives me more confidence in making changes to my system.

I then reboot again into the broken build, and drop into a `busybox` shell. I look at what `blkid` reports, and confirm that my disks are all present and that they have a **UUID** set.
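Put together, the checks from the `busybox` shell (this one and the ones just below) look roughly like this; a sketch, with illustrative device paths:

```sh
# inside the initrd shell of the broken generation
blkid                          # the md arrays are present and each reports a UUID
ls /dev/disk/by-uuid/          # ...but the by-uuid symlinks for them are missing
udevadm info /dev/md/fast      # DEVLINKS has no /dev/disk/by-uuid entry
```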
Next I check what's listed under `/dev/disk/by-uuid` and, surprise, the disks are not there. They are, however, under `/dev/disk`. Now, looking at `/nix/store`, I only see a few things, and one of them is a script named `stage-1-init.sh`. I quickly read the script, checked what it does, and confirmed that it was blocking on the disks. I looked at what `udevadm info </path/to/disk>` reported, and I could see that `DEVLINKS` was missing the path for `by-uuid`.

My laptop has a similar setup, but without RAID devices. I had already updated it to 22.11 and rebooted without issues. To be sure, I ran another update and rebooted, and I was able to unlock the drive and log into the machine without problem.

From here I had enough information to start searching for a similar issue. I got pretty lucky, and found these two:

- [Since systemd-251.3 mdadm doesn't start at boot time #196800](https://github.com/nixoS/nixpkgs/issues/196800)
- [Won't boot when root on raid0 with boot.initrd.systemd=true #199551](https://github.com/nixoS/nixpkgs/issues/199551)

The proposed solution was easy:

```diff
@@ -43,7 +43,7 @@
   };
 
   boot.initrd.luks.devices."raid-fast".device =
-    "/dev/disk/by-uuid/66c58a92-45fe-4b03-9be0-214ff67c177c";
+    "/dev/disk/by-id/md-name-nixos:fast";
 
   fileSystems."/data/slow" = {
     device = "/dev/disk/by-uuid/0f16db51-0ee7-48d8-9e48-653b85ecbf0a";
@@ -51,7 +51,7 @@
   };
 
   boot.initrd.luks.devices."raid-slow".device =
-    "/dev/disk/by-uuid/d8b21267-d457-4522-91d9-5481b44dd0a5";
+    "/dev/disk/by-id/md-name-nixos:slow";
```

I rebuilt, rebooted, and success: I was able to get access to the machine.

## Takeaways

I now have a mitigation for the problem, but I still don't have a root cause. Since it's only the `by-uuid` path that is missing, and this is managed by `udev`, I'm guessing that some `udev` rules have changed, but so far I can't find anything about that.

It's really great to be able to easily switch back to a previous generation of my system, so I can debug and experiment with different solutions. If this had happened with another distribution, getting out of this mess would have been more tedious.

diff --git a/content/blog/no-ssh-to-prod.md b/content/blog/no-ssh-to-prod.md
deleted file mode 100644
index 9c2d20a..0000000
--- a/content/blog/no-ssh-to-prod.md
+++ /dev/null
@@ -1,29 +0,0 @@
+++
title = "No SSH to production"
date = 2022-11-28
[taxonomies]
tags = ["practices"]
+++

It's not uncommon to hear talk about preventing engineers from SSHing to production machines. While I think it's a noble goal, I think most organizations are not ready for it in the short or even medium term.

Why do we usually need to get a shell on a machine? The most common reason is to investigate a system that is behaving in an unexpected way, and we need to collect information, maybe using `strace`, `tcpdump`, `perf` or one of the BCC tools. Another reason might be to validate that a change deployed to a single machine is applied correctly, before rolling it out to a large portion of the fleet.

If you end up writing a postmortem after the investigation session, one of the reviewers might ask why we needed to get a shell on the machine in the first place. Usually it's because we're lacking the capabilities to collect that kind of information remotely.
Someone will write an action item to improve this, it will be labeled 'long-term-action-item', and it will disappear into the bottomless backlog of a random team (how many organizations have clear ownership for managing access to machines?).

In most cases, I think we would be better off breaking the problem down into smaller chunks and focusing on iterative improvements. "No one gets to SSH to machines in production" is a poorly framed problem.

What I think is better is to ask the following questions:

- who has access to the machines
- who actually SSHes to the machines
- why do they need to SSH to the machines
- was the state of the machine altered after someone logged into it

For the first question, I'd recommend that we don't create user accounts and don't distribute engineers' SSH public keys on the machines. I'd create an 'infra' user account, and use signed SSH certificates (for example with [vault](https://www.hashicorp.com/products/vault/ssh-with-vault)). Only engineers who _have_ to have access should be able to sign their SSH key. That way you've limited the risk to a few engineers, and you have an audit trail of who requested access. You can build reports from these audit logs, to see how frequently engineers request access. For the 'infra' user, I'd limit its privileges, and make sure it can only run the commands required for debugging/troubleshooting.

Using linux's audit logs, you can also generate reports on which commands are run. You can learn why the engineers needed to get on the host, and the SRE organization can use this to build services and tools that provide new capabilities (for example, a service to collect traces, or to do network captures remotely).

Using the same audit logs, look for commands that modify the filesystem (for example `apt`, `yum`, `mkdir`): if the hosts are stateless, send them through the provisioning pipeline.

At that point you've hardened the system, and you get visibility into what engineers are doing on these machines. Having engineers able to get a shell on a production machine is a high risk: even if your disks are encrypted at rest, when the host is running an engineer can see data they are not supposed to look at, etc. But I think knowing who/when/why is more important than completely blocking SSH access: there's always going to be that one incident where there's nothing you can do without a shell on that one host.

diff --git a/content/blog/tailscale-docker-https.md b/content/blog/tailscale-docker-https.md
deleted file mode 100644
index 1094ca6..0000000
--- a/content/blog/tailscale-docker-https.md
+++ /dev/null
@@ -1,127 +0,0 @@
+++
title = "Tailscale, Docker and HTTPS"
date = "2021-12-29"
[taxonomies]
tags = ["containers"]
+++

I run a number of services on my home network. For the majority of these services, I don't want to make them available on the internet; I only want to access them when I'm on my home network. However, sometimes I'm not at home and I still want to reach them. So far I've been using plain [wireguard](https://www.wireguard.com/) to achieve this. While the initial configuration for wireguard is pretty simple, it becomes a bit more cumbersome as I add more hosts/containers. It's also not easy to share keys with other folks if I want to give them access to some of the machines or services. For that reason I decided to take a look at [tailscale](https://tailscale.com/).

There are already a lot of articles about tailscale and how to use and configure it.
Their [documentation](https://tailscale.com/kb/) is also pretty good, so I won't cover the initial setup.

As stated above, I want to access some of my services, which run as docker containers, from anywhere. For web services, I want to use them over HTTPS, with a valid certificate, and without having to remember which port the service is listening on. I also don't want to set up a PKI in my home lab for that (and I'm not interested in configuring split DNS either); instead I prefer to use [let's encrypt](https://letsencrypt.org/) with a proper subdomain that is unique for each service.

The [tailscale documentation](https://tailscale.com/kb/1054/dns/) has two suggestions for this:

- use their magicDNS feature / split DNS
- set up a subdomain on a public domain

Since I already have a public domain that I use for my home network, I decided to go with the second option (I'm also uncertain how I would achieve my goal with magicDNS without running tailscale inside the container).

The public domain I'm using is managed through [Google Cloud Domains](https://cloud.google.com/dns/docs/tutorials/create-domain-tutorial). I create a new record for each service I want to run (for example, `dash` for my instance of grafana), using the IP address of the tailscale node the service runs on (e.g. 100.83.51.12).

For routing the traffic I use [traefik](https://traefik.io/). The configuration for traefik looks like this:

```yaml
global:
  sendAnonymousUsage: false
providers:
  docker:
    exposedByDefault: false
entryPoints:
  http:
    address: ":80"
  https:
    address: ":443"
certificatesResolvers:
  dash:
    acme:
      email: franck@fcuny.net
      storage: acme.json
      dnsChallenge:
        provider: gcloud
```

The important bit here is the `certificatesResolvers` part. I'll be using the [dnsChallenge](https://doc.traefik.io/traefik/user-guides/docker-compose/acme-dns/) instead of the [httpChallenge](https://doc.traefik.io/traefik/user-guides/docker-compose/acme-http/) to obtain the certificate from let's encrypt. For this to work, I need to specify the `provider` as [gcloud](https://go-acme.github.io/lego/dns/gcloud/). I'll also need a service account (see [this doc](https://cloud.google.com/docs/authentication/production#providing_credentials_to_your_application) to create it). I run `traefik` in a docker container, and the `systemd` unit file is below.
The required bits for using the `dnsChallenge` with `gcloud` are:

- the environment variable `GCE_SERVICE_ACCOUNT_FILE`: it points at the service account credentials, so that `traefik` can update the DNS records for the challenge
- the environment variable `GCE_PROJECT`: the name of the GCP project
- mounting the service account file inside the container (I store it on the host under `/data/containers/traefik/config/sa.json`)

```ini
[Unit]
Description=traefik proxy
Documentation=https://doc.traefik.io/traefik/
After=docker.service
Requires=docker.service

[Service]
Restart=on-failure
ExecStartPre=-/usr/bin/docker kill traefik
ExecStartPre=-/usr/bin/docker rm traefik
ExecStartPre=/usr/bin/docker pull traefik:latest

ExecStart=/usr/bin/docker run \
  -p 80:80 \
  -p 9080:8080 \
  -p 443:443 \
  --name=traefik \
  -e GCE_SERVICE_ACCOUNT_FILE=/var/run/gcp-service-account.json \
  -e GCE_PROJECT=gcp-super-project \
  --volume=/data/containers/traefik/config/acme.json:/acme.json \
  --volume=/data/containers/traefik/config/traefik.yml:/etc/traefik/traefik.yml:ro \
  --volume=/data/containers/traefik/config/sa.json:/var/run/gcp-service-account.json \
  --volume=/var/run/docker.sock:/var/run/docker.sock:ro \
  traefik:latest
ExecStop=/usr/bin/docker stop traefik

[Install]
WantedBy=multi-user.target
```

As an example, I run [grafana](https://grafana.com/) on my home network to view metrics from the various containers / hosts. Let's pretend I use `example.net` as my domain. I want to be able to access grafana via <https://dash.example.net>. Here's the `systemd` unit configuration I use for this:

```ini
[Unit]
Description=Grafana in a docker container
Documentation=https://grafana.com/docs/
After=docker.service
Requires=docker.service

[Service]
Restart=on-failure
RuntimeDirectory=grafana
ExecStartPre=-/usr/bin/docker kill grafana-server
ExecStartPre=-/usr/bin/docker rm grafana-server
ExecStartPre=-/usr/bin/docker pull grafana/grafana:latest

ExecStart=/usr/bin/docker run \
  -p 3000:3000 \
  -e TZ='America/Los_Angeles' \
  --name grafana-server \
  -v /data/containers/grafana/etc/grafana:/etc/grafana \
  -v /data/containers/grafana/var/lib/grafana:/var/lib/grafana \
  -v /data/containers/grafana/var/log/grafana:/var/log/grafana \
  --user=grafana \
  --label traefik.enable=true \
  --label traefik.http.middlewares.grafana-https-redirect.redirectscheme.scheme=https \
  --label traefik.http.middlewares.grafana-https-redirect.redirectscheme.permanent=true \
  --label traefik.http.routers.grafana-http.rule=Host(`dash.example.net`) \
  --label traefik.http.routers.grafana-http.entrypoints=http \
  --label traefik.http.routers.grafana-http.service=grafana-svc \
  --label traefik.http.routers.grafana-http.middlewares=grafana-https-redirect \
  --label traefik.http.routers.grafana-https.rule=Host(`dash.example.net`) \
  --label traefik.http.routers.grafana-https.entrypoints=https \
  --label traefik.http.routers.grafana-https.tls=true \
  --label traefik.http.routers.grafana-https.tls.certresolver=dash \
  --label traefik.http.routers.grafana-https.service=grafana-svc \
  --label traefik.http.services.grafana-svc.loadbalancer.server.port=3000 \
  grafana/grafana:latest

ExecStop=/usr/bin/docker stop grafana-server

[Install]
WantedBy=multi-user.target
```

Now I can access my grafana instance via HTTPS (and <http://dash.example.net> redirects to HTTPS) as long as the tailscale interface is up on the machine I'm using (e.g. my desktop or my phone).
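To confirm everything is wired up, a quick check from any device connected to the tailnet; a sketch, using the placeholder domain `dash.example.net` from above:

```sh
# the plain HTTP entrypoint should answer with a redirect to HTTPS
curl -sI http://dash.example.net | head -n 3
# and the certificate served on 443 should be issued by let's encrypt
curl -svo /dev/null https://dash.example.net 2>&1 | grep -i issuer
```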
