Oracle Cloud Server Failed Restart - Postmortem
This site is currently hosted on an Oracle Cloud “Always Free” server. I migrated here from Digital Ocean, where I was paying $8/month ($6/month for a droplet and $2/month for block storage) for only 1GB of RAM; I now have 6GB. The Oracle Cloud Always Free tier is a really good deal, and comes with a generous Free Trial that I never got around to actually making use of.
I made the switch back in November, and although setting it up was more complicated than I thought it needed to be – with the networking and firewall settings in a whole different part of the dashboard than the actual compute instance, and kind of complicated at that – since then things had been running quite smoothly. With the extra RAM I was able to host some other fun things on my server that I couldn’t before, and that if you’re a hacker you will probably be able to find ;) Please don’t destroy my nice little site, though.
If you’re thinking of doing this yourself, this is the guide I used for configuring the firewall stuff for my server, which I found to be the only part that wasn’t intuitive off the bat.
For reference I am running Ubuntu 22.04 Minimal on the VM.Standard.A1.Flex shape with 2 OCPUs and 6GB RAM.
The Problem
The problem arose the other day when I logged into my server and saw that it wanted a restart so it could install some package updates. Just out of habit, thinking my server would bounce right back up after a couple of minutes, I did your basic sudo reboot. After a couple of minutes, though, my website still wasn’t available, and I still couldn’t SSH in. The Oracle dashboard said that my instance was running, so I figured I would simply give it another restart via the dashboard.
Given there clearly was something going on, I should have checked the box to force the restart, but since I didn’t, my box got stuck in the “stopping…” state for quite a while. When I tried to force restart while it was stuck, I got some error that the instance was currently being modified, like this person.
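The force checkbox in the dashboard has a command-line counterpart. As a minimal sketch, assuming the OCI CLI (the oci command) is installed and configured for your tenancy, the helper below builds an "oci compute instance action" call. The RESET action should correspond to the forced power cycle and SOFTRESET to the graceful one. The function names and the example OCID are hypothetical, and this is not what was run during the incident described here.

```python
from __future__ import annotations

import shutil
import subprocess


def reset_command(instance_ocid: str, force: bool = True) -> list[str]:
    """Build the OCI CLI call that reboots an instance.

    force=True  -> RESET     (power off, then power back on immediately)
    force=False -> SOFTRESET (ask the OS to shut down gracefully first)
    """
    action = "RESET" if force else "SOFTRESET"
    return [
        "oci", "compute", "instance", "action",
        "--instance-id", instance_ocid,
        "--action", action,
    ]


def reset_instance(instance_ocid: str, force: bool = True) -> str:
    """Run the command and return the CLI's output (assumes a configured oci CLI)."""
    if shutil.which("oci") is None:
        raise RuntimeError("the oci CLI was not found on PATH")
    result = subprocess.run(
        reset_command(instance_ocid, force),
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical OCID; this only prints the command, it does not run it.
    print(" ".join(reset_command("ocid1.instance.oc1..example")))
```

Keeping the command builder separate from the function that executes it makes it easy to print and double-check the call before pointing it at a real instance.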
What I Tried
Thankfully, I guess since other people had this problem and complained about it to Oracle, they’ve implemented a new policy where an instance that’s in the “stopping” state for too long will be powered off after 15 minutes. It was definitely longer than that for me with this first restart; well over half an hour until my instance was “stopped” and I could try forcing it to bounce some more.
The very first thing I tried was to log in via a cloud shell connection, providing it the same public key I normally use to SSH in, but I ran into a problem where it kept asking for login credentials but not accepting the ones for my account. I’m not sure if the default credentials would work; disabling the default user is usually one of the first things I do for not-getting-hacked reasons. So, that quickly proved to be a dead end. I also tried the “Create local connection,” but it didn’t work, probably for the same reason that SSHing normally didn’t work. That allowed me to rule out a DNS issue, though, at least. Finally, I tried opening up the “Cloud Shell” at the top of the Oracle Cloud window and then going to my instance from there, but I don’t use that Cloud Shell feature much and didn’t get any better results.
I also tried feeding it a command via the “Run Commands” menu but the box never delivered a result.
At this point, I had force-restarted it multiple times, always with the same result: the box started back up, Oracle Cloud’s dashboard showed that it was running, but neither the SSH service nor nginx was working. Giving -v to SSH showed that it was getting stuck on “pledging the filesystem,” but I don’t totally know what that means, and based on my unsuccessful Google searches this doesn’t seem to be a very common problem.
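When the dashboard says "running" but nothing answers, it can help to tell apart "the port is closed or filtered" from "the port accepts connections but the service behind it is hung." Here is a minimal sketch using only Python's standard library. The probe function and its return strings are made up for illustration, and the hostname in the usage comment is a placeholder. It relies on the fact that an SSH server sends an identification line (starting with SSH-2.0-) as soon as a client connects, whereas an HTTP server such as nginx stays silent until it receives a request.

```python
from __future__ import annotations

import socket


def probe(host: str, port: int, expect_banner: bool = False,
          timeout: float = 5.0) -> str:
    """Classify how a TCP port responds.

    Returns one of: "refused", "timeout", "error: ...", "open",
    "open, no banner", "open, closed without banner", "open: <banner>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            if not expect_banner:
                return "open"  # TCP handshake succeeded; nothing more to check
            sock.settimeout(timeout)
            try:
                data = sock.recv(256)
            except socket.timeout:
                # Connected, but the service never spoke: likely hung.
                return "open, no banner"
            if not data:
                return "open, closed without banner"
            first_line = data.splitlines()[0].decode("ascii", "replace")
            return "open: " + first_line
    except ConnectionRefusedError:
        return "refused"       # host is up, nothing listening on the port
    except socket.timeout:
        return "timeout"       # packets dropped (firewall) or host not responding
    except OSError as exc:
        return f"error: {exc}"


# Usage (placeholder hostname):
#   probe("example.com", 22, expect_banner=True)   # SSH should greet us
#   probe("example.com", 80)                       # nginx: connect only
```

A result of "open, no banner" on port 22 would suggest that the network path and firewall rules are fine and the problem is on the instance itself, while "timeout" points more toward networking or a box that never finished booting.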
What Worked… I Guess…
Take this section with a grain of salt, because I can’t say I ever properly diagnosed the problem and figured out a fix for it like this Reddit commenter (whose comment I really wish I had found when I was going through this) did.
After 90 minutes or so of dead ends, I decided my best option was to spin up a second instance and migrate my boot volume to it. This was complicated by my Oracle Cloud Availability Domain (for the Always Free plebeians, I guess) not having any more boxes of the same shape as mine available. I decided to migrate to a “Micro” shaped box and then eventually migrate back once I was able to snap up another Flex instance, although obviously I had my reservations about this plan.
I managed to get a backup of my instance’s boot volume, then powered off my instance and detached the boot volume. I made the Micro instance, powered it off and detached its boot volume. But then I couldn’t figure out how to swap the two boot volumes – even though the instances used the same image, maybe they would need to use the same shape, too?
At this point I decided to backtrack on my rash and hare-brained plan to significantly downgrade my box, so I terminated the new Micro instance, re-attached the boot volume to my main (Flex) instance and powered it on…
And voilà! Or something. For whatever reason, my instance was now back up and working properly. I don’t know if detaching and re-attaching the boot volume was what did it, or if that 4th or 5th reboot was the charm, or if the backup process had some beneficial side effect, or if enough time went by that things magically got fixed. I’m certainly glad they did, though.
Naturally, customer support options are pretty limited on the Always Free tier. Personally, I’m not the kind of person to say, “this Always Free instance isn’t very reliable, so I guess I have to make a paid instance.” Instead I’m more likely to think, “this Always Free instance isn’t very reliable, and this is my only experience with Oracle Cloud, so their paid tier probably isn’t as reliable as Digital Ocean. So, I’d rather migrate back to Digital Ocean than pay for Oracle Cloud.” That’s just me, though. Maybe Oracle, being an Oracle, was able to cosmically sense that they weren’t going to get me to pay for some customer support, and that’s why things magically clicked into place. Who knows.
Although I’m skeptical that it will, I hope this datapoint can be of some use or comfort to somebody. The morals of the story:
- Beware of rebooting your Ubuntu Oracle Cloud instance. Keep things up to date, but don’t do it totally willy-nilly like I did. I guess. (This is a pretty dissatisfying moral.)
- As soon as things go sideways in any way, always use the force shutdown/force reboot option in Oracle Cloud, or you will end up waiting longer than the promised 15-minute maximum.
- Backing up, detaching, and then re-attaching your boot volume might help? Maybe? In any case, backing up your data never hurts!
- You get what you pay for and I, my friends, am a cheapskate.
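For the first moral, one small way to be less willy-nilly is to look at what is actually asking for the reboot before doing it. This is a minimal sketch, assuming Ubuntu's usual convention of creating /var/run/reboot-required (and listing the responsible packages in /var/run/reboot-required.pkgs) when an update needs a restart. The function name is hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def pending_reboot(flag: Path = Path("/var/run/reboot-required")) -> list[str] | None:
    """Return None if no reboot is pending.

    Otherwise return the sorted, de-duplicated list of packages that asked
    for one (empty if the .pkgs file is missing).
    """
    flag = Path(flag)
    if not flag.exists():
        return None
    pkgs_file = flag.with_name(flag.name + ".pkgs")
    if not pkgs_file.exists():
        return []
    return sorted(set(pkgs_file.read_text().split()))


if __name__ == "__main__":
    packages = pending_reboot()
    if packages is None:
        print("No reboot pending.")
    else:
        print("Reboot requested by:", ", ".join(packages) or "(unknown)")
```

Seeing a kernel or libc package in that list is a good hint to take a boot volume backup first and to pick a time when some downtime is tolerable.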