After attempting to update the firmware in the usual manner, my Edgerouter Lite 3 failed to boot. This turned out to be due to a failure of the internal flash memory, which I have since found out is a common failure mode for this product.

My sequence of events leading up to the failure was:

  1. Backed up the config
  2. Downloaded new firmware from Ubiquiti
  3. Uploaded firmware to router – error: failed to mount image
  4. Uploaded firmware to router – error: failed to mount image
  5. Refreshed web GUI, uploaded firmware to router – error: already updated, you must reboot to apply.

I already had a console cable that I had made up from half a CAT5 patch cable and a DE9 connector for an HP switch that I used to own. As luck would have it, this had the correct pinout for the Edgerouter, giving me console access.

The baud rate is 115200. Apart from that and choosing the right COM port, the default settings in PuTTY work fine.

Powering up the Edgerouter with the console cable connected, the error looked like this:

Basically, the USB device had failed and was not mounting, so the boot loader could not find an operating system at the default boot address 0x9f00000.

At this juncture, I saw 3 options:

  1. Spend $200 on a new router to get back online again, buying time to repair and/or RMA the old unit. This would result in a spare router and -$200.
  2. Go without internet for several days while doing an RMA on the unit
  3. Attempt to repair the unit today

I probably should have chosen option 1, but perhaps unsurprisingly (if you know me), I chose option 3.

How to unbrick the Edgerouter Lite 3

Warning: Steps from here on are unsupported by Ubiquiti, probably void your warranty and could damage your hardware. You should RMA your router instead. 

Also, see update below for an easier method…

The Edgerouter Lite 3 has a USB flash drive internal to it. For $12, I picked up a replacement USB 2.0 drive. The original was 4GB; the smallest replacement from a reputable brand was 16GB (a Kingston), so I went with that.

Replacement is trivial, just be sure that the new stick is a short, skinny one (especially the sides). Here is a photo with the new stick fitted and the old alongside.

Next, I needed to load the firmware onto the new USB stick. I used a fantastic Linux rescue tool called EMRK from http://packages.vyos.net/tools/emrk/

To load the recovery image, you need to first run a TFTP server.  I installed a great Windows TFTP server from http://tftpd32.jounin.net/

Connect a patch lead from ETH0 of the Edgerouter to the PC and manually configure an IP address on the PC’s ethernet card. I used 192.168.1.1

Next, run the TFTP server and use the dropdowns to point it at the directory containing EMRK and bind it to the ethernet IP address. I also turned off security in the TFTP pane, but this might not be needed. On the first attempt I tried to use DHCP, but had issues: the Edgerouter would request an IP address (visible in the TFTP server logs, including an address being offered), but did not seem to understand the response. So I went ahead and manually configured its IP address from the boot loader console. The commands to do this were:

 
set ipaddr 192.168.1.2 
set netmask 255.255.255.0 
set serverip 192.168.1.1
set bootfile emrk-0.9c.bin

I then copied the file over TFTP and executed it with the commands:

 
tftpboot
bootoctlinux 0x9f00000

This brought up the recovery OS.

As I already had a web server running on another machine on my network, I preloaded a copy of the Edgerouter firmware onto the web server and followed the prompts to manually assign an IP address to the Edgerouter that would be compatible with my network. However you do it, you’ll need to connect ETH0 to a machine that has UBNT firmware available on it. I used HTTP but have read that FTP and possibly SFTP are also supported. If you have an internet connection available, you could even load it from the UBNT servers. Use the command:

emrk-reinstall

After downloading the firmware, the recovery software automatically installs it.

Unfortunately, upon applying power, the router still had a very similar error message – it failed to find the new memory stick. Unlike with the old stick, typing “reset” would result in a successful boot up with this new memory stick. I discovered from https://community.ubnt.com/t5/EdgeMAX/New-U-Boot-image-for-better-USB-drive-compatibility/td-p/850744 that this was due to the new memory stick not waking up in time to be detected. Adding a USB reset and delay to the boot-loader’s boot up command resolved this. The boot command is an environment variable. Show the environment variables by issuing the “printenv” command.

I first saved the old boot loader command to a new variable, oldbootcmd (use a text editor to get it right for yours, then paste it into the console). I then changed the boot command to wait, reset the USB bus, wait again and finally call the old boot command. Lastly, I saved the environment. These are the commands that I issued – yours may need adjustment:

setenv 'oldbootcmd fatload usb 0 $loadaddr vmlinux.64;bootoctlinux $loadaddr coremask=0x3 root=/dev/sda2 rootdelay=15 rw rootsqimg=squashfs.img rootsqwdir=w mtdparts=phys_mapped_flash:512k(boot0),512k(boot1),64k@1024k(eeprom)' 
setenv 'bootcmd sleep 1; usb reset; sleep 10; $(oldbootcmd)'
saveenv
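
If you would rather generate these lines than edit them by hand, here is a minimal sketch of a helper that runs on your PC, not on the router. The function name wrap_bootcmd is made up for this example. It takes the bootcmd value shown by printenv and prints the three console commands, using the same quoting style as my commands above. Other U-Boot builds may expect the more usual setenv name 'value' form, so treat the output as a starting point. It also assumes the old command contains no single quotes.

```shell
# Print U-Boot console commands that wrap an existing bootcmd
# with a USB reset and delays. Nothing is sent to the router.
# Usage: wrap_bootcmd 'OLD BOOTCMD VALUE' [DELAY_SECONDS]
wrap_bootcmd() {
    local old_cmd="$1"
    local delay="${2:-10}"
    # %s inserts the old command verbatim, so $loadaddr stays literal.
    printf "setenv 'oldbootcmd %s'\n" "$old_cmd"
    # \$ keeps $(oldbootcmd) literal for U-Boot to expand at boot time.
    printf "setenv 'bootcmd sleep 1; usb reset; sleep %s; \$(oldbootcmd)'\n" "$delay"
    printf 'saveenv\n'
}

# Example (paste your own printenv value between the single quotes):
# wrap_bootcmd 'fatload usb 0 $loadaddr vmlinux.64;bootoctlinux $loadaddr' 10
```

Copy the printed lines into the console one at a time and check them with printenv before relying on a cold boot.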

Voila – the router now boots from cold power up and I am back online for $12 plus more time than I had hoped. From here, I set the PC back to DHCP, browsed to the web GUI and uploaded the backup of my config.

UPDATE: 2018-03-25

A better way

I’ve found an easier method for creating bootable USB sticks for the Edgerouter Lite. This will still require the environment variable change over the console, but removes the need to do TFTP.

The tool is called Make EdgeOS Image, and can be found at https://github.com/sowbug/mkeosimg

It is a Linux script that takes an official Ubiquiti firmware image (as downloaded from Ubiquiti) and generates a disk image that you can copy block-for-block to a USB stick using the dd command (see instructions in the readme or on the GitHub page).
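
For reference, the dd step can be sketched roughly as below. This assumes GNU dd on Linux. The function name write_image, the file name edgeos.img and the device /dev/sdX are placeholders of mine. The mkeosimg readme remains the authority on the actual file names and options.

```shell
# Copy a disk image onto a target, normally a USB block device.
# WARNING: the target is overwritten, so double-check the device name.
# Usage: write_image IMAGE TARGET
write_image() {
    local image="$1"
    local target="$2"
    if [ ! -f "$image" ]; then
        echo "image not found: $image" >&2
        return 1
    fi
    # conv=fsync makes dd flush the data to the device before it exits.
    dd if="$image" of="$target" bs=4M conv=fsync
}

# Example (placeholders; writing to a device usually needs root):
# write_image edgeos.img /dev/sdX
```

On Linux, lsblk run before and after plugging in the stick is a simple way to confirm which device name it was given.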

The beauty of this tool is that you can throw in your latest backup file of your router config and the script will incorporate your settings into it. I have made a habit of downloading a backup copy of my config every time I make changes, and suggest that you do the same.

Memory sticks used

When I initially had the meltdown, I purchased a Kingston 16GB USB2 drive DTSE9H/16GB:

This works well, except for the caveat with the initialisation time requiring a change to the environment variable. It was the last one that they had, so I didn’t purchase a spare.

Today, I purchased a new stick to try out Make EdgeOS Image. I ended up buying a Kingston 16GB USB3 drive since I already had a working router and figured it would be good to check whether USB3 drives play nicely as they are becoming the norm. This stick is a DTSE9G2/16GBFR and it is longer than the USB2 model, but does fit the Edgerouter (only just!):

I don’t know if this stick shares the initialisation issue of its USB2 cousin as I didn’t try connecting a console cable and adjusting the boot-up delay.

SEE UPDATE BELOW – THIS STICK FAILED AFTER 6 MONTHS.

I did however measure the time from applying power, to being able to ping Google. With the USB2 drive, I was online in 2 minutes 10 seconds. With the USB3 drive, it took 2 minutes 50 seconds.
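
I timed these by hand, but a measurement like this can be scripted. The sketch below is an assumption-laden example and not what I actually ran. The function name time_until_up is made up, and 8.8.8.8 (Google's public DNS) stands in for "pinging Google". Start it at the moment you apply power and it prints the seconds until the probe first succeeds.

```shell
# Print the seconds elapsed until a probe command first succeeds.
# Usage: time_until_up TIMEOUT_SECONDS COMMAND [ARGS...]
time_until_up() {
    local timeout="$1"
    shift
    local start
    start=$(date +%s)
    until "$@" >/dev/null 2>&1; do
        # Give up once the timeout has passed.
        if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
    echo $(( $(date +%s) - start ))
}

# Example: allow up to 10 minutes, sending one ping per attempt.
# time_until_up 600 ping -c 1 8.8.8.8
```

The result is only accurate to a second or two, since each failed ping also waits for its own timeout.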

I had expected that, with the flash memory in the USB3 drive being faster, there might be a speed improvement (depending on whether the flash or the USB2 interface was the bottleneck on the USB2 model). It was indeed surprising that the USB3 stick was so much slower. Either way, these routers are very slow to boot, so one must be patient for the harrowing few minutes it takes to see whether a flash replacement was successful.

I now have a spare USB stick ready to go, should it ever be required. I’ve also researched some industrial grade USB drives, but have yet to purchase any. They are very pricey, about $80 for a 4GB, compared to the $11 that I paid for the 16GB commodity USB2 drive.

A theory about the failures

One of the fixes mentioned in firmware release v1.10.0 (which I was updating to when my router got bricked) piqued my interest:

[SNMP] – Improve snmp performance by moving cache from flash storage to tmpfs.

About 2 months prior to the failure (after 7 months of flawless operation), I had enabled SNMP on my router and configured Munin to monitor it to produce graphs. Munin collects data every minute. Although the change is cited as being for performance reasons, I’m suspicious that if SNMP was getting cached to the flash, and my Munin node was hitting it every minute, this may have contributed to abnormally high wear on the flash. What do you think?
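
To put a rough number on this theory, here is a small sketch that counts polls over a period. It only counts polls. How many flash writes each poll caused, if any, and how many writes the stick could endure are unknowns that I am not trying to guess here. The function name polls_in_period is made up for this example.

```shell
# Count SNMP polls over a period of days.
# Usage: polls_in_period INTERVAL_SECONDS DAYS
polls_in_period() {
    # 86400 seconds per day, divided by the polling interval, times days.
    echo $(( 86400 / $1 * $2 ))
}

polls_in_period 60 1    # one day at a one-minute interval: 1440
polls_in_period 60 60   # roughly two months at that rate: 86400
```

So a one-minute poll works out to 1,440 cache refreshes a day, or about 86,000 over the two months before the failure.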

Update 2018-08-11 – the Kingston stick failed

Attempting to log into my router to change a setting, I was asked to agree to a EULA. This is common after a firmware upgrade, but I’m reasonably sure I’d logged in since the last upgrade.

Upon submitting the ubnt login form (with agreement to the EULA), I received a 500 Internal Server Error. This was consistent across multiple browsers and computers. SSH access continued to work.

Inspecting the web server log at  /var/log/lighttpd/error.log  I saw errors in a python script that was trying to touch a file to store the fact that I’d agreed to the licence. Unfortunately this file ( /root.dev/www/eula ) could not be accessed by python. Indeed, I discovered that /root.dev appeared to be empty and also read-only – no www folder existed at all and I couldn’t create one.
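
A quick way to confirm the read-only state over SSH is to look at the mount options. The sketch below assumes that /root.dev is a mount point listed in /proc/mounts and that awk is available on the router, neither of which I have verified on every firmware version. The function names are made up for this example.

```shell
# Print the mount options for a mount point (last matching entry wins).
# Usage: mount_opts MOUNTPOINT [MOUNTS_FILE]
mount_opts() {
    awk -v mp="$1" '
        $2 == mp { opts = $4 }
        END { if (opts != "") print opts; else exit 1 }
    ' "${2:-/proc/mounts}"
}

# Succeed if the mount point carries the "ro" option.
# Usage: is_read_only MOUNTPOINT [MOUNTS_FILE]
is_read_only() {
    case ",$(mount_opts "$@")," in
        *,ro,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
# is_read_only /root.dev && echo "/root.dev is mounted read-only"
```

Linux often remounts a filesystem read-only after I/O errors, so a result like this on a router that has not been touched is a strong hint that the stick is on its way out.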

Upon rebooting, the router failed to boot up, with similar errors to those at the beginning of this article. Swapping in my backup USB stick allowed me to boot up. Furthermore, with the latest mkeosimg from GitHub, I was able to re-burn my backup memory stick with the latest firmware and config file.

I would ask why this router is frying memory sticks, except that only hours ago I was in the store trying to return a failed Kingston RAM module… I’ll try to find a better quality USB stick tomorrow.

Update June 2024

So it’s been almost 6 years and my router is still going strong. My parents have one as well that’s also been going for several years now. My folks had their external 12V power supply die during the COVID lockdown, but otherwise it’s been smooth sailing. (Luckily they had a 12V DC power supply with a barrel jack from an external hard drive to swap in.)

I believe I know the ROOT CAUSE for this issue.

On version 1.10.0, released 15 Feb 2018, there is the following bugfix mentioned in the release notes:

[SNMP] – Improve snmp performance by moving cache from flash storage to tmpfs. This also fixed random kernel crashes when SNMP updating cache in tight loop

While they may have done this for performance, I believe that since I was monitoring my infrastructure via SNMP every 5 minutes, the router was thrashing the flash and burning it out by rewriting the SNMP data to it so often. tmpfs is a RAM-backed virtual filesystem, so data cached there is not written to the flash.