The Great Namecoin Mining Brownout of June 17-20, 2018: Postmortem

Recently, many Namecoin mining pools experienced an outage, causing transaction confirmation to slow down to 2.1 blocks/hour. We’ve fixed the issue, and mining is back to normal. This post summarizes what we know about the outage.

The Cause

On June 17, an unknown person appears to have built a pair of name_new and name_firstupdate transactions with the raw transaction API, and broadcast them both to the Namecoin P2P network at the same time. This is not something that the normal Namecoin Core GUI or RPC API will let you do, because revealing your name_firstupdate transaction before the name_new has been confirmed allows 3rd parties to front-run your registration. However, there’s nothing that prevents users from doing this via the raw transaction API. Namecoin Core’s consensus rules require name_firstupdate transactions to wait 12 blocks after name_new before they can be mined; any name_firstupdate transactions whose name_new input doesn’t yet have 12 confirmations will be admitted to the memory pool but will not be mined. Unfortunately, there was a bug in Namecoin Core’s mining code: while it was correctly choosing not to mine name_firstupdate transactions whose name_new input had between 1 and 11 (inclusive) confirmations, it was erroneously trying to mine name_firstupdate transactions whose name_new input had zero confirmations. As a result, getblocktemplate was building an invalid block, and then returning an error upon detecting that the block was invalid, which DoSed the mining pools.

Our Initial Response

Cassini reported on June 18 that, during the prior 24 hours, only ViaBTC and BTC.COM had mined any Namecoin blocks. He wasn’t sure what the cause of the outage was, but he did note that it didn’t seem to be related to BTC versus BCH mining, since ViaBTC was still using both parent chains for Namecoin mining as usual. The obvious thing to do was to contact the mining pools to figure out whether they were seeing any errors on their end. Jeremy began contacting the mining pools. While we were waiting for mining pools to respond, we tried to analyze possible failure modes. Cassini speculated that maybe an accidental consensus fork had occurred, and wondered whether the two pools who were still online had changed anything about their setup recently (e.g. updating their Bitcoin Core or Namecoin Core client). Jeremy noted that clearly the pools who were still online hadn’t accidentally activated a hardfork, since Jeremy’s Namecoin Core node from 2 years ago was still accepting their blocks. To verify whether an accidental softfork had been activated, Jeremy asked Redblade7 (who runs a Namecoin seed node) to check the output of namecoin-cli getchaintips. The output revealed that no forks had been observed since June 3 (and that fork was only a single orphaned block), which made it clear that no softfork had occurred. (In addition, the two pools who were still up composed a minority of the usual hashrate, which again pointed to this not being an accidental softfork.)

Response from Miners

Wang Chun from F2Pool was the first mining pool operator to get back to us; he said F2Pool was investigating the issue from their side. Shortly afterward, F2Pool’s mining came back online and mined 6 blocks. Shortly afterward, crackfoo of zpool posted on GitHub saying that he was repeatedly getting getblocktemplate errors that day. crackfoo posted this in a GitHub issue that was fairly old: mining pools had occasionally been receiving that error for well over a year, but it had been difficult to reproduce. However, the last post in that thread prior to crackfoo’s was from Yihao Peng of BTC.COM on June 17, saying that he had gotten the error that day and had fixed it by clearing his memory pool. Yihao Peng’s post was the first one that provided debug logs, which allowed us to analyze what was happening. Daniel quickly figured out that the issue was caused by the unconfirmed name_new and name_firstupdate pair, and Jeremy speculated that this was probably the reason for the outage.

Immediate Fixes

crackfoo asked whether there was any way to exclude the name_firstupdate from block creation as a temporary fix while we were waiting for a Namecoin Core update. Jeremy suggested using the prioritisetransaction RPC call to prevent the name_firstupdate from being included in blocks. Jeremy then tested this locally, and verified that it fixed mining; Jeremy sent out a message to the Alerts mailing list notifying mining pools that they could restore service by doing this. Jeremy also suggested using prioritisetransaction to accelerate the name_new transaction, since once it got mined, the problem would go away (even for mining pools who didn’t do anything). Yihao Peng from BTC.COM offered to do so, and successfully mined the name_new transaction on June 20. At this point, we observed the rest of the mining pools come back online.

Proper Fixes

Daniel has fixed the relevant behavior in Namecoin Core’s block construction code; the fix is present in both the master and 0.16 branches. Miners are encouraged to upgrade so that this situation can’t happen again.

Analysis

Obviously, any kind of mining disruption is a bad thing, since it causes transactions to confirm more slowly and also increases exposure of the Namecoin network to potential 51% attacks. In particular, when only 2 mining pools are mining blocks for a period of a day, the larger of the two pools obviously has the capacity to double-spend transactions. We’re not aware of any reason to believe that any double-spend attacks occurred, which is consistent with our general experience that the Namecoin mining pools behave ethically and try to help us fix issues. (AuxPoW researchers Paul Sztorc and Alexey Zamyatin have come to similar conclusions about Namecoin mining pools.) In particular, BTC.COM (the pool that would have been capable of double-spending) was also the pool who reported debug logs to us and accelerated the name_new transaction to fix the problem for the other pools.

Generally speaking, our incident response went quite well. In particular, service stayed up and running throughout the 3-day incident, although it took ~3x longer to get transactions to confirm than usual. Compared to the previous notable outage, from 2014, where the network was totally unusable for 2 weeks, things certainly went a lot better this time around. However, we wouldn’t be doing our jobs if we didn’t propose some additional steps we can take to further improve:

Streamline the process of contacting mining pools. We already have an alerts mailing list that was set up long ago, but the alerts list is ill-suited to cases where we suspect a problem with a particular mining pool and don’t want to spam the other pools with noise. Improving this is already on our radar, due to some upcoming softforks (yes, we’re finally activating P2SH and SegWit soon!) on which we need to closely coordinate with mining pools (and other service providers, such as exchanges, registrars, block explorers, ElectrumX operators, inproxies like OpenNIC, and analytic websites like CoinMarketCap, BitName.ru, and Blockchain-DNS.info).
Automated monitoring. Significant progress was made on this since the 2014 outage. In particular, we have a free software script that can calculate hashrate distribution, and Cassini runs this script regularly. However, that script is not maintained as well as it should be (both Cassini and Jeremy have some private changes that haven’t yet been merged due to lack of time to devote to it), and it would be beneficial to document exactly how to run the script in an automated fashion, with things like email or Matrix alerts when suspicious events occur. We’re currently in the middle of re-evaluating our CI infrastructure (yes, we are aware of the GitHub buyout by a PRISM member and ICE supplier, and we’re not happy about it), and this is definitely one area that we’ll be exploring. Coincidentally, several of our developers recently obtained a Talos II from Raptor Computing Systems (yes, they are awesome, you should support Raptor), so this may give us additional infrastructure options.

And, to complement the above, some things that aren’t critically needed for this particular issue:

Quick release of binaries for emergency fixes. While this would be highly useful for other reasons (and we’re exploring this possibility for CI infrastructure), we’ve been sufficiently successful at making Namecoin Core build reliably from source that the mining pools usually build Namecoin Core themselves. At the time of the 2014 outage, Namecoin was a pain in the rear to build and a lot of mining pools were using our binaries. Not a problem anymore.
Adopt support for Matt Corallo’s decentralized pooled mining protocol. While this would have substantial benefits in terms of both total hashrate and hashrate diversity, the problem at the root of this issue is a matter of (1) effective communication with miners and (2) attentive miners, both of which are made mildly worse by a more diverse hashrate distribution. We do intend to pursue this as a long-term goal (it’s a hardfork, meaning that adopting it is a major bother), and we think it’s highly important for Namecoin to improve in this area, even though a more diverse hashrate is less effective for communication and attentiveness.
Contingency plans for developers who are busy with external obligations. Although more redundancy is always helpful, we currently have some developers who have sufficient funding for Namecoin to be their primary project. This was not the case in 2014, when our primary responders were dealing with unrelated business trips or school coursework. As a result, response to critical issues is substantially more reliable now than it was 4 years ago. Don’t get me wrong, we still need more developers, and we still would benefit from more funding – but things are clearly moving in the right direction.

Thanks

The following people helped respond to the outage:

Cassini
Jeremy Rand
Daniel Kraft
Yihao Peng (BTC.COM)
crackfoo (zpool)
Wang Chun (F2Pool)
Redblade7
Luke Dashjr