Observability in the Lightning network
Lightning network growing up - from #reckless to professional
Since the invention of modern electronic payment systems in the 1950s, electronic payments have become a normal part of our daily lives. Over the years, this adoption has been driven by technical, economic, and societal factors. The fundamental reason for mass-market adoption is their ability to transfer value faster and over longer distances than previous means. One thing that has always been missing is interoperability. Because these networks are closed, you can usually transfer value only within the same network, or at best between two. With the lightning network, those constraints are a thing of the past.
The lightning network is open to everyone who wants to participate, as long as they have an internet connection. An open, distributed payment network comes with its own complexities. The properties that cause them, privacy and openness, are the same ones that make the lightning network good, but we must still be aware of the complexities.
What defines a good payment network?
The first property is obvious - the ability to transfer value. But being able to transfer value does not mean you want to spend 3 minutes staring apologetically at your barista while you figure out that your private node is down, switch to a secondary wallet, hope that its operators do a better job than you of keeping the channels balanced, wait for your wallet to find a path to the coffee shop's node that actually works, and then finally pay for what is by now only lukewarm coffee. This brings us to two other vital properties - predictability and reliability.
We want to be reassured that every payment will go through. Nobody wants a lightning network equivalent of “sir, your card was declined.” That is reliability. We also want payments to complete at a predictable speed, so we are not forced into unnecessarily long social situations that we're all avoiding after the last two years. That is predictability. Let's define both properties a bit more clearly.
Reliability is the ability of a system to perform a function without failure. It is also one of the key properties of any payment system we'd like to use. Standards of reliability in value transfer were set with cash. It works every time and everywhere. Everything invented later is just trying to get close to it. We all agree that the lightning network as a whole is an excellent piece of technology, but for it to replace a significant portion of the world's payment volumes, reliability will have to be top of mind.
Predictability is a system property that lets a user or machine rely on the expected outcomes of a system, given its current state. More importantly, we think it is crucial for adoption and user experience: users generally expect a better experience if they are going to switch to a new technology or platform. And while lightning payments really can be lightning fast, that is not always the case - your liquidity or your peers' liquidity might be lacking, your wallet might need a bit longer than usual to find a route, and so on. With improved visibility and better, more informed pathfinding, this can be a thing of the past.
The above sets the stage for the premise of this article - observability of the lightning network. We'll focus on the two properties defined above and leave others for another time.
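To make the two properties a little more concrete, here is a minimal sketch in Python of how they might be measured from a log of payment attempts. The PaymentAttempt record, its fields and the choice of a nearest-rank percentile are assumptions made purely for illustration; they do not describe any particular wallet's or node's data format.

```python
import math
from dataclasses import dataclass


@dataclass
class PaymentAttempt:
    """Hypothetical record of one payment attempt (not a real node API)."""
    succeeded: bool
    duration_ms: float  # time from initiating the payment to its final result


def success_rate(attempts):
    """Reliability: the fraction of attempts that settled successfully."""
    if not attempts:
        raise ValueError("no payment attempts recorded")
    return sum(a.succeeded for a in attempts) / len(attempts)


def latency_percentile(attempts, pct):
    """Predictability: nearest-rank percentile of successful payment durations."""
    durations = sorted(a.duration_ms for a in attempts if a.succeeded)
    if not durations:
        raise ValueError("no successful payments recorded")
    rank = max(1, math.ceil(pct / 100 * len(durations)))
    return durations[rank - 1]


# Made-up sample data, for illustration only.
attempts = [
    PaymentAttempt(True, 300),
    PaymentAttempt(True, 450),
    PaymentAttempt(False, 5000),
    PaymentAttempt(True, 2100),
    PaymentAttempt(True, 380),
]
print(f"success rate: {success_rate(attempts):.0%}")
print(f"p50 latency: {latency_percentile(attempts, 50)} ms")
print(f"p95 latency: {latency_percentile(attempts, 95)} ms")
```

A low success rate points to a reliability problem, while a wide gap between the median and the 95th percentile suggests that payments usually work but not at a speed you can count on.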
What is this observability you are talking about?
The term observability originated decades ago in control theory, but in more recent times it has increasingly been applied to distributed IT systems and their performance. In IT systems, we use different types of telemetry data - metrics, logs, and traces - to provide visibility into our systems and enable teams to uncover complex bugs and eliminate performance bottlenecks. Part of that translates to the lightning network itself - if our node is not working, then no matter how balanced our channels are, how well connected our peers are and how cheap the fees to our destination are, our payment will not go through. But having our node(s) working is only where this story begins.
At the start of this article, we defined the ability to transfer value as the primary utility of a payment network. This means that, besides our own node, the counterparty receiving the value must be working too, and in the case of the lightning network, so must every hop on our path to the destination.
When sending or routing payments, two things matter the most: how much capacity you have available to accept payments (inbound capacity) or send them (outbound capacity), and who you are connected to (open channels). Those two factors primarily define your ability to transact on the network and, in the case of routing nodes, to earn. But as with many things in life, the devil is in the details. If you want the best experience transacting on the lightning network, you should pay attention to many more factors:
Your node’s positioning in the graph (which depends significantly on your use case)
are you a merchant who will mostly transact with users in a small region and maybe pay some local vendors in bitcoin
are you an LSP trying to provide global services to clients
are you a routing node trying to generate yield on bitcoin
etc…
Your peers’ position and connectivity (and those of the peers of your peers)
your 2-3 hop neighborhood is something that will significantly factor into your payment success rates and pathfinding attempts, thus impacting predictability and reliability
if all your channel peers run on residential connections, their availability and reliability can become a problem, as people rarely have redundant power lines and internet connectivity at home, not to mention that hardware dies
if all your channel peers are geographically close together, or all of them run at the same cloud provider, you might find yourself without a lot of liquidity when providers experience issues (AWS reported incidents bringing down entire regions or availability zones in the last 12 months) or when natural disasters hit and there are wide power outages
Higher latency in routing and pathfinding
while the Tor network provides a lot of benefits in terms of privacy and avoiding censorship, it brings some downsides as well - speed and latency are not exactly great, which can cause a lot of friction if you're doing high volumes of transactions (high latency on routing/pathfinding, timeouts causing locked capital and user dissatisfaction)
even with the best connectivity and hardware, a round trip between New York and Singapore or Sydney can be on the order of a couple of hundred milliseconds, which adds up. If your payment ends up travelling around the world several times, even the speed of light won't make it unnoticeable
Peers and neighborhood liquidity (you might be able to push large payments to your peers, but they might have only two channels large enough to handle them)
you might have great wumbo channels that are perfectly rebalanced, but what about your peers? Your 10 BTC channel won't be of much use if, two hops out, all the nodes on your path have channels smaller than 2M sats. Sure, you can use MPP, but wouldn't you rather plan ahead, deploy your capital in the best possible way and avoid additional complexity where it is not necessary?
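The last two factors, latency and neighborhood liquidity, can be illustrated with a small Python sketch that summarizes a candidate path. The Hop record and all of its values are hypothetical. In particular, real channel balances are not public, so outbound_sats here stands in for whatever estimate you have.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    """Hypothetical view of one hop on a candidate path (illustrative only)."""
    alias: str
    outbound_sats: int    # liquidity assumed available in the forwarding direction
    round_trip_ms: float  # assumed network round trip to the next node


def bottleneck_sats(path):
    """Largest single-part payment the path can carry: its smallest hop."""
    return min(hop.outbound_sats for hop in path)


def total_round_trip_ms(path):
    """Rough lower bound on network delay: per-hop round trips summed."""
    return sum(hop.round_trip_ms for hop in path)


# Made-up example path, for illustration only.
path = [
    Hop("our-node", 1_000_000_000, 20),   # a 10 BTC channel
    Hop("peer", 50_000_000, 180),
    Hop("two-hops-out", 2_000_000, 220),  # a 2M sat channel caps the whole path
]
print(f"bottleneck: {bottleneck_sats(path)} sats")
print(f"network round trips: {total_round_trip_ms(path)} ms")
```

However large the first channel is, the path can carry no more in a single part than its smallest hop, and every intercontinental hop adds its round trip to the total.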
The above is just an introductory list of important factors to consider, data points to monitor and changes to possibly implement. What we aim to do at bolt.observer is provide you with insights into the observable universe that is the lightning network, remove part of the complexity of running your nodes, improve your flows and give you back some of your precious time to spend on building new amazing things in this great ecosystem.
As with all things, one has to start somewhere. We started with a product that everyone running a node can use - monitoring the reachability of their lightning node and, with it, their ability to transact on the network and route payments. We all consider ourselves great node operators with exceptional skills, but no matter how blessed our lives are, bad things still occasionally happen. From a hardware failure or misconfigured firewall to power and internet outages, whatever happens, knowing about it as soon as possible is always a good thing.
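As a rough illustration of what a reachability check involves at its simplest, the Python sketch below tries to open a TCP connection to a node's listening port. It is an assumption-laden first approximation and not a description of how bolt.observer performs its checks: it does not complete the lightning handshake, it will not work as-is for Tor-only nodes, and the default of 9735 is simply the port lightning nodes conventionally listen on.

```python
import socket


def is_reachable(host, port=9735, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

Calling is_reachable("node.example.com"), where the hostname is a placeholder, from a machine outside your own network and on a schedule is the basic idea: a failed check is the signal to go and look for the hardware, firewall, power or internet problem behind it.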
Reachability of your lightning node is the first service we, at bolt.observer, are releasing into the wild, but far from the last. We will be releasing tools that make life easier for everyone using lightning, from LSPs, merchants and plebs to sophisticated node operators. These tools will give you access to data for better insight into the lightning network, help you deploy your bitcoin more efficiently, provide real-time monitoring, and let you build your own solutions on top of our data platform, either by integrating with our APIs or simply by using our data to make more informed decisions.
Using the right tools helps you be a more efficient node operator and gives you more time to focus on serving your clients. Make sure you check out bolt.observer, and feel free to reach out by email, Twitter or Telegram with any questions.
If you’d like to read more about our views on financial metrics in the lightning network, check out our previous article - Lightning network financial metrics