
UNDERSTANDING OBSERVABILITY VS. MONITORING. PART 1

The rise of cloud computing, the DevOps movement, and distributed microservice-based architectures have combined to make observability vital for modern software systems. We’re going to dive into what observability is and how to approach the metrics we need to track.

Observability is a way of spotting and troubleshooting the root causes of problems involving software systems whose internals we might not understand. It extends the concept of monitoring, applying it to complex systems with unpredictable and/or complex failure scenarios.

I’ll start with some of the basic principles of observability that I’ve been helping to implement across a growing number of products and teams at Nord Security.

 


Monitoring vs. Observability

“Monitoring” and “observability” are often used interchangeably, but these concepts have a few fundamental differences.

Monitoring is the process of using telemetry data to understand the health and performance of your application. Monitoring telemetry data is preconfigured, implying that the user has detailed information on their system’s possible failure scenarios and wants to detect them as soon as they happen.

In the classical approach to monitoring, we define a set of metrics, collect them from our software system, and react to any changes in the values of these metrics that are of interest to us.

For example:

  • Excessive CPU usage can indicate that we need to scale the system up to compensate for increasing load;

  • A drop in successfully served requests after a fresh release can indicate that the newly released version of the API is malfunctioning;

  • Health checks report a binary metric that represents whether the system is alive at all.
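The three examples above can be sketched as preconfigured threshold checks. This is only a minimal illustration, not a production alerting setup: the field names (`cpu_percent`, `success_ratio`, `healthy`) and the threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """One reading of the predefined metrics (hypothetical field names)."""
    cpu_percent: float    # CPU usage, 0-100
    success_ratio: float  # share of requests served successfully, 0-1
    healthy: bool         # result of the health check


def evaluate(snapshot: Snapshot,
             cpu_limit: float = 85.0,
             min_success_ratio: float = 0.99) -> List[str]:
    """Return an alert message for every predefined rule the snapshot violates."""
    alerts = []
    if not snapshot.healthy:
        alerts.append("health check failed: system is down")
    if snapshot.cpu_percent > cpu_limit:
        alerts.append("CPU usage high: consider scaling up")
    if snapshot.success_ratio < min_success_ratio:
        alerts.append("success ratio dropped: check the latest release")
    return alerts


if __name__ == "__main__":
    reading = Snapshot(cpu_percent=92.0, success_ratio=0.95, healthy=True)
    for message in evaluate(reading):
        print(message)
```

Every rule in this sketch has to be known in advance, which is the defining trait of classical monitoring.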

Observability extends this approach. Observability is the ability to understand the state of the system by performing continuous real-time analysis of the data it outputs.

Instead of just collecting and watching predefined metrics, we continuously collect different output signals. The most common types of signals – the three pillars of observability – are:

  • Metrics: Numeric data aggregates representing software system performance;

  • Logs: Time-stamped messages gathered by the software system and its components while working;

  • Traces: Maps of the paths taken by requests as they move through the software system.

The development of complex distributed microservice architectures has led to complex failure scenarios that can be hard or even impossible to predict. Simple monitoring is not enough to catch them. Observability helps by improving our understanding of the internal state of the system.

Metrics

Choosing the right metrics to collect is key to establishing an observability layer for our software system. Here are a few different popular approaches that define a unified framework of must-have metrics in any software system.

USE

Originally described by Brendan Gregg, this approach focuses more on white-box monitoring – monitoring of the infrastructure itself. Here’s the framework:

  • Utilization – resource utilization.

    • % of CPU / RAM / Network I/O being utilized.

  • Saturation – how much remaining work hasn’t been processed yet.

    • CPU run queue length;

    • Storage wait queue length;

  • Errors – errors per second

    • CPU cache miss;

    • Storage system fail events;

Note: Defining “saturation” in this approach can be a tricky task and may not be possible in specific cases.
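As a rough sketch of how the three USE values might be derived for a single resource, the example below works from raw readings taken over one sampling interval. The `ResourceSample` structure and its field names are hypothetical; a real collector would read these values from the operating system or a metrics agent.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """Raw readings for one resource over a sampling interval (hypothetical)."""
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the sampling interval
    queue_length: int        # work items waiting, e.g. CPU run queue length
    error_count: int         # error events seen during the interval


def use_metrics(sample: ResourceSample) -> dict:
    """Derive Utilization, Saturation and Errors from one sample."""
    return {
        # Utilization: share of the interval the resource was busy.
        "utilization_percent": 100.0 * sample.busy_seconds / sample.interval_seconds,
        # Saturation: work that is queued and not yet processed.
        "saturation_queue_length": sample.queue_length,
        # Errors: error events per second.
        "errors_per_second": sample.error_count / sample.interval_seconds,
    }


if __name__ == "__main__":
    print(use_metrics(ResourceSample(busy_seconds=45.0, interval_seconds=60.0,
                                     queue_length=3, error_count=6)))
```

Saturation is represented here by a queue length because that is the example given above; for resources without a meaningful queue, this field may have no sensible value.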

Four Golden Signals

Originally described in Google’s Site Reliability Engineering book, the Four Golden Signals framework is defined as follows:

  • Latency – time to process requests;

  • Traffic – requests per second;

  • Errors – errors per second;

  • Saturation – how full the service is, i.e. how close its most constrained resources are to capacity.

RED

Originally described by Tom Wilkie, this approach focuses on black-box monitoring – monitoring the microservices themselves. This simplified subset of the Four Golden Signals uses the following framework:

  • Rate – requests per second;

  • Errors – errors per second;

  • Duration – time to process requests.
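To make the three RED values concrete, here is a minimal sketch that computes them from a list of request records collected over a time window. The `Request` record and the choice of the 95th percentile for duration are assumptions made for this example; in practice these numbers usually come from a metrics library rather than hand-written code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """One handled request (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def red_metrics(requests: List[Request], window_seconds: float) -> dict:
    """Compute Rate, Errors and Duration for requests seen in one window."""
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    if durations:
        # Nearest-rank 95th percentile: rank = ceil(0.95 * n), in integer math.
        rank = (95 * len(durations) + 99) // 100
        p95 = durations[rank - 1]
    else:
        p95 = None
    return {
        "rate_per_second": len(requests) / window_seconds,
        "errors_per_second": errors / window_seconds,
        "duration_p95_ms": p95,
    }


if __name__ == "__main__":
    # 20 requests with durations 1..20 ms; every tenth request fails.
    sample = [Request(float(i), i % 10 != 0) for i in range(1, 21)]
    print(red_metrics(sample, window_seconds=10.0))
```

A percentile is used for duration instead of a mean because a few very slow requests can hide behind a healthy-looking average.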

Choosing and following one of these approaches lets you unify your monitoring concept throughout the whole system and makes it easier to understand what is happening. The approaches complement one another, and your choice may depend on which part of the system you want to monitor. They also don’t exclude additional business-related metrics that vary from one component of the software system to another.

Logs

System logs are a useful source of additional context when investigating what is going on inside a system. They are immutable, time-stamped text records that provide context to your metrics.

Logs should be kept in a unified structured format like JSON. Use additional log storage/visualization tools to simplify interaction with the massive amount of text data the software system provides. One very well-known and popular solution for log storage is Elasticsearch.
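One possible way to produce JSON logs with nothing but the Python standard library is a custom `logging.Formatter`. The sketch below is an illustration under assumptions: the field names in the payload, the `context` attribute, and the `orders` logger name are invented for the example and are not a standard schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured context passed via extra={"context": {...}}.
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        return json.dumps(payload)


def build_logger(name: str = "orders") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr (hypothetical name)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("payment accepted", extra={"context": {"order_id": 42}})
```

Because every line is a self-contained JSON object, a log storage tool can index fields such as `level` or `context.order_id` without parsing free text.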

Traces

Traces help us better understand the request flow in our system by representing the full path any given request takes through a distributed software system. This is very helpful in identifying failing nodes and bottlenecks.

Traces themselves are hierarchical structures of spans, where each span is a structure representing the request and its context in every node in its path. Most common tracing visualization tools like Jaeger or Grafana display traces as waterfall diagrams showing the parent and child spans caused by the request.
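The hierarchy of spans can be pictured with a small data structure. The following sketch is a simplified model, not the format of any particular tracing tool: the `Span` class, the service names, and the timings are all made up for illustration, and real spans also carry trace and span identifiers, attributes, and status.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One unit of work in a trace, with its child spans (simplified model)."""
    name: str
    start_ms: int
    end_ms: int
    children: List["Span"] = field(default_factory=list)


def waterfall(span: Span, depth: int = 0) -> List[str]:
    """Render a span tree as indented lines, parents before children."""
    lines = [f"{'  ' * depth}{span.name} [{span.start_ms}-{span.end_ms} ms]"]
    for child in sorted(span.children, key=lambda s: s.start_ms):
        lines.extend(waterfall(child, depth + 1))
    return lines


def example_trace() -> Span:
    """Build a hypothetical trace for one request crossing three services."""
    return Span("GET /checkout", 0, 120, [
        Span("order-service", 30, 110, [Span("db query", 40, 90)]),
        Span("auth-service", 5, 25),
    ])


if __name__ == "__main__":
    print("\n".join(waterfall(example_trace())))
```

Reading the indented output top to bottom gives a text version of a waterfall diagram: in this made-up trace, most of the request time is spent in the database query under `order-service`.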

Conclusion

Building an observable software system lets you identify failure scenarios and possible risks during the whole system life cycle. A combination of metrics, extensive log collection, and traces helps us understand what’s happening inside our system at any moment and speeds up investigations of abnormal behavior.

This article was just the first step. We’ve covered the standard approaches to metrics and briefly discussed traces and logs. But to implement an observable software system, we need to set up its components correctly to supply us with the signals we need. In part 2, we’ll discuss instrumentation approaches and modern standards in this field.

About Version 2 Limited
Version 2 Digital is one of the most dynamic IT companies in Asia. The company distributes a wide range of IT products across various areas including cyber security, cloud, data protection, end points, infrastructures, system monitoring, storage, networking, business productivity and communication products.

Through an extensive network of channels, point of sales, resellers, and partnership companies, Version 2 offers quality products and services which are highly acclaimed in the market. Its customers cover a wide spectrum which include Global 1000 enterprises, regional listed companies, different vertical industries, public utilities, Government, a vast number of successful SMEs, and consumers in various Asian cities.

About Nord Security
The web has become a chaotic space where safety and trust have been compromised by cybercrime and data protection issues. Therefore, our team has a global mission to shape a more trusted and peaceful online future for people everywhere.

How CISOs Can Stretch IT Security Budgets

The global annual cost of cybercrime is now an eye-watering $6 trillion. To put this into perspective, if cybercrime were a country, it would be the world’s third-largest economy after the US and China.

The cybercrime landscape has changed dramatically over the last decade. For example, ransomware was 57 times more destructive in 2021 than in 2015. The average cost of data breaches continues to rise every year. Moreover, the COVID-19 pandemic has changed how we work – more people are working remotely and from their own devices. This means cybersecurity teams have less insight into what employees are doing, and as a result, Shadow IT is becoming an even bigger problem.  

But how do chief information security officers (CISOs) navigate this increasingly hostile cyber threat landscape in a world where IT security budgets are tightening? With the US economy on the brink of a recession, cybersecurity budgets are tighter than ever. As a result, CISOs need to do more with less and develop a new and robust IT security strategy. That’s what we’re going to be diving into today.  

Ways to Stretch IT Security Budgets

1. Get More From Your Existing Tools

As the number of data breaches has skyrocketed over recent years, so have the technologies we deploy to stop them. For example, the average small business uses between 15 and 20 IT security tools, while medium-sized companies use 50 to 60, and enterprises use over 130 IT security tools. But how many of these companies are using their cybersecurity tools to their full potential?  

It’s a good idea to evaluate and consolidate your existing cybersecurity tools. For example, you might find that one tool can do everything another tool can do or that you have a significant overlap in functionality across your arsenal. Getting rid of redundant tools not only saves money but also makes it easier to manage your cyber threat landscape. Put another way, the more tools you have, the higher the probability of misconfigurations, patch management issues, and privilege and password management issues.

If you’re unsure just how far specific tools can go, you can ask the vendor for free or low-cost training to help fill in the gaps. Moreover, opening a line of discussion with your IT security vendors can also give you valuable information about what tools can offer heightened protection in the future. For example, you might find that one vendor is imminently about to release a new security feature that addresses a critical security concern in your industry.  

2. Choose Automated Tools

Automation has come a long way in cybersecurity, and it’s even more potent today with cutting-edge technologies like artificial intelligence and machine learning. With automation technology, IT security systems can sense, study, and stop cybersecurity threats automatically and before they escalate into a fully-fledged security incident. Today we see automation, AI, and machine learning deployed across security tools, including network security tools like Network Penetration Testing tools, Network Intrusion Detection Systems, and in other areas like vulnerability management, security logging, and Security Information and Event Management (SIEM).  

However, it’s critical to note that most cybersecurity experts don’t recommend leveraging automation to replace staff. Automation can boost efficiency and reduce human errors, but it’s no match for a highly skilled security professional. Essentially, investing in automation frees up your existing cybersecurity staff to work on more complex tasks.

3. Make Your Case for More Funds

Getting the funds you need to provide effective network security can be challenging. As a CISO, you’re competing with other senior-ranking IT staff for your fair share of the IT budget.  

According to a Deloitte report, around 6% to 14% of the IT budget goes to cybersecurity for the average business. So, if your team is getting significantly less than this, you might want to consider why. Are your budget decision-makers unconvinced of the need for cybersecurity? Do they have doubts about its effectiveness? And what can you do to prove that more upfront investment is substantially cheaper than a costly cyber attack? 

When you go into budget discussions, you must have a good grip on the data and any upcoming concerns in the industry. For example, during COVID-19, we saw a massive spike in ransomware attacks. And today, Crime-as-a-Service (CaaS) tools are dramatically lowering the barrier to entry for would-be hackers. So much of cybersecurity is about anticipating your opponent’s move and being prepared before they strike. This means you have to pay attention to emerging trends just as much as current threats when detailing your cybersecurity budget.  

4. A More Creative Approach to Staffing

Employees will always be a dominant part of your IT security strategy, but they also make up a significant percentage of organizations’ IT security budgets. So, how do you ensure you’re spending your money wisely while getting the IT security skills you need? 

First, you need to set your sights beyond your local area. Skilled cybersecurity professionals are in high demand, but the talent pool is small. Moreover, the cybersecurity skills gap continues to widen every year. In the era of remote working, CISOs have never been in a better position to recruit security workers from different geographical areas.   

And on the point of the cybersecurity skills gap, companies need to be more creative in combating this issue. What do we mean by this? Well, many HR teams have a poor understanding of the skills or qualifications needed to be an effective IT security worker. As a result, they might filter out candidates without specific qualifications despite this being easy to remedy with training.  

You can recruit people with practical skills or look for people with these skills in-house. For example, technical aptitude, problem-solving skills, attention to detail, communication skills, fundamental computer forensics skills, and a desire to learn are crucial skills that often take a back seat to a specific certification in the recruiting process.  

 Additionally, you might find it’s more cost-effective to outsource parts of your cybersecurity function than to build the perfect team in-house.  

Final Thoughts on IT Security Budgets

The consequences of not investing in robust IT security are clear – costly fines, successful data breaches, and hefty reputational losses. CISOs know this, and so do the wider IT function. However, with an economic downturn looking ever more likely, CISOs will have to get more creative with their cybersecurity budgets or risk being left even more vulnerable.  

About Portnox
Portnox provides simple-to-deploy, operate and maintain network access control, security and visibility solutions. Portnox software can be deployed on-premises, as a cloud-delivered service, or in hybrid mode. It is agentless and vendor-agnostic, allowing organizations to maximize their existing network and cybersecurity investments. Hundreds of enterprises around the world rely on Portnox for network visibility, cybersecurity policy enforcement and regulatory compliance. The company has been recognized for its innovations by Info Security Products Guide, Cyber Security Excellence Awards, IoT Innovator Awards, Computing Security Awards, Best of Interop ITX and Cyber Defense Magazine. Portnox has offices in the U.S., Europe and Asia. For information visit http://www.portnox.com, and follow us on Twitter and LinkedIn.

Did Iranian Hackers Cause The Fire At An Israeli Power Plant?

Almost immediately after a fire broke out in an active power plant in southern Israel on July 14, 2022, an Iranian hacking group claimed responsibility. While it’s understandable why the group, which goes by the name #Altahrea, would want to boost their hacker profile by saying they caused the fire, there is ample evidence that they actually had nothing to do with it. 

The Orot Yosef power plant, part of the Edeltech group, is located in Ramat Hovav, Israel and has been in operation since 1989. 

Orot Yosef Power Plant

To understand why we believe this fire was not the work of hackers, let’s take a look at how this plant operates and what might have happened to cause the fire. (SCADAfence’s security team research lead Yossi Reuven also spoke about the attack to Techmonitor.ai)

Gas turbines can be used in conjunction with steam boilers by passing hot gasses from the boiler through a gas turbine to produce mechanical drive for electricity generation. This combined arrangement is commonly referred to as “cogeneration.” Cogeneration is thermodynamically the most efficient method for generating electrical power, and it is the method used by the Orot Yosef facility. 

Why is this important? Understanding the process used by a facility is crucial to determining what event took place. Gas turbines require a correctly proportioned air-to-fuel mixture to operate. Running a turbine too rich or too lean (too little air or too much air, respectively) can cause significant damage to the turbine. This means that if someone with malicious intent were able to compromise the air handling and run the turbine at maximum output with a lean mixture, there is a good chance of detonation, overheating, loss of power, and damage to the turbine. These issues would all be centered on the turbine housing and would amount to a far more catastrophic event.

We know that GE turbines were purchased and installed in the plant in 1989, as you can see in the image below from the Global Energy Observatory. (The GEO is a publicly available database of global energy information.)

GEO entry for Orot Power Plant

The Power Plant Fire 

Shortly after the fire began, the Iranian hacker group #Altahrea posted a photo on Telegram of a fire that looks to have started in the building known as the “Air Filter House”.

Most of the technology that resides inside the filter house is there to detect if the system is clogged. When a clog happens, it triggers the shutdown of the turbine to protect it from too much debris passing through the filter system, which can shorten the lifespan of the turbine.

Fire is a major risk for filter houses that have poor maintenance cycles. If filters are not replaced routinely, particulates and debris build up and all it takes for the filter cartridge pairs to go up in flames is a single spark. 

Based on open-source intel, it is likely that this facility is running an Electrostatic Precipitator.

Power plant information from open source database

An Electrostatic Precipitator (ESP) is typically used for pollution control to remove dirt from flue gasses in exhaust systems. Because this facility can use diesel as a secondary fuel for power generation, it is possible that an ESP is present.

Another detail that provides relevant information is a redacted picture of Shodan.io’s Industrial Webcrawler revealing a Phoenix Contact EMpro PLC running a Webserver exposed to the internet as shown below.

Shodan.io shows information on the Phoenix Contact EMpro

The EMpro is used to measure voltages and current in a power supply system. These measurements are used primarily to manage critical load balancing across a system, not for any critical process control of the filter house. If the device were compromised, it would only allow an individual to carry out relatively small actions, and only if the device had its Digital Output wired up.

This all raises the question: is it possible that a remote monitoring device was compromised in a way that allowed an adversary to trigger a discharge inside the filter house, which then ultimately started a fire? Possibly. However, it would require ideal conditions, including a lapse in maintenance and a buildup of debris. I would expect about the same probability if someone discarded a still-lit cigarette and the filter house drew it into the filter cartridge stage. In this case, that kind of accident is a more likely cause of the fire than the Iranian hackers who claimed credit.

To learn more about how the SCADAfence Platform can protect your OT network, request a demo today.

About SCADAfence
SCADAfence helps companies with large-scale operational technology (OT) networks embrace the benefits of industrial IoT by reducing cyber risks and mitigating operational threats. Our non-intrusive platform provides full coverage of large-scale networks, offering best-in-class detection accuracy, asset discovery and user experience. The platform seamlessly integrates OT security within existing security operations, bridging the IT/OT convergence gap. SCADAfence secures OT networks in manufacturing, building management and critical infrastructure industries. We deliver security and visibility for some of the world’s most complex OT networks, including Europe’s largest manufacturing facility. With SCADAfence, companies can operate securely, reliably and efficiently as they go through the digital transformation journey.