Moving Beyond the Noise by Filtering Internet Pseudo Services
About a year ago, Censys began scanning public IPv4 addresses on 2,000 ephemeral ports in addition to ports with popular IANA-assigned services. We’ve found that an astounding number of services run on unassigned ports. We also found that many protocols run on ports assigned to other services.
In response, we’ve changed our approach to scanning so that we don’t make any assumptions about which protocol runs on each port. Instead, we try to detect the protocol and dynamically change the type of handshake we complete. As a result, our Universal Internet Dataset has grown significantly, and today the majority of services in the dataset are found on non-standard ports. However, while investigating unexpected services, we’ve found that many of the services on ephemeral ports aren’t “real” services, and we treat these differently.
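To make the idea of port-agnostic detection concrete, here is a minimal sketch that guesses a protocol from the first bytes a server sends back. It is an illustration only, not Censys's scanner: the function names `guess_protocol` and `choose_handshake` and the three fingerprints are assumptions chosen for the example, and servers that wait for the client to speak first would need a different approach.

```python
from typing import Optional


def guess_protocol(first_bytes: bytes) -> Optional[str]:
    """Guess a protocol from the first bytes a server sends (hypothetical fingerprints)."""
    if first_bytes.startswith(b"SSH-"):
        return "SSH"   # SSH identification string, e.g. "SSH-2.0-..."
    if first_bytes.startswith(b"HTTP/"):
        return "HTTP"  # HTTP status line, e.g. "HTTP/1.1 400 Bad Request"
    if len(first_bytes) >= 2 and first_bytes[0] in (0x15, 0x16) and first_bytes[1] == 0x03:
        return "TLS"   # TLS record: alert (0x15) or handshake (0x16), major version 3
    return None


def choose_handshake(port: int, first_bytes: bytes) -> str:
    """Pick a handshake from what the server said; the port number is deliberately ignored."""
    protocol = guess_protocol(first_bytes)
    return protocol if protocol is not None else "UNKNOWN"


# An SSH banner on port 8080 is still treated as SSH.
handshake = choose_handshake(8080, b"SSH-2.0-OpenSSH_8.2\r\n")
```

The point of the design is that `port` never influences the decision, so a protocol running on an unexpected port is classified the same way as one on its assigned port.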
Cutting Past the Noise of Internet Anomalies
At our core, Censys tracks the services that run on every public IPv4 address. As can be seen above, the vast majority of IPs host only a small number of services. Indeed, 60% of IPs in our dataset have only a single port open, and only 1% of hosts have more than 15 ports with listening services. Yet we noticed that nearly 40% of the services we uncover run on the 0.2% of hosts with more than 100 responsive ports.
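The skew described here is simple to compute from a list of per-host port counts. The sketch below uses made-up toy numbers, not Censys measurements, and the function name `share_of_services_on_super_hosts` is hypothetical. The threshold of 100 ports comes from the text.

```python
def share_of_services_on_super_hosts(ports_per_host, threshold=100):
    """Return (fraction of hosts, fraction of services) for hosts with more than `threshold` open ports."""
    total_hosts = len(ports_per_host)
    total_services = sum(ports_per_host)
    if total_hosts == 0 or total_services == 0:
        return 0.0, 0.0
    super_counts = [n for n in ports_per_host if n > threshold]
    return len(super_counts) / total_hosts, sum(super_counts) / total_services


# Toy data: 998 ordinary hosts with 1 to 3 open ports, plus 2 hosts with about 600 each.
toy_ports_per_host = [1] * 600 + [2] * 300 + [3] * 98 + [600, 594]
host_share, service_share = share_of_services_on_super_hosts(toy_ports_per_host)
print(f"{host_share:.1%} of hosts carry {service_share:.1%} of services")
```

In this toy population, two hosts out of a thousand account for roughly 44% of all services, which is the same shape of imbalance the paragraph above reports.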
Initially, we thought that these hosts might be network honeypots: software that pretends to be a legitimate host running fake services in order to detect and record scanners, connection attempts, intrusions, etc. However, we find that the services on these “super” hosts are not unique. Rather, the services tend to all be running the same protocol. In most cases, they respond with identical—or nearly identical—HTTP content. So what are these hosts?
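One way to check whether a host's services are "not unique" is to hash each port's response body and measure how many ports share the most common one. This is a rough sketch under stated assumptions: the names `dominant_response_share` and `looks_like_pseudo_host` and the 0.95 cutoff are hypothetical, and an exact hash only catches identical bodies, so nearly identical content would need normalization first.

```python
import hashlib
from collections import Counter


def dominant_response_share(bodies_by_port):
    """Fraction of ports whose response body equals the most common body on the host."""
    if not bodies_by_port:
        return 0.0
    digests = Counter(hashlib.sha256(body).hexdigest() for body in bodies_by_port.values())
    _, most_common_count = digests.most_common(1)[0]
    return most_common_count / len(bodies_by_port)


def looks_like_pseudo_host(bodies_by_port, min_ports=100, min_share=0.95):
    """Flag hosts with more than `min_ports` ports that mostly serve the same body (assumed cutoffs)."""
    return (
        len(bodies_by_port) > min_ports
        and dominant_response_share(bodies_by_port) >= min_share
    )


# Toy host: 200 ports return the same error page, one port returns something else.
toy_host = {port: b"<html>Error</html>" for port in range(8000, 8200)}
toy_host[22] = b"SSH-2.0-example"
flagged = looks_like_pseudo_host(toy_host)
```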
We begin our exploration in BigQuery — a tool that lets us run SQL queries against the Universal Internet Dataset. Researchers can replicate our results through the Censys Research Access program.
Let’s start by analyzing these “super” hosts in relation to the network they belong to.
The query below extracts the services that belong to “super” hosts from a snapshot taken on February 1, 2021 and groups them by origin Autonomous System (AS). For each AS, it then compares that number of services to the total number of services in the snapshot, and to the total number of pseudo services in the snapshot.
DECLARE total_services INT64;
DECLARE total_pseudo_services INT64;

SET total_services = (
  SELECT SUM(ARRAY_LENGTH(services))
  FROM `censys-io.universal_internet_dataset.universal_internet_dataset`
  WHERE DATE(snapshot_date) = "2021-02-01"
);

SET total_pseudo_services = (
  SELECT SUM(ARRAY_LENGTH(services))
  FROM `censys-io.universal_internet_dataset.universal_internet_dataset`
  WHERE DATE(snapshot_date) = "2021-02-01"
    AND ARRAY_LENGTH(services) >= 100
);

SELECT
  autonomous_system.asn AS asn,
  autonomous_system.name AS name,
  COUNT(*) AS total,
  COUNT(*) / total_services AS percent_of_total,
  COUNT(*) / total_pseudo_services AS percent_of_pseudo
FROM `censys-io.universal_internet_dataset.universal_internet_dataset`
JOIN UNNEST(services) AS service
WHERE DATE(snapshot_date) = "2021-02-01"
  # service.truncated indicates the service belongs to a “super” host
  AND service.truncated
GROUP BY asn, name
ORDER BY COUNT(*) DESC;
Immediately, we see some interesting results. A single AS, Incapsula (AS 19551), is responsible for over 20 percent of all of the services Censys catalogs on the Internet! Furthermore, services on their network make up over 60 percent of all pseudo services.
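For readers without BigQuery access, the same per-AS aggregation can be sketched over in-memory records. This is a simplified illustration with toy data: the record layout loosely mirrors the field names used in the query, the ASNs are reserved documentation values, and the function name `pseudo_service_shares` is hypothetical. Unlike the query, it derives the pseudo-service total from the `truncated` flag rather than from the 100-service threshold.

```python
from collections import Counter


def pseudo_service_shares(hosts):
    """Count truncated services per AS and express them as shares of all and of pseudo services."""
    total_services = sum(len(host["services"]) for host in hosts)
    per_as = Counter()
    for host in hosts:
        key = (host["autonomous_system"]["asn"], host["autonomous_system"]["name"])
        per_as[key] += sum(1 for service in host["services"] if service["truncated"])
    total_pseudo = sum(per_as.values())
    rows = []
    for (asn, name), count in per_as.most_common():
        if count == 0:
            continue  # an AS with no truncated services does not appear in the output
        rows.append({
            "asn": asn,
            "name": name,
            "total": count,
            "percent_of_total": count / total_services,
            "percent_of_pseudo": count / total_pseudo,
        })
    return rows


# Toy records (not real data): one CDN-like host and two ordinary hosts.
toy_hosts = [
    {"autonomous_system": {"asn": 64500, "name": "ExampleCDN"},
     "services": [{"truncated": True}, {"truncated": True}, {"truncated": True}]},
    {"autonomous_system": {"asn": 64501, "name": "ExampleISP"},
     "services": [{"truncated": True}]},
    {"autonomous_system": {"asn": 64501, "name": "ExampleISP"},
     "services": [{"truncated": False}]},
]
rows = pseudo_service_shares(toy_hosts)
```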
We sample a few IP addresses from Incapsula’s network to investigate further. In the Censys dataset, 22.214.171.124, 126.96.36.199, and 188.8.131.52 have 1346, 1060, and 1402 ports open respectively. On all but one or two open ports, each IP serves the following webpage.
Imperva’s error documentation states that this error code, Error 22, is generated when a client attempts to connect to an IP without using a valid name. We run a quick experiment to validate this behavior.
❯ dig docs.imperva.com
;; ANSWER SECTION:
docs.imperva.com. 227 IN CNAME 4xu3l6t.x.incapdns.net.
4xu3l6t.x.incapdns.net. 30 IN A 184.108.40.206
Loading the IP address 220.127.116.11 directly in a browser presents the familiar Error 22 page, confirming that the error results from connecting without a valid name. We conclude that these hosts are a part of Imperva’s Web Application Firewall.
As for the remaining “super” hosts, research by Izhikevich et al. suggests that they are primarily made up of middleboxes and user-space firewalls. Often, these middleboxes and firewalls respond on hundreds or even thousands of ports in an effort to thwart scan or attack attempts.
We randomly sample a handful of IP addresses from the set of “super” hosts identified earlier and see familiar content. For example, the IP 18.104.22.168 responds with a 400 error code and the following HTML content on almost every port.
Similar variations are seen on other hosts, like 22.214.171.124.
Even outside of Incapsula’s network, the vast majority of “super” hosts appear to be made up of cache servers which will respond to HTTP requests across many non-standard ports.
We find other types of pseudo services, however. Another class more closely resembles honeypots. For example, the IP 126.96.36.199 appears to be running Portspoof: software that emulates real services running on all 65,535 TCP ports.
Portspoof and similar software are used to make it expensive for attackers to perform reconnaissance and identify the services a host is actually running. Portspoof's documentation states that it can take scanners up to 8 hours to complete a single 65k port scan against a host! While security through obscurity is a long-rejected technique, Portspoof forces malicious entities to spend significant resources to find out what’s actually running on a host.
Regardless of their implementation, pseudo services pose a problem in our data because they make up such a large percentage of total services, yet belong to only 0.2% of hosts. Additionally, as we’ve seen with Portspoof, pseudo services may not reflect a real service that is actually running on a host.
How Censys Handles Pseudo Services
Pseudo services generate undue noise for the amount of value they provide. For example, the total number of HTTP services we observe on the Internet can vary by over 50% depending on whether pseudo services are included in the query.
To help customers filter pseudo services, we’ve added a flag called truncated to our service records. This flag indicates that the service has had some of its structured data truncated and is running on a “super” host.
SELECT COUNT(*)
FROM `censys-io.universal_internet_dataset.universal_internet_dataset`
JOIN UNNEST(services) AS service
WHERE DATE(snapshot_date) = "2021-02-01"
  # Filter out pseudo services
  AND service.truncated = false
  AND service.service_name = 'HTTP';
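The effect of the flag on a count can be illustrated with a small sketch over toy records. The field names `service_name` and `truncated` follow the query; the function name `count_http_services`, the record layout, and the data are assumptions made for the example.

```python
def count_http_services(hosts, include_pseudo):
    """Count HTTP services, optionally skipping those flagged as truncated."""
    count = 0
    for host in hosts:
        for service in host["services"]:
            if service["service_name"] != "HTTP":
                continue
            if service["truncated"] and not include_pseudo:
                continue  # pseudo service on a "super" host
            count += 1
    return count


# Toy records: one ordinary web server and one host answering HTTP on 150 ports.
toy_hosts = [
    {"services": [
        {"service_name": "HTTP", "truncated": False},
        {"service_name": "SSH", "truncated": False},
    ]},
    {"services": [{"service_name": "HTTP", "truncated": True} for _ in range(150)]},
]
with_pseudo = count_http_services(toy_hosts, include_pseudo=True)
without_pseudo = count_http_services(toy_hosts, include_pseudo=False)
```

Even in this tiny example, a single "super" host dominates the unfiltered count, which is why the filtered and unfiltered totals can diverge so sharply.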
Additionally, pseudo services are refreshed at a slower rate than other, higher-value services. This way, Censys is able to dedicate more resources toward discovering and refreshing the assets our customers are most interested in.
Censys Universal Internet Dataset
The Censys Universal Internet Dataset (UIDS) is the industry-leading dataset of hosts and services on the Internet. Organizations use UIDS to track sophisticated threats and defend complex attack surfaces. Get access to the Universal Internet Dataset and discover “super” hosts and much more. Contact us and request a demo today!
Are you interested in doing research? We also provide access for researchers. See if you qualify here.
LZR: Identifying Unexpected Internet Services. Liz Izhikevich, Renata Teixeira, Zakir Durumeric. USENIX Security Symposium 2021.
Hudson Clark is a Software Engineer at Censys. He is focused on building systems at scale and is passionate about delivering value and insights to customers. Hailing from Ann Arbor, he holds a Bachelor’s Degree in Computer Science from the University of Michigan.