Changing Cardinality of InfluxDB

John Wheeler
5 min read · Oct 16, 2021

In March of 2021 I published this article about loading Pi-hole metrics into InfluxDB with Telegraf. At the time of writing I was aware that I could be impacting the cardinality of the domain tag, since that tag would contain every domain ever requested. I vaguely remembered reading about InfluxDB’s retention policies:

The following sections cover how to create, alter, and delete retention policies. Note that when you create a database, InfluxDB automatically creates a retention policy named autogen which has infinite retention. You may disable its auto-creation in the configuration file.

So… that could be a problem. If InfluxDB is recording the requested domain as a tag, and the database has infinite retention, I’m pretty sure I’m gonna have a lot of tag values.
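
For reference, those create, alter, and delete operations are plain InfluxQL statements in InfluxDB 1.x. The database name below is the one from my setup, and at this point I hadn’t actually run any of these; it’s just what the knobs look like (note that dropping a retention policy also deletes the data stored under it):

> SHOW RETENTION POLICIES ON "rpi_monitoring"
> CREATE RETENTION POLICY "thirty_days" ON "rpi_monitoring" DURATION 30d REPLICATION 1 DEFAULT
> ALTER RETENTION POLICY "autogen" ON "rpi_monitoring" DURATION 90d
> DROP RETENTION POLICY "autogen" ON "rpi_monitoring"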

In August I started to see errors politely warning me that InfluxDB had exceeded its max-values-per-tag limit. InfluxDB decided that its helpful warnings were not sufficient for me to take action, so it began to drop data points with messages like the following:

Sep 23 23:57:14 pi-hole2 telegraf[32315]: 2021-09-24T04:57:14Z E! [outputs.influxdb] When writing to [http://192.168.1.26:8086]: received error partial write: max-values-per-tag limit exceeded (100008/100000): measurement="piholestats" tag="domain" value="guc3-spclient.spotify.com" dropped=33; discarding points
Sep 23 23:58:04 pi-hole2 telegraf[32315]: 2021-09-24T04:58:04Z E! [outputs.influxdb] When writing to [http://192.168.1.26:8086]: received error partial write: max-values-per-tag limit exceeded (100008/100000): measurement="piholestats" tag="domain" value="guc3-spclient.spotify.com" dropped=31; discarding points

I continued to ignore these messages (surely this will go away). When I finally began to research the error, I was looking through log files and found a few entries that seemed a bit odd.

Sep 24 00:03:04 pi-hole2 telegraf[32315]: 2021-09-24T05:03:04Z E! [outputs.influxdb] When writing to [http://192.168.1.26:8086]: received error partial write: max-values-per-tag limit exceeded (100008/100000): measurement="piholestats" tag="domain" value="nwbkcmr.home" dropped=16; discarding points
Sep 24 00:04:04 pi-hole2 telegraf[32315]: 2021-09-24T05:04:04Z E! [outputs.influxdb] When writing to [http://192.168.1.26:8086]: received error partial write: max-values-per-tag limit exceeded (100008/100000): measurement="piholestats" tag="domain" value="koaxxlx.home" dropped=32; discarding points
Sep 24 00:05:04 pi-hole2 telegraf[32315]: 2021-09-24T05:05:04Z E! [outputs.influxdb] When writing to [http://192.168.1.26:8086]: received error partial write: max-values-per-tag limit exceeded (100008/100000): measurement="piholestats" tag="domain" value="asstkowsv.home" dropped=31; discarding points
Sep 24 00:06:04 pi-hole2 telegraf[32315]: 2021-09-24T05:06:04Z E! [outputs.influxdb] When writing to [http://192.168.1.26:8086]: received error partial write: max-values-per-tag limit exceeded (100008/100000): measurement="piholestats" tag="domain" value="hyptudegq.home" dropped=40; discarding points

Notice the domain values in the above errors:

  • nwbkcmr.home
  • asstkowsv.home
  • hyptudegq.home

What were these? Malware? This finding diverted my attention from solving the original problem. I found this Reddit post pointing to this Pi-hole posting that shed some light on it. It looks like this has been a thing for a while.

If you type in a single-word search query, chrome needs to send a DNS request to check if this might be a single-word host name: For example, “test” might be a search for “test” or a navigation to “http://test". If the query ends up being a host, chrome shows an infobar that asks “did you mean to go to ‘test’ instead”. For performance reasons, the DNS query needs to be asynchronous.

Now some ISPs started showing ads for non-existent domain names ( http://en.wikipedia.org/wiki/DNS_hijacking ), meaning Chrome would always show that infobar for every single-word query. Since this is annoying, chrome now sends three random DNS requests at startup, and if they all resolve (to the same IP, I think), it now knows not to show the “did you mean” infobar for single-word queries that resolve to that IP.

This post takes a very deep dive into the Chromium code and extracts this comment:

Because this function can be called during startup, when kicking off a URL fetch can eat up 20 ms of time, we delay seven seconds, which is hopefully long enough to be after startup, but still get results back quickly.

This component sends requests to three randomly generated, and thus likely nonexistent, hostnames. If at least two redirect to the same hostname, this suggests the ISP is hijacking NXDOMAIN, and the omnibox should treat similar redirected navigations as ‘failed’ when deciding whether to prompt the user with a ‘did you mean to navigate’ infobar for certain search inputs.

trigger: “On startup and when IP address of the computer changes.”

We generate a random hostname with between 7 and 15 characters.
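
You can imitate the probe by hand. The hostname below is just something I made up; on a resolver that isn’t hijacking NXDOMAIN, the query should come back empty:

# Query a made-up single-word hostname, much like Chrome's probes do.
# An empty answer (NXDOMAIN) is the healthy result; getting an IP back
# would suggest the resolver is rewriting NXDOMAIN responses.
dig +short qwkzrptyna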

Now that I understand this behavior, I realize my cardinality problem is exacerbated and I likely need to investigate retention. As a short-term fix, I decided to update InfluxDB and raise the maximum number of values per tag.

Updating InfluxDB

I’ve reduced my fears that my system has some nefarious malware, so let’s look at how to increase the cardinality limit.

The number of unique measurement, tag set, and field key combinations in an InfluxDB bucket.

InfluxDB has a setting that limits the number of unique values (cardinality) for a given tag. This setting, max-values-per-tag, is set to 100000 by default.

I’ve moved my InfluxDB instance into Docker on my QNAP, so let’s figure out where the config file lives on disk in order to update this setting.

Inside the container the config file is located at /etc/influxdb. On disk, the copy I need to update lives under /share/CACHEDEV1_DATA/ssd/Container/container-station-data/lib/docker/volumes/influxdb_influxdb-etc/_data. Use your favorite editor and update the parameter with a new value.
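
If you’re not sure where Docker has placed a named volume, asking Docker directly beats hunting through the filesystem. The volume name below is the one from my setup and may differ on yours:

# Print the host path backing the volume that holds influxdb.conf
docker volume inspect --format '{{ .Mountpoint }}' influxdb_influxdb-etc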

# cat influxdb.conf 
[meta]
dir = "/var/lib/influxdb/meta"
[data]
dir = "/var/lib/influxdb/data"
engine = "tsm1"
wal-dir = "/var/lib/influxdb/wal"
max-values-per-tag = 300000
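
If you’d rather not edit the file at all, InfluxDB 1.x also maps its config settings to environment variables, so the same override can live in the container definition instead. This is an alternative I didn’t use, shown here only as a sketch:

# Equivalent override via an environment variable instead of influxdb.conf
docker run -d --name influxdb -e INFLUXDB_DATA_MAX_VALUES_PER_TAG=300000 influxdb:1.8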

I use Container Station to restart the container. I made this change back on September 24th.
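
If you aren’t on a QNAP, restarting from the command line works just as well; the container name below is the one from my Compose setup:

# Restart InfluxDB so it picks up the new max-values-per-tag value
docker restart influxdb_influxdb_1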

Time passes……….

Reviewing my logging, I’m not seeing any more of those messages.

Searching for error messages in Graylog

This looks promising. It’s been about three weeks and I don’t see that error. Looking at the growth over the last three weeks is another story, though.

# docker exec -it influxdb_influxdb_1 /bin/bash
root@40f77ff9d9a1:/# influx
Connected to http://localhost:8086 version 1.8.6
InfluxDB shell version: 1.8.6
> show databases
name: databases
name
----
rpi_monitoring
_internal
asus
> use rpi_monitoring
Using database rpi_monitoring
> show tag keys
...
name: google_dns
tagKey
------
domain
host
rcode
record_type
result
server
name: piholestats
tagKey
------
client
domain
forward
host
status
type
...
> show tag values cardinality with key = "domain"
name: google_dns
count
-----
1
name: piholestats
count
-----
258711

So… my change has only bought me time. I’m already up to 258711 unique domain values in only a few weeks. I’ll need to either raise that value again, look at aging out data, or rewrite the data so that I’m not storing those useless random 7-to-15-character domains.
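
One way to tackle that last option would be to collapse the random probe domains into a single tag value in Telegraf before they ever reach InfluxDB. This is only a sketch I haven’t deployed; it assumes the probes always show up as 7-to-15-character hostnames under my local .home suffix and that the measurement is the piholestats one from above:

# Collapse Chrome's random probe hostnames into one tag value so they
# stop inflating the cardinality of the "domain" tag.
[[processors.regex]]
  namepass = ["piholestats"]

  [[processors.regex.tags]]
    key = "domain"
    pattern = "^[a-z]{7,15}\\.home$"
    replacement = "chrome-random-probe"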

I’ll publish a follow-up when I figure out how I want to handle this.

John Wheeler

Security professional, Mac enthusiast, writing code when I have to.