This just happened on a production server, so I need to recreate it in a testing environment. Since I’m partial to Vagrant I stand up 4 agent nodes and a master via the pe-build vagrant plugin. My Vagrantfile looks something like this:
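The real file runs to about a hundred lines; a trimmed-down sketch of the same idea (box name, PE version, and hostnames here are placeholders, not the originals) looks roughly like this:

```ruby
# Rough sketch of a pe_build-style Vagrantfile: one PE master plus four agents.
# Box, PE version, and hostnames are placeholders.
Vagrant.configure('2') do |config|
  config.pe_build.version = '3.1.0'

  config.vm.define :master do |node|
    node.vm.box      = 'centos-64-x64'
    node.vm.hostname = 'master.local'
    node.vm.provision :pe_bootstrap do |pe|
      pe.role = :master
    end
  end

  (1..4).each do |i|
    config.vm.define :"agent#{i}" do |node|
      node.vm.box      = 'centos-64-x64'
      node.vm.hostname = "agent#{i}.local"
      node.vm.provision :pe_bootstrap do |pe|
        pe.role   = :agent
        pe.master = 'master.local'
      end
    end
  end
end
```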
This error occurred on an ActiveMQ install that acts as the message queue for a 1000-node deployment of Puppet agents. To get terminology straight here: we have Puppet agents and AMQ agents running on these 1000 nodes, and they’re all queued through a single AMQ broker.
My first impression was that this error might be caused by having 1000 agents pinging a single AMQ broker, which the docs peg at roughly 800 agents per broker because of file descriptor limits.
I check locally on my test master:
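The check boils down to asking the shell for its open-file limits, something like this (commands reconstructed; the 1024 is what my master actually reported):

```bash
# soft and hard open-file limits for the current shell on the test master
$ ulimit -Sn
1024
$ ulimit -Hn    # hard limit (output omitted)
```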
My soft limit is 1024 open files, and after rando .jars and logs and stuff that works out to something closer to 800 usable descriptors. So this is right on the money with what the docs say about agents per ActiveMQ broker.
How am I going to recreate what a 1000-node environment looks like?
I’m limited to my laptop and its 16GB of memory, and I’m too lazy to stand up 1000 instances in AWS (and too poor). So an attempt has to be made to recreate this memory error on the 5 nodes running locally.
Given the above information, I open a shell and ssh into my master:
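The original commands are gone, but the gist (a reconstruction, not the exact snippet) was to get a root shell on the master and clamp the open-file limit down to almost nothing:

```bash
# reconstruction of the gist: root shell on the master, then drop the
# shell's open-file limit as low as it will go
$ vagrant ssh master
$ sudo su -
# ulimit -n 3
```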
Why 3? Because a process needs at least three file descriptors just for stdin, stdout, and stderr, so that’s about as low as the limit can realistically go.
So a quick ‘service pe-activemq restart’ and bang…
Oh shit, new PID, new proc instance. Damnit. I have to figure out something else to fake the resource limits here.
Since ulimit commands are shell-bound I can ssh into the master from another shell and run:
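Again a reconstruction of the idea rather than the exact commands: set the limit in the fresh shell, then restart the broker from that same shell so the new activemq process inherits it (the wrapper log path is my assumption for a stock PE layout):

```bash
$ vagrant ssh master
$ sudo su -
# ulimit -n 10                               # tried 3 first; see below
# service pe-activemq restart
# tail -f /var/log/pe-activemq/wrapper.log   # log location may vary by PE version
```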
The trick here is getting as close as possible to the lowest resource limit without going so low that the restart itself fails.
3 was actually too low, so I ran with 10 and was able to run a restart. You have to account for other files that may be bound to the process, like logs and .jars, since this is a bunch of Java. And as everyone knows, Java is basically the pig of programming languages.
So 10 fds worked and activemq has restarted. Let’s look at my logs to see what I’ve got:
That isn’t what I wanted to see: I’m looking for heap memory errors. So this clearly demonstrates the production issue isn’t an fds constraint at the filesystem level. Time to move on to other possibilities.
- Consider increasing the total heap memory available to the broker JVM
- Consider reducing the default JVM stack size of each thread using -Xss
- If your broker is embedded, ensure the hosting JVM has appropriate heap and stack sizes.
- Broker memory is not JVM memory, it’s only a constraint - the broker manages its own memory.
- Setting appropriate systemUsage memory: http://activemq.apache.org/producer-flow-control.html#ProducerFlowControl-Systemusage
- Hard limits exist on the number of agents a single broker can handle due to file descriptors and other hard system resources
Check your log for current JVM heap size:
INFO | jvm 1 | 2014/02/26 12:47:04 | Heap sizes: current=506816k free=487246k max=506816k
Try bumping this up to 1GB in the broker’s Java Service Wrapper config:
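On a stock PE install that lives in ActiveMQ’s wrapper.conf; the path below is my assumption and varies by PE version, so treat this as a sketch:

```properties
# e.g. /etc/puppetlabs/activemq/wrapper.conf (path is an assumption)
# maximum JVM heap size for the broker, in MB
wrapper.java.maxmemory=1024
```

Restart pe-activemq after changing it.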
If you still get
INFO | jvm 1 | 2014/02/26 12:47:38 | Exception in thread "ActiveMQBrokerService[ppm.prod.dc2.adpghs.com] Task-58" java.lang.OutOfMemoryError: unable to create new native thread
then throttle up your systemUsage limits per the producer flow control guideline linked above:
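In the broker’s activemq.xml that looks roughly like this (the values are illustrative, not recommendations):

```xml
<!-- inside the <broker> element of activemq.xml; limits are examples only -->
<systemUsage>
  <systemUsage>
    <memoryUsage>
      <memoryUsage limit="512 mb"/>
    </memoryUsage>
    <storeUsage>
      <storeUsage limit="2 gb"/>
    </storeUsage>
    <tempUsage>
      <tempUsage limit="1 gb"/>
    </tempUsage>
  </systemUsage>
</systemUsage>
```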
Hard Limits of AMQ
If you still get OOM errors you may be at a hard limit for agents per broker. ActiveMQ uses the amqPersistenceAdapter by default for persistent messages. Unfortunately, this persistence adapter (as well as the kahaPersistenceAdapter) opens a file descriptor for each queue. When creating large numbers of queues, you’ll quickly run into the limit for your OS.
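A rough way to see how close you are (the pe-activemq user name and the process match below are assumptions on my part):

```bash
# count descriptors held by the broker and compare against its per-process limit
lsof -u pe-activemq | wc -l
grep 'open files' /proc/$(pgrep -f activemq | head -1)/limits
```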
However, your logs will not register an OOM error as above; they’ll show:
ERROR | wrapper | 2014/03/13 03:32:39 | JVM exited while loading the application.
INFO | jvm 4 | 2014/03/13 03:32:39 | Error: Exception thrown by the agent : java.rmi.server.ExportException: Listen failed on port: 0; nested exception is:
INFO | jvm 4 | 2014/03/13 03:32:39 | java.net.SocketException: Too many open files
If that is your error, you could try upping the limit on file descriptors per process. You can do something similar to what I did above, or Google the procedure for your OS; a persistent version looks something like the sketch below.
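A minimal sketch using PAM limits, assuming the broker runs as the pe-activemq user (check your own install):

```
# e.g. /etc/security/limits.d/pe-activemq.conf -- user name and values are assumptions
pe-activemq   soft   nofile   16384
pe-activemq   hard   nofile   16384
```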
At this point, if none of the above resolved the issue, you should try standing up a second broker, especially if you’re running more than 1000 agents on a single broker instance.
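If you go that route, the usual pattern is a network of brokers; a bare-bones sketch of the connector on the first broker (the second broker’s hostname and port are placeholders):

```xml
<!-- in activemq.xml on the first broker; the remote host is a placeholder -->
<networkConnectors>
  <networkConnector uri="static:(tcp://broker2.example.com:61616)" duplex="true"/>
</networkConnectors>
```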