Thursday, September 18, 2014

Observation: A Key Debugging Tool

When I was writing about attraction turnstiles yesterday, one of my favorite troubleshooting stories came to mind. And this is one situation where the data and system logs didn't give us any insight. It was only when I went out to observe with my own eyes that the problem came to light.

Different Technologies
As I mentioned last time, I worked at a major theme park and was part of the project to convert from manual turnstiles to automated turnstiles. As another part of that project, we were moving away from mechanical turnstiles (the familiar 3-bar clicky things) to optical turnstiles. Optical turnstiles use infrared light beams to count people as they pass, and the beams are usually invisible to the people walking by (unless you're someone like me who specifically looks for them).

Optical turnstiles are a bit less accurate than mechanical ones, primarily because they don't act as a barrier, so people walk through them differently. As expected, the operating area was concerned that the optical turnstiles would not be accurate enough for their needs. But after much research, implementation testing, and on-site trials, it was determined that the accuracy was adequate for the business needs. And of course, one of the big benefits was that they were invisible and provided a better experience for our customers.

Data Discrepancies
At most locations, the optical turnstiles performed as expected. But at one location, we saw regular inconsistencies. This particular location happened to be a playground for kids -- lots of cargo nets to climb, caves, towers, bridges, and rock climbing walls. And the entrance and exit were combined.

Because of the nature of the location, the turnstiles counted in both directions, meaning they counted the people coming in as well as the people going out. This allowed us to keep a "running total" of how many people were in the location at any one time.
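
To make that concrete, here's a minimal sketch (in Python, with names I've made up for illustration -- this is not the actual system code) of how a running total can be derived from bidirectional lane events:

```python
from dataclasses import dataclass

@dataclass
class LaneEvent:
    lane: int        # which of the turnstile lanes fired
    direction: str   # "in" or "out"
    timestamp: str   # when the beam was broken

def running_total(events):
    """Net occupancy after each event: entries minus exits."""
    occupancy = 0
    for event in events:
        occupancy += 1 if event.direction == "in" else -1
        yield event.timestamp, occupancy

# With accurate counts, occupancy should drop back to 0 at closing time.
events = [
    LaneEvent(1, "in", "09:00:05"),
    LaneEvent(2, "in", "09:00:12"),
    LaneEvent(1, "out", "09:10:40"),
]
for timestamp, occupancy in running_total(events):
    print(timestamp, occupancy)   # ends at 1: one person still "inside"
```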

But there was a problem. This running total never went down to zero. It would gradually creep higher and higher during the day, and at the end of the day, it would show several hundred people who went in but never came out. Now this obviously wasn't the case (unless there was a portal to another dimension somewhere in the caves -- but we didn't have any complaints about missing persons, and no, not that Missing Persons (I still miss the '80s)).

Initial Troubleshooting
This sounded like a technical problem, so we got right on it. I checked the raw data from the location over a several-week period. There were 4 different "lanes" (separate turnstiles). I looked for gaps in the data that would indicate a sensor had gone offline, and I checked to make sure we got counts in both directions (both entry and exit counts) from all of the lanes.
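
For the curious, those checks amounted to something like this sketch (again Python for illustration, with a hypothetical record layout of (lane, direction, timestamp) -- the real system's storage format isn't important here):

```python
from datetime import timedelta

def find_gaps(records, max_gap=timedelta(minutes=30)):
    """Flag long silences per lane that might mean a sensor went offline.
    records: iterable of (lane, direction, timestamp) tuples."""
    by_lane = {}
    for lane, _direction, ts in sorted(records, key=lambda r: r[2]):
        by_lane.setdefault(lane, []).append(ts)
    gaps = []
    for lane, stamps in by_lane.items():
        for earlier, later in zip(stamps, stamps[1:]):
            if later - earlier > max_gap:
                gaps.append((lane, earlier, later))
    return gaps

def direction_totals(records):
    """Verify every lane reported counts in both directions."""
    totals = {}
    for lane, direction, _ts in records:
        totals[(lane, direction)] = totals.get((lane, direction), 0) + 1
    return totals   # expect 8 keys: 4 lanes x 2 directions
```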

There wasn't anything obvious in the data, so it was time to move on to Stage 2: verifying the counts manually.

Observation
I grabbed my trusty person counter (a handheld tally clicker), synchronized my watch with the server, got my clipboard, and headed out to the location.

I would not have been surprised to find a problem with the optical sensors. Since the sensors worked on infrared light, they could be "blinded" by the sun. If they were hit by direct sunlight, they wouldn't be able to make any counts at all. Since this was an outdoor location, I expected I might see some of that.

I was also there to observe the behavior of both the customers and the employees. If an employee stood in front of one of the sensors, it could affect the counts. We had seen locations where ropes were put up that blocked the sensors, or where swinging flags would cause the sensors to count even when no people were passing through.

Since there were 4 lanes at the location, I prepared myself to be there for a while. I needed to count at least one 15-minute interval for each lane, and I also wanted to count the entries and exits separately. When my watch indicated the beginning of an interval, I started manually clicking off the people entering through the first lane.

Realization
It was only after standing out there for 2 hours that my brain started to register what the real problem was. This was a kids' play area. It was a place where parents would take their children to expend their excess energy. Kids were climbing over everything, running (even though it wasn't allowed), jumping, swinging, and generally tiring themselves out.

And what happens when kids get tired? They want their parents to carry them. And that's exactly what was happening. Children were walking into the location (triggering an entry count) and being carried out (*not* triggering an exit count).

While I was watching this (and still clicking away), I did some quick calculations in my head:
Assuming 50% children and 50% adults, if 10% of the children walked in and were carried out, it would account for the discrepancy that we were seeing in the numbers.
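
Plugging in some illustrative numbers shows how quickly that adds up (these figures are my assumptions for the sketch, not the park's actual attendance):

```python
# Illustrative numbers only -- assumed, not the park's actual figures.
daily_entries = 6000                  # guests entering the playground per day
children      = daily_entries * 0.50  # 50/50 split of children and adults
carried_out   = children * 0.10       # 10% of children carried out by a parent
print(carried_out)                    # 300.0 -- "several hundred" missing exits
```
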
So, I had the answer even before verifying my manual numbers. I completed my counts and went back to my desk. And I found exactly what I expected to find: the manual counts that I took matched the automated counts on the server.

I was only able to identify the problem by making observations with my own eyes. I didn't find what I had been looking for (a technical fault), but by being at the location and seeing the normal behavior, I was able to figure out what was happening.

Wrap Up
Not all problems with our systems are technical. Sometimes there are human elements involved, and only by watching how people actually use the system can we uncover these types of issues. Sometimes it's an easy fix: if I see that one of my users wants to click on "Step 2" before completing "Step 1", that tells me I need to hide some things until Step 1 is complete -- guiding the user to success in every part of the application.

In the case of the turnstile discrepancy, there wasn't much we could do to alter the customer behavior. We did add a "Minus 1" button at the location: if an employee noticed kids being carried out, she could press the button to trigger an exit count. This wasn't a great solution since the employees were usually busy doing other things, like talking to the customers and answering questions. Other technical solutions would have been overly complex and expensive for this particular implementation.
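
If you're wondering what the button amounted to: it simply recorded a manual exit alongside the sensor counts, something like this sketch (my own representation, assuming exits are stored as events -- the actual wiring isn't described here):

```python
def press_minus_one(exit_events, timestamp):
    """Record one operator-observed exit (e.g., a child being carried out)."""
    exit_events.append({"direction": "out", "source": "manual", "time": timestamp})

exit_events = []
press_minus_one(exit_events, "16:42:10")  # employee saw a carried-out child
print(len(exit_events))                   # one extra exit folded into the totals
```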

But having an answer to the question was really key in this case -- this made the operations folks confident that the optical sensors were dependable from a technical perspective, and they could rely on the system as a whole.

We can't always find problems from looking in logs and checking data. Sometimes we need to go out and see things with our own eyes.

Happy Coding!
