sprockets, sockets, grommets & gaskets (randomdreams) wrote,

I ran a test for about six days straight that generated 300MB of data or so, which I needed to filter down. My first filter, a simple delete-unnecessary-junk Excel macro, pared that down to 50MB. But then I had a problem that I'm still trying to handle.
I have four independent variables that I'm sweeping through.
For each in (a-list):
    For each in (b-list):
        For each in (c-list):
            For each in (d-list):
                do stuff and measure stuff.

That's somewhere between 100K and 1M data points. I'm not really sure.
The issue is that at some point in each d-sweep an event happens that results in a measurable outcome, and I have to figure out what everything else was in the measurement just before the event happened. This is complicated by the event sometimes happening before the first measurement (in which case there simply is no answer) and sometimes happening after the last measurement (again, no answer). And once everything is in a giant Excel spreadsheet, it's very difficult to tell a conditional formula or macro "hey, only pay attention to events that happen within a d-sweep, not ones that happen at the boundaries between a, b, or c sweeps." Sometimes the event happens multiple times during the sweep: it's in its no-event state, then in its event state, then back to no event, then back to event. Everything I've managed to write so far looks at the overall pile of data and triggers off every no-event-to-event transition, regardless of whether it's a repeat event within one sweep or a transition that straddles two sweeps.
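The "only look inside one d-sweep" rule is easier to state outside a spreadsheet. Here is a minimal Python sketch, not what I actually ran. It assumes the rows are in the order they were measured, that each row is a dict with hypothetical keys `a`, `b`, `c`, `d`, and a boolean `event`, and that only the first no-event-to-event transition in a sweep counts. That last assumption is a judgment call, not something the data settles.

```python
from itertools import groupby
from operator import itemgetter


def last_rows_before_event(rows):
    """Return, for each d-sweep, the last no-event row before the first event.

    A d-sweep is a run of consecutive rows sharing the same (a, b, c).
    Sweeps with no answer are skipped:
      - the event is already present at the first measurement, or
      - the event never shows up during the sweep.
    """
    results = []
    same_sweep = itemgetter("a", "b", "c")
    for _, sweep in groupby(rows, key=same_sweep):
        previous = None  # reset per sweep, so boundaries never count
        for row in sweep:
            if row["event"]:
                if previous is not None:
                    results.append(previous)
                break  # ignore any later transitions in this sweep
            previous = row
    return results
```

Because `previous` is reset at the start of every group, a sweep that ends in the no-event state followed by a sweep that starts in the event state never gets counted as a transition.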
When I get done filtering I have a bunch of sparse columns: thirty or fifty blank rows followed by a row full of stuff. It took me a while to figure out how to condense that down. (In Excel, the result of a formula is always some value, even if it's "" or #N/A or whatever: it may look empty but it is not empty, so I had to mess around a little to figure out how to select all contents except for numbers and then delete those.)
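The same condensing step can be sketched in Python, where "looks empty but isn't" becomes an explicit test. This assumes each row is a list of cells that may hold numbers, numeric strings, "", "#N/A", or None. The names `to_number` and `condense` are made up for illustration.

```python
import math


def to_number(cell):
    """Return the cell as a float, or None if it is blank, an error marker, or text."""
    try:
        value = float(cell)
    except (TypeError, ValueError):
        return None
    # Treat NaN and infinities as "not a real measurement".
    return value if math.isfinite(value) else None


def condense(rows):
    """Keep only the rows that contain at least one real number."""
    return [
        row for row in rows
        if any(to_number(cell) is not None for cell in row)
    ]
```

A row survives if any cell in it is a finite number, so a legitimate zero is kept while "" and #N/A are not.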
Then I have this big mass of condensed, somewhat good data, only maybe 2000 lines or so, that I have to go through by hand to choose which lines to delete because they're clearly bad data. Then I have to go through what's left, break it down by one of the sweep criteria, and graph it.
There are so, so many places for mistakes of the copied-the-wrong-value sort here, and so many places where I have to make judgments about what constitutes good or bad data.
I'm trying to apply some basic statistics, stuff like throwing away everything more than 3 sigma out, for instance. But I'm not sure that's valid. I'm not sure my assumptions about outliers are good, or about duplicates. My coworker, who might have better ideas, keeps looking at these giant masses of data, saying "that's interesting," and getting off on tangents about how I'm measuring all these things, never answering my questions about data quality. My manager keeps sending me email asking for finished analyses but never answers my questions about what constitutes good assumptions. (I've just started sending him my analyses with all the questions in there, so he has to read them to get to the conclusions.)
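For what it's worth, the 3-sigma rule mentioned above is only a few lines once it's written down. This is a minimal sketch, assuming the values in one group are plain numbers and that a mean-and-standard-deviation cut is even the right model, which is exactly the part I'm unsure of. The function name `split_by_sigma` is hypothetical. It returns the rejected points instead of silently dropping them, so they can go into the report alongside the assumptions.

```python
import statistics


def split_by_sigma(values, n_sigma=3.0):
    """Split values into (kept, rejected) using a mean +/- n_sigma * stdev cut.

    Uses the sample standard deviation, so at least two values are needed.
    """
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)
    kept, rejected = [], []
    for value in values:
        # If sigma is 0, every value is identical and nothing is rejected.
        if sigma > 0 and abs(value - mean) > n_sigma * sigma:
            rejected.append(value)
        else:
            kept.append(value)
    return kept, rejected
```

One caveat that follows from the arithmetic: with the sample standard deviation, no point in a group of n values can be more than (n-1)/sqrt(n) sigma from the mean, so in groups of 10 or fewer a 3-sigma cut never rejects anything. The outlier also inflates the very sigma it's being judged against.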

This is stuff I somewhat love. I suppose if I include all my questions and detail all my assumptions, that's the best I can do. It'd be nice to have more direction. But it's interesting learning a lot about data quality.

This entry was originally posted at https://randomdreams.dreamwidth.org/64550.html. Please comment there using OpenID.
Tags: #n