We need to be 10,000 times faster.
aah@nofs.navy.mil wrote:
>
.
> Michael and Chris had lots of problems with Postgres
> and the Mark III, so I am hesitant on suggesting a
> similar exercise with 100x more data. As I've said
> before, I intend to make a 'clean' database here with
> just those objects with both V&I detections,
I learned something in that exercise. The problems we had
were not with Postgres and Linux. They work fine. Well at
least there is nothing that works much better.
Looking at it now I'd
say the problem was in what we call "The concept of operations".
These are the human procedures that surround the use of any
software. Yes, we did have some problems with my implementation.
My first attempt was to use SQL and Perl for catalog matching.
This was at best "dead dog slow". Next I used C and did the
matching operation using flat files then imported the matched
data back into Postgres. This gave (just) acceptable
performance and was ~100x faster.
Still, I think the whole concept was wrong. Today I'll ask
myself, "Could this process scale to 10,000 times the data
rate if I had the money to throw at it?" The current process
would not. Doing so would require getting Michael a computer
10,000 times faster. No amount of money could buy that.
If the process is to scale there must be a way to put 10,000
PCs to work.
On the camera and data reduction to star lists front we scale
well. Each camera has one operator and a few PCs. If Tom were
to build 200,000 systems instead of "only" 20 we could still reduce
all the data to star lists given the current method.
Now suppose Michael gets 200,000 CD-ROMs mailed to him every day.
He'd get rich by selling the plastic to a recycler but that's about it.
I think the way out is to adopt what works for data reduction. That
is to decentralize the process and add a person and computer power
each time you add a camera system. But we want a _central_ database
that we can search. We don't want to have to search 200,000
databases.
I think we have to redefine what a central database is. First off,
its capacity will be constrained by the resource (time and money)
limits of its operator. Let's say it has capacity = X. Now if
we have N camera sites, each with an amount of data = S, then
an optimal plan is for each site to send in (on average) no more
than the "best" X/(N*S) fraction of its data. By my definition this
would be the best we can do.
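
To make the arithmetic concrete, here is a minimal Python sketch of the
X/(N*S) rule. The function names and the example numbers (100 GB central
capacity, 50 GB per site) are made up for illustration; they are not
measurements from any TASS site.

```python
def per_site_fraction(x_capacity, n_sites, s_per_site):
    """Fraction of its own data each site may send so the total fits in X."""
    if n_sites <= 0 or s_per_site <= 0:
        raise ValueError("n_sites and s_per_site must be positive")
    # If the central site could hold everything, no site has to edit at all.
    return min(1.0, x_capacity / (n_sites * s_per_site))


def per_site_budget(x_capacity, n_sites, s_per_site):
    """Amount of data (same units as X and S) each site may send."""
    if n_sites <= 0 or s_per_site <= 0:
        raise ValueError("n_sites and s_per_site must be positive")
    return min(s_per_site, x_capacity / n_sites)


if __name__ == "__main__":
    # Hypothetical numbers: X = 100 GB, S = 50 GB per site.
    for n in (20, 200, 200000):
        print(n, per_site_fraction(100, n, 50), per_site_budget(100, n, 50))
```

As N grows with X fixed, the fraction shrinks, which is the "edit more
harshly" effect: the central load stays the same.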
You could say that designing a system that could scale by a factor
of 10,000 is unreasonable, but no. We are too slow by on the order of
10x today. It takes Michael so long to build a database that he only
does it every few months at best. We need to be much faster than
we are. Now we know the data rate will go up by a factor of 100.
So (10x times 100x) we need to have a system (system = software plus
human procedures) that is 1000x faster. I only missed by one order
of magnitude.
So what to do? It depends on what the values for X, S, and N turn
out to be. They will be in constant flux but we can guess. Also
what to do depends on how you define "best data". I doubt we
will all agree on what defines "best".
Here is how I'd like to address the problem and get us going
1000X faster: 1) Every site maintains a database like
Michael's current database but only populated with data from
the local site. You can use Linux and Postgres or Windows and
flat files, I don't care (my vote would go to Linux and MySQL).
2) Next, everyone periodically (weekly?) computes a version of
tenXcat. Choose parameters such that the size of the tenXcat
catalog and supporting raw data is about X/(N*S) of your site's
data. Maybe to do this you build a twentyXcat or fiveXcat, I don't
know. 3) You send your "nXcat" along with supporting raw data to
the central site.
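
Step 2 amounts to trimming a site's catalog until it fits a size budget.
The sketch below shows one way that could look in Python. It does not
reproduce how tenXcat is actually built: the record fields (`n_obs`,
`size_bytes`), the function name `build_nxcat`, and the choice of
"most observations first" as the ranking are all assumptions for
illustration, since what counts as "best" is left open above.

```python
def build_nxcat(stars, budget_bytes, score):
    """Pick the highest-scoring stars whose combined size fits the budget.

    stars        -- list of dicts, each with a "size_bytes" entry
                    (catalog row plus supporting raw data; hypothetical)
    budget_bytes -- this site's share of the central capacity
    score        -- function star -> number; larger means "better"
    """
    chosen, used = [], 0
    for star in sorted(stars, key=score, reverse=True):
        if used + star["size_bytes"] > budget_bytes:
            continue  # too big for what is left; a smaller star may still fit
        chosen.append(star)
        used += star["size_bytes"]
    return chosen


if __name__ == "__main__":
    stars = [
        {"id": "a", "n_obs": 40, "size_bytes": 400},
        {"id": "b", "n_obs": 10, "size_bytes": 100},
        {"id": "c", "n_obs": 25, "size_bytes": 250},
    ]
    picked = build_nxcat(stars, 500, score=lambda s: s["n_obs"])
    print([s["id"] for s in picked])
```

Passing the ranking in as `score` keeps the trimming logic the same even
if sites (or central databases) disagree about what "best" means.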
This system _does_ scale. If we had 200 Mk IV systems and
"X" remained constant, Michael would still be no more
overworked than at present _but_ the quality of his data would
go up with the number of Mk IV systems added as each site would
be forced to more harshly edit its data.
Now I will toss in one more idea. We could have more than one
"central" database. A contradiction in terms? No. Each
would have a different interest. Let's say Michael likes to
collect quality light curves so he invites camera operators to
harshly edit their data to select for only low error photometry.
Now let's say I want to discover as many variable stars as I
can using the combined data from all TASS cameras. I don't
care about well observed bright non-variables and I have (due to
limits on my time and budget) only capacity X[i] (read "X
sub i") to hold data. So I invite all camera sites to send me
harshly edited data selected for many observations and high sigma
magnitude. Some third database operator could ask for _all_
data within a set of small bounding boxes of ra, dec. Who knows
what for. (asteroid or nova hunting?)
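
The three requests above differ only in the selection rule each central
operator hands to the camera sites. A minimal Python sketch, assuming
each star is a record with hypothetical fields `mean_err`, `n_obs`,
`sigma_mag`, `ra` and `dec`; the thresholds are placeholders, not
recommended values.

```python
def low_error(star, max_err=0.02):
    """Light-curve collector: keep only low-error photometry."""
    return star["mean_err"] <= max_err


def likely_variable(star, min_obs=30, min_sigma=0.1):
    """Variable hunter: many observations and a large scatter in magnitude."""
    return star["n_obs"] >= min_obs and star["sigma_mag"] >= min_sigma


def in_boxes(star, boxes):
    """Box request: keep everything inside any (ra0, ra1, dec0, dec1) box.

    Boxes that wrap across RA = 0/360 degrees are not handled here.
    """
    return any(
        ra0 <= star["ra"] <= ra1 and dec0 <= star["dec"] <= dec1
        for ra0, ra1, dec0, dec1 in boxes
    )


def select(stars, wanted):
    """Apply one central database's rule to a site's local catalog."""
    return [s for s in stars if wanted(s)]


if __name__ == "__main__":
    site = [
        {"id": 1, "ra": 10.0, "dec": 5.0, "n_obs": 50,
         "mean_err": 0.01, "sigma_mag": 0.02},
        {"id": 2, "ra": 200.0, "dec": -3.0, "n_obs": 45,
         "mean_err": 0.05, "sigma_mag": 0.30},
    ]
    print([s["id"] for s in select(site, low_error)])
    print([s["id"] for s in select(site, likely_variable)])
    print([s["id"] for s in select(site, lambda s: in_boxes(s, [(195, 205, -5, 0)]))])
```

The site-side code is identical for every central database; only the
rule passed to `select` changes.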
I think this plan gives us what we want: A set of specialized
(to varying degrees) databases that each draw input from all
TASS cameras. The plan also scales to any number of cameras.
Also it is not a lot of work to implement. The database software
only needs to be written once and can be re-used at each central
site and at each camera site. The databases really _are_ all
alike. Each is specialized only by the content it holds. Camera
site databases hold only data from their own site (plus a system
wide catalog) and the central databases hold only data that is
of interest to the operator of that database.
Now if the above "concept of operations" seems like the way to go
then we need to make up one more standardized interface. That
is the database to database transfer method. I doubt all databases
will be Linux/Postgres so the standard cannot depend on any of
that system's built-in methods. We will need a file format. I
will propose FITS ASCII tables.
--Chris Albertson home: chrisja@jps.net
Redondo Beach, California work: calbertson@logicon.com