Tuesday, March 29, 2011

Day 18: POSSCON

The open source conference in Columbia was a great experience for a student like me, who is getting ready to graduate and looking for an interesting company that uses database technologies.  One such company, BackType, uses database and data mining technologies to deal with large data systems.  The speaker from BackType, Nathan Marz, explained the eight properties of large data systems as follows:
  1. Robust
  2. Low latency reads and updates
  3. Scalable (horizontally, adding more machines as the data size grows)
  4. General (abstracting whenever possible)
  5. Extensible (able to add new features)
  6. Ad-hoc analysis (this is where the data mining comes into play)
  7. Minimal maintenance
  8. Debuggable
Marz also described dividing the system into two layers, the batch layer and the speed layer.  Using a tool called Hadoop, one can build this structure and use message passing and filters to create incremental algorithms that check for false positives when creating batch views.  Although the batch layer is slow, with high latency as well as high throughput, it contains the master copy of the data.  The speed layer compensates for that latency by using more complex algorithms and transient data (meaning that the data is discarded from the speed layer once it has been passed to the batch layer).  In a sense, both layers gather data, and their results are merged to create a real-time view.  The databases that hold the data are mostly read/write databases rather than the widely used relational databases that I have always worked with.  However, one can still use MySQL, for example, to query the database.  Backups, as well as full recoveries, can be done using the batch layer while the speed layer continues to append more data to its log.
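
To make the two-layer idea concrete for myself, here is a minimal sketch in Python of how a batch view and a speed layer might be merged into a real-time view.  This is not BackType's code and it does not use Hadoop; the page-view counting task and the names build_batch_view, SpeedLayer, and realtime_view are all my own assumptions for illustration.

```python
from collections import Counter


def build_batch_view(master_log):
    """Recompute the view from the full master copy (slow but complete)."""
    return Counter(event["url"] for event in master_log)


class SpeedLayer:
    """Hypothetical speed layer: holds only events the batch view has not seen."""

    def __init__(self):
        self.recent = Counter()

    def record(self, event):
        # Incremental update: one cheap operation per incoming event.
        self.recent[event["url"]] += 1

    def discard(self):
        # Transient data: drop it once the batch layer has absorbed it.
        self.recent.clear()


def realtime_view(batch_view, speed_layer, url):
    """Merge both layers to answer a query with up-to-date counts."""
    return batch_view[url] + speed_layer.recent[url]


# The master copy lives with the batch layer (made-up sample data).
master_log = [{"url": "a.com"}, {"url": "a.com"}, {"url": "b.com"}]
batch_view = build_batch_view(master_log)

# New events arrive after the last batch run and go to the speed layer.
speed = SpeedLayer()
new_events = [{"url": "a.com"}, {"url": "c.com"}]
for event in new_events:
    speed.record(event)

print(realtime_view(batch_view, speed, "a.com"))  # 2 from batch + 1 from speed

# The next batch run absorbs the new events, so the speed layer can forget them.
master_log.extend(new_events)
batch_view = build_batch_view(master_log)
speed.discard()

print(realtime_view(batch_view, speed, "a.com"))  # same answer, now batch only
```

The point of the sketch is that a query gets the same answer before and after the batch run; only which layer supplies the newest events changes.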

I was also able to sit down with two teachers from another school and a representative from Oracle.  I learned about Oracle's diverse range of software products; the representative said one could find Oracle just about anywhere there is IT.  We also got into a discussion about the usefulness of VirtualBox, which I currently use to host a SQL Server 2008 server on Windows Server 2003.  The representative was very appreciative of the compliments we paid the product.  The two teachers mentioned an idea to make the software communicate with other instances of VirtualBox.  I only know of a VirtualBox feature for communicating with the host computer, not with other instances.

As for our project, we have been able to simplify the problem by removing an inner portion of the linear regression equation.  Currently, we only need to calculate the weight of each food group for use in the linear regression equation.  Alex, with her PyGTK skills, reflected this change in the GUI when she presented it to us.
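
As a rough illustration of what using a weight per food group in a linear equation could look like, here is a small Python sketch.  The food group names, the weight values, the intercept, and the predict function are all made up for the example; they are not our project's actual equation or data.

```python
def predict(intercept, weights, servings):
    """Linear model: intercept plus the weighted sum over food groups."""
    return intercept + sum(weights[group] * servings[group] for group in weights)


# Hypothetical weights, one per food group (illustrative values only).
weights = {"grains": 0.5, "vegetables": 1.0, "protein": 2.0}

# Hypothetical input: servings eaten from each food group.
servings = {"grains": 2, "vegetables": 3, "protein": 1}

score = predict(0.0, weights, servings)
print(score)  # 0.5*2 + 1.0*3 + 2.0*1 = 6.0
```

Once the weights are known, evaluating the equation is just this weighted sum, which is why calculating the weights is the only remaining piece.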
