dbms-notes: writing blocks to disk: dbms

Recently I have been trying MongoDB and other NoSQL databases for a project.
So I decided to compile some of my notes into posts here.

MongoDB: Download, Install and Configuration

Installing MongoDB on Ubuntu

To install MongoDB on Ubuntu, you can use the packages made available by 10gen, following the steps below:

(1) add a line to your /etc/apt/sources.list

deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen

(2) Add the 10gen GPG key; otherwise apt will disable the repository (apt uses signing keys to verify that a repository is trusted, and disables untrusted ones).

jdoe@quark:/etc/apt$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
Executing: gpg --ignore-time-conflict --no-options --no-default-keyring --secret-keyring /etc/apt/secring.gpg --trustdb-name /etc/apt/trustdb.gpg --keyring /etc/apt/trusted.gpg --primary-keyring /etc/apt/trusted.gpg --keyserver keyserver.ubuntu.com --recv 7F0CEB10
gpg: requesting key 7F0CEB10 from hkp server keyserver.ubuntu.com
gpg: key 7F0CEB10: public key "Richard Kreuter " imported
gpg: no ultimately trusted keys found
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
jdoe@quark:/etc/apt$

(3) To install the package, update the sources and then install:

$ sudo apt-get update
$ sudo apt-get install mongodb-10gen

(4) Create a directory for datafiles and database logs.
By default, MongoDB tries to store datafiles in /data/db.
If the directory does not exist, the server will fail to start unless you explicitly assign a different, existing location for the datafiles.
For security reasons, make sure the directory is owned by a non-root user.

(a) You can create the default directory: 
$ sudo mkdir -p  /data/db/
$ sudo chown `id -u` /data/db

(b) You can choose to store datafiles somewhere else. If you do, make sure to specify the datafile location with the --dbpath option when starting the MongoDB server.
$ sudo mkdir -p  
$ sudo chown `id -u`
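
The requirement in step (4) can be checked before starting the server. The following is a minimal sketch, assuming a POSIX shell; check_dbpath is a hypothetical helper name and is not part of MongoDB.

```shell
#!/bin/sh
# check_dbpath DIR (hypothetical helper): succeed only if DIR exists
# and the current user is allowed to write to it.
check_dbpath() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "directory not writable by $(id -un): $dir" >&2
        return 2
    fi
    return 0
}

# Example use (not executed here):
#   check_dbpath /data/db && mongod --dbpath /data/db
```

Running the check first gives a clearer error message than letting the server fail at startup.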

(5) You can test the installation by starting the MongoDB shell:

jdoe@quark:~$ mongo
MongoDB shell version: 2.0.1
connecting to: test
>

The installation creates the mongodb user and installs files according to the default configuration below:
Installed architecture

Binaries installed in /usr/bin

jdoe@quark:/usr/bin$ ls -l mongo*
-rwx... mongo            -- database shell
-rwx... mongod           -- mongodb daemon. This is the core database process
-rwx... mongodump        -- hot backups. Creates a binary representation of an entire database, a collection, or collection objects
-rwx... mongoexport      -- exports a collection to JSON or CSV
-rwx... mongofiles       -- tool for using GridFS, a mechanism for manipulating large files in MongoDB
-rwx... mongoimport      -- imports a JSON/CSV/TSV file into MongoDB
-rwx... mongorestore     -- restores the output of mongodump
-rwx... mongos           -- sharding controller. Provides automatic load balancing and partitioning
-rwx... mongostat        -- shows usage statistics (counts and percentages) for a running mongodb instance
-rwx... mongotop         -- provides read/write statistics on collections and namespaces in a mongodb instance

Configuration file installed at /etc/mongodb.conf

Database files will be created in: dbpath=/var/lib/mongodb
Log files will be created in: logpath=/var/log/mongodb/mongodb.log
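
To see which locations a given configuration file actually sets, the key=value lines can be read directly. This is a minimal sketch, assuming the file uses plain key=value lines like the dbpath and logpath entries above; conf_get is a hypothetical helper name.

```shell
#!/bin/sh
# conf_get KEY FILE (hypothetical helper): print the value of the first
# "KEY=value" line in FILE. Lines starting with "#" never match, because
# the key must be the first non-blank text on the line.
conf_get() {
    key="$1"
    file="$2"
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | head -n 1
}

# Example use (not executed here):
#   conf_get dbpath /etc/mongodb.conf
#   conf_get logpath /etc/mongodb.conf
```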

Starting up and Stopping MongoDB

mongod is MongoDB's core database process. It can be started manually, either in the foreground or as a daemon.
mongod accepts a number of startup options.
They fall into general, replication, master/slave, replica set and sharding categories. Some of the startup options are:

--port      num   TCP port which mongodb will use
--maxConns  num   max # of simultaneous connections
--logpath   path  log file path
--logappend       append instead of overwrite log file
--fork            fork server process (daemon)
--auth            authenticate users
--dbpath    path  directory for datafiles
--directoryperdb  each database will be stored in its own directory
--shutdown        shut down the server

(a) Start mongod in the foreground in a terminal, with data stored in /mongodb/data and the default port, 27017.
(You need to create /mongodb/data first.)

jdoe@quark:~$ mkdir -p /mongodb/data
jdoe@quark:~$ mongod --dbpath /mongodb/data  
...
Sun Nov  6 19:05:09 [initandlisten] options: { dbpath: "/mongodb/data" }
Sun Nov  6 19:05:09 [websvr] admin web console waiting for connections on port 28017
Sun Nov  6 19:05:09 [initandlisten] waiting for connections on port 27017
...

jdoe@quark:~$ ps -ef | grep mongo
jdoe   20519 16142  0 19:05 pts/1    00:00:10 mongod --dbpath /mongodb/data
jdoe   20566 20034  0 19:49 pts/2    00:00:00 grep mongo

jdoe@quark:~$ ls -l /mongodb/data
total 4
-rwxr-xr-x 1 jdoe jdoe 6 2011-11-06 19:05 mongod.lock

(b) Start mongod as a daemon, running on TCP port 20012, with data stored in /mongodb/data and logs in /mongodb/logs.

jdoe@quark:~$ mongod --fork --port 20012 --dbpath /mongodb/data/ --logpath /mongodb/logs/mongodblog --logappend 
forked process: 20655
jdoe@quark:~$ all output going to: /mongodb/logs/mongodblog
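
The daemon start in (b) writes to a log file, so the log directory has to exist beforehand. The sketch below, assuming a POSIX shell, creates that directory and prints the command line instead of running it, so it can be reviewed first. daemon_cmd is a hypothetical helper name; the flags are the ones used in (b).

```shell
#!/bin/sh
# daemon_cmd PORT DBPATH LOGPATH (hypothetical helper): create the directory
# that will hold the log file, then print the mongod command line.
# The command is printed, not executed.
daemon_cmd() {
    port="$1"
    dbpath="$2"
    logpath="$3"
    mkdir -p "$(dirname "$logpath")" || return 1
    echo "mongod --fork --port $port --dbpath $dbpath --logpath $logpath --logappend"
}

# Example use (not executed here):
#   daemon_cmd 20012 /mongodb/data /mongodb/logs/mongodblog
```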

Alternatively, you can start and stop MongoDB through the upstart job installed by the package:
jdoe@quark:~$ sudo start mongodb
mongodb start/running, process 2824

jdoe@quark:~$ sudo stop mongodb
mongodb stop/waiting

Stopping MongoDB

Control-C will do it if the server is running in the foreground. MongoDB waits until all ongoing operations complete and then exits.
Alternatively:

(a) Call mongod with the --shutdown option

jdoe@quark:~$ mongod --dbpath /mongodb/data --shutdown
killing process with pid: 20746

or
(b) Use the database shell (mongo)
(Note the confusing output of the db.shutdownServer() call. Although the messages suggest a failure, the database is shut down as expected.)

jdoe@quark:~$ mongo quark:20012
MongoDB shell version: 2.0.1
connecting to: quark:20012/test
> use admin
switched to db admin
> db.shutdownServer()
Sun Nov  6 20:23:59 DBClientCursor::init call() failed
Sun Nov  6 20:23:59 query failed : admin.$cmd { shutdown: 1.0 } to: quark:20012
server should be down...
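
A third option is to signal the server process directly. The sketch below assumes that mongod.lock in the data directory holds the PID of the running server (the 6-byte lock file in the earlier listing is consistent with a 5-digit PID plus a newline) and that the server exits cleanly on SIGTERM. lock_pid and stop_by_lock are hypothetical helper names.

```shell
#!/bin/sh
# lock_pid DBPATH (hypothetical helper): print the PID stored in
# DBPATH/mongod.lock. Fails if the file is missing, empty, or not numeric.
lock_pid() {
    lock="$1/mongod.lock"
    [ -s "$lock" ] || return 1
    pid=$(tr -d '[:space:]' < "$lock")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo "$pid"
}

# stop_by_lock DBPATH (hypothetical helper): send SIGTERM to the process
# named in the lock file. Assumption: SIGTERM triggers a clean shutdown.
stop_by_lock() {
    pid=$(lock_pid "$1") || return 1
    kill -TERM "$pid"
}

# Example use (not executed here):
#   stop_by_lock /mongodb/data
```

The sketch deliberately avoids SIGKILL, which would give the server no chance to finish ongoing operations.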

(Extracts from a good discussion posted by James Hamilton on his blog. You can check the original article and comments here.)

Although a couple of years old, James Hamilton's post provides a good requirements-based breakdown of data storage systems. These are some of his points:

The world of structured storage extends far beyond relational systems (Oracle, DB2, SQL Server, MySQL, etc.).
Many applications do not need the rich programming model of relational systems, and some are better serviced by lighter-weight, easier-to-administer, and easier-to-scale solutions.

Structured storage approaches can be classified by the major customer requirements they address.
These are feature-first, scale-first, simple structured storage, and purpose-optimized stores.

(1) Feature-First

Traditional relational database management systems (RDBMSs) are the structured storage systems of choice here.
This category is driven by the requirements of enterprise financial, human resources, and customer relationship management systems (FIN, HR, CRM).

Examples include Oracle, MySQL, SQL Server, PostgreSQL, Sybase, and DB2.

Cloud solutions here include:

Amazon RDS (Relational Database Service), a cloud-based solution that essentially makes the functionality of Oracle or MySQL databases available from the cloud.
Microsoft SQL Azure, which follows the same line.
Oracle Public Cloud, just launched at Oracle OpenWorld 2011.

(2) Scale-First

This is the domain of very high scale websites (e.g. Facebook, Gmail, Amazon, Yahoo, etc.).
Here the ability to scale matters more than additional features, and none of these sites could run on a single RDBMS.
The problem is that the full relational database model (including joins, aggregations, and stored procedures) is difficult to scale, especially in distributed contexts.
Distributing data across tens to thousands of RDBMS instances while still supporting it as if it were under a single RDBMS engine is difficult.
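
To make the distribution problem concrete, here is a minimal sketch of the simplest partitioning scheme: hashing a key to pick one of N instances. It is an illustration only, not something taken from Hamilton's post, and shard_for is a hypothetical helper name. It assumes a POSIX shell with the standard cksum utility.

```shell
#!/bin/sh
# shard_for KEY N (hypothetical helper): map KEY to an instance index in
# the range 0..N-1 using the CRC checksum printed by cksum.
shard_for() {
    key="$1"
    n="$2"
    sum=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
    echo $(( sum % n ))
}

# Example use: decide which of 4 instances holds each record.
for key in user:1 user:2 order:17; do
    echo "$key -> instance $(shard_for "$key" 4)"
done
```

A lookup by key touches a single instance, but a join or aggregation over many keys has to visit several instances and combine the results, which is where the full relational model becomes hard to support.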

key-value store

HBase

Amazon SimpleDB

Project Voldemort

Cassandra

Hypertable

(3) Simple Structured Storage

Applications that have a structured storage requirement but need neither the features, cost and complexity of an RDBMS nor very high scalability.
Some implementations include:
Facebook: email inbox search (Cassandra)
Amazon: retail shopping cart (Dynamo)
Berkeley DB

(4) Purpose-Optimized stores

Mike Stonebraker argued that the existing commercial RDBMS offerings do not meet the needs of many important market segments.
Some special-purpose real-time and stream processing solutions (StreamBase, Vertica, VoltDB) have beaten RDBMSs in benchmarks by more than 30x...

Readings:
Mike Stonebraker, "One Size Fits All"
