The challenges involved in storing and processing data are listed below:
Large volumes of data:
It is estimated that the world holds around 2 zettabytes of data (1 zettabyte = 10^21 bytes). The social networking giant Facebook hosts 10 billion photos, taking up one petabyte of storage. The logs generated by the Facebook site alone occupy terabytes of storage per day.
Processing:
How do you process petabytes of data?
RDBMS systems are capable of storing data on the order of gigabytes to terabytes. They are not built for processing huge amounts of data.
What about grid computing?
In grid computing, the data is stored on a single SAN (storage area network). For processing, the data is moved to N machines (the grid), where the computation is done. The problems with grid computing are:
- A lot of data moves over the network between the storage and the computation machines. Moving just 1TB of data can take around 2 hours.
- The developer has to write programs that handle the mechanics of data flow, the coordination between the computing machines, and machine failures.
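The 2-hour figure for moving 1TB can be checked with a rough calculation. The sketch below assumes a 1 Gbit/s link and ignores protocol overhead; the link speed is an assumption for illustration and is not stated above.

```python
def transfer_time_hours(size_bytes: float, link_bits_per_sec: float) -> float:
    """Hours needed to move size_bytes over a link, ignoring protocol overhead."""
    seconds = size_bytes * 8 / link_bits_per_sec
    return seconds / 3600


ONE_TB = 10**12   # decimal terabyte, in bytes
GIGABIT = 10**9   # assumed link speed of 1 Gbit/s (hypothetical, for illustration)

hours = transfer_time_hours(ONE_TB, GIGABIT)
print(f"1 TB over a 1 Gbit/s link: about {hours:.1f} hours")
```

Under these assumptions the transfer takes a little over two hours, which is in line with the estimate in the list above. A slower or shared link would take proportionally longer.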
An RDBMS can store only structured data in the form of tables. It imposes a schema on the data, so you cannot store unstructured data in it.
Transfer rate:
It is the rate at which you can read data from the disk. In the 1990s, a typical hard disk held 1.2GB and had a transfer rate of 4.4MB/s, so reading the whole disk took about 5 minutes.
Now, in 2013, we have disks of 1TB with transfer rates of 100 to 150MB/s. Reading the whole 1TB from the disk takes around two to three hours.
Storage capacity has grown drastically, but the transfer rate has not kept pace. Two to three hours is a long time to read one terabyte of data from a disk.
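The read times above follow directly from capacity divided by transfer rate. Here is a small sketch of that arithmetic using the figures quoted in this post (decimal units are assumed, so 1TB is taken as 1,000,000MB):

```python
def full_read_minutes(capacity_mb: float, rate_mb_per_sec: float) -> float:
    """Minutes needed to read an entire disk sequentially at a constant rate."""
    return capacity_mb / rate_mb_per_sec / 60


# Figures quoted in the post; decimal units are an assumption.
disk_1990 = full_read_minutes(1_200, 4.4)           # 1.2GB at 4.4MB/s
disk_2013_slow = full_read_minutes(1_000_000, 100)  # 1TB at 100MB/s
disk_2013_fast = full_read_minutes(1_000_000, 150)  # 1TB at 150MB/s

capacity_growth = 1_000_000 / 1_200   # how many times larger the disk became
rate_growth_low = 100 / 4.4           # how many times faster, lower bound
rate_growth_high = 150 / 4.4          # how many times faster, upper bound

print(f"1990s disk: {disk_1990:.1f} min")
print(f"2013 disk:  {disk_2013_fast:.0f} to {disk_2013_slow:.0f} min")
print(f"capacity grew about {capacity_growth:.0f}x, "
      f"transfer rate only {rate_growth_low:.0f}x to {rate_growth_high:.0f}x")
```

With these numbers, capacity grew by a factor of several hundred while the transfer rate grew by a factor of a few tens, which is why reading a full disk now takes hours instead of minutes.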
Seek time:
Seeking is the process of moving the disk's head to a particular place on the disk. Seek time is improving more slowly than the transfer rate.
In an RDBMS, if you want to read one row from a table of 1 billion rows (assuming there are no indexes on the table), the system has to find the row by reading the data file from the start. This is where seek time becomes a problem.
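To make the cost of an unindexed lookup concrete, here is a minimal in-memory sketch. It is not a model of a disk or of any real RDBMS; the table contents and the function names `find_by_scan` and `find_by_index` are hypothetical, and the table is kept small so the example runs quickly.

```python
# A toy "table": each row is (row_id, name). 100,000 rows stand in for 1 billion.
rows = [(row_id, f"name-{row_id}") for row_id in range(100_000)]


def find_by_scan(table, target_id):
    """No index: read rows from the start until the target is found.

    Returns the matching row (or None) and the number of rows read.
    """
    for rows_read, row in enumerate(table, start=1):
        if row[0] == target_id:
            return row, rows_read
    return None, len(table)


# A simple index: maps row_id to the row's position in the table.
index = {row[0]: position for position, row in enumerate(rows)}


def find_by_index(table, row_index, target_id):
    """With an index: jump straight to the row's position."""
    position = row_index.get(target_id)
    return None if position is None else table[position]


row, rows_read = find_by_scan(rows, 99_999)
print(f"scan read {rows_read} rows to find {row}")
print(f"index lookup found {find_by_index(rows, index, 99_999)}")
```

In the worst case the scan touches every row in the table, while the indexed lookup touches one. On a real disk each of those reads also pays for head movement and data transfer, so the gap becomes much more painful at a billion rows.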
Velocity:
The rapid growth of data causes storage problems. An RDBMS scales up (vertically): to accommodate new data, you have to replace the hard disk with a bigger one.