March 3, 2012

Review of a research paper “How To Build a High-Performance Data Warehouse”

(Download reference research paper)

  • The author highlights three ways to handle scalability issues and achieve high performance at low cost:

    1. Shared Memory
    2. Shared Disk
    3. Shared Nothing

    In “How To Build a High-Performance Data Warehouse” all three architectures appear, but the author ultimately recommends the shared-nothing architecture, while listing a few organizations that use the other two. Each is workable in the right situation. In shared memory, all processors share the same disks and RAM, so when the processors' combined reads and writes saturate the bandwidth of the shared data bus, the shared-memory approach has hit its limit and a different approach is needed.

    The shared-disk approach improves on shared memory because each processor has its own memory and can cache queries and results to some extent. However, the read/write bandwidth of the bus to the shared disks remains a constraint.

    Shared nothing does not mean the processors have nothing; it means each processor has its own memory and its own set of disks, fully available to it. These processors can be single-core or multi-core and can cost from $10K to $700K, so they are affordable. The author notes that Google is one of the vendors using this approach.
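As a rough illustration of the shared-nothing idea, the sketch below hash-partitions rows across independent nodes, each of which scans only its own local data; the `Node` class and `partition` function are my own hypothetical names for illustration, not code from the paper.

```python
# A minimal sketch of a shared-nothing layout (hypothetical names, not from
# the paper): each node owns its data privately; a query runs on every node
# against local rows only, and the partial results are combined at the end.

class Node:
    """One shared-nothing node: private memory and disks, here a local list."""
    def __init__(self):
        self.rows = []          # stands in for the node's own set of disks

    def local_scan(self, predicate):
        # Each node scans only its own rows -- no shared bus or shared disk.
        return [r for r in self.rows if predicate(r)]

def partition(rows, nodes, key):
    # Horizontal partitioning: route each whole row to one node by hashing a key.
    for row in rows:
        nodes[hash(key(row)) % len(nodes)].rows.append(row)

nodes = [Node() for _ in range(4)]
sales = [{"id": i, "amount": i * 10} for i in range(100)]
partition(sales, nodes, key=lambda r: r["id"])

# Run the query on all nodes (in parallel on real hardware) and merge.
big_sales = [r for n in nodes for r in n.local_scan(lambda r: r["amount"] > 900)]
print(len(big_sales))  # prints 9: the rows with amounts 910..990
```

Because no node ever reads another node's disks, adding nodes adds both capacity and bandwidth, which is the scalability argument behind shared nothing.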

  • Three approaches to parallelism are used for better performance, and the one I find most suitable:

    The three approaches to parallelism are:

    1. Shared Memory
    2. Shared Disk
    3. Shared Nothing

    • How can shared memory be used for better performance?

      This approach works well where multiprocessor machines are available and data traffic is light enough to stay within the data bus bandwidth. Rather than provisioning separate disks and memory per processor, a single disk and a single memory serve all the processors, which keeps the design simple and performant at small scale.

    • How can shared disk be used for better performance?

      This approach improves on shared memory because it gives each processor node its own independent memory rather than a shared one, although the nodes still access a single collection of disks. If data traffic is limited and a “shared-cache” design is feasible, the approach can perform well; the author notes that vendors implement “shared-cache” designs to make it work better. When one disk collection serves multiple processors that each have their own memory, it is clearly better to cache some results than to hit the disks on every access. This makes the approach suitable for OLTP, whose short transactions repeatedly touch a small, hot working set that caches well, but not for data warehousing, whose queries scan large volumes of detailed data, so a shared cache adds little to performance there.

    • How can shared nothing be used for better performance?

      This approach improves on shared disk because, instead of all processors sharing one disk collection, each processor has its own set of disks. Data is initially horizontally partitioned across the nodes, which is cost-effective and works well for star-schema queries. The author further suggests combining it with hardware acceleration and scalability through software, but the base remains the shared-nothing approach, where single-core or multi-core PCs are available at affordable prices.

    • Which approach is more suitable?

      The shared-nothing approach is the most suitable because it starts with horizontally partitioned data but can be upgraded to vertically partitioned data. The author presents this change as up to 10 times better, depending on the situation. For OLTP it works well with horizontally partitioned data, and for data warehousing it can be adapted as required.
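To make the horizontal-versus-vertical distinction concrete, here is a small sketch (my own illustration, not code from the paper) storing the same table both row-wise and column-wise; a single-column aggregate over the column layout touches only that column's values, which is why the vertical organization can pay off for warehouse-style scans.

```python
# Hypothetical illustration of horizontal (row) vs. vertical (column) layout.
rows = [{"id": i, "region": "east" if i % 2 else "west", "amount": i * 10}
        for i in range(1000)]

# Horizontal layout: the scan reads every full row, all three fields each time.
total_row_layout = sum(r["amount"] for r in rows)

# Vertical layout: each column is stored separately; the aggregate reads only
# the "amount" column and never touches "id" or "region".
columns = {
    "id": [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}
total_col_layout = sum(columns["amount"])

print(total_row_layout == total_col_layout)  # prints True: same answer, less data touched
```

The answers are identical; the difference is how many bytes a single-column query must read, which is where the "up to 10 times better" claim for warehouse workloads comes from.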

  • Supporting reasons for my view:

    1. Cost effective: Instead of worrying about the data bus bandwidth limitation, it provides independent PCs, each with its own memory and set of disks.
    2. Most scalable: Shared memory is the least scalable and shared disk moderately scalable, while shared nothing is the most scalable of the three.
    3. Upgradeable: It can be upgraded to a vertically partitioned database architecture for up to 10 times better performance.
    4. Fast & reliable: Instead of adding disk space or installing more sets of disks, you can use compression; rather than spending on storage per byte, you spend on processors that compress and decompress the data quickly.
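The compression trade-off in point 4 can be sketched with the Python standard library (a generic illustration, not the paper's specific scheme): CPU cycles are spent on zlib compression so that far fewer bytes have to be stored on, and read back from, the disks.

```python
import zlib

# Hypothetical illustration of trading CPU for I/O: repetitive warehouse-style
# data compresses well, so fewer bytes hit the disks at the cost of some CPU.
raw = ("east,1,10.00\n" * 50_000).encode()

compressed = zlib.compress(raw, level=6)   # spend CPU cycles here...
restored = zlib.decompress(compressed)     # ...and here, on read-back

print(len(raw), len(compressed))  # compressed is a small fraction of raw
assert restored == raw            # lossless: the detailed data survives intact
```

The exact ratio depends on the data, but for the highly repetitive rows typical of fact tables, spending processor time this way is usually cheaper than buying the equivalent disk bandwidth.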
Last updated: March 19, 2014