Language Wrangling: Running Google’s Sawzall on Quantcast’s MapReduce Cluster

Submitted By rl10
Motivation

It’s April 2011. Quantcast is running one of the largest MapReduce clusters out there. Engineers write Java code that gets executed across the whole cluster efficiently on terabytes of data. Wonderful!

Unfortunately, not everyone who needs access to data is an engineer. Also, even engineers don’t feel like writing a new MapReduce job every time they want to take a different look at a data set.

We realized we needed a more productive data analysis tool. Our first thought was to get SQL running on our petabytes of data, but that looked like a large undertaking: SQL’s data access semantics are fairly sophisticated, which implies a large area of friction with Quantcast’s MapReduce implementation. A simpler option was to get Google’s recently open-sourced Sawzall running on Quantcast’s MapReduce cluster. As a language, Sawzall is not as easy to use as SQL, especially for non-engineers, but it is still much simpler than Java. And it seemed reasonably easy to integrate with Quantcast’s MapReduce implementation, because its interface to MapReduce is much narrower and better defined than SQL’s.
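To make the “narrow interface” point concrete, here is a minimal sketch in Python (not Sawzall, and not Quantcast’s actual harness). It assumes the Sawzall-style model in which a program sees one record at a time and can only emit values into named aggregation tables; the map phase runs that per-record logic and the reduce phase sums what was emitted. The record layout (tab-separated user, country, bytes), the table names, and every function name below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical emit signature: (table name, key, value).
Emit = Callable[[str, str, int], None]
Pair = Tuple[Tuple[str, str], int]


def process_record(record: str, emit: Emit) -> None:
    """Per-record logic: the only part an analyst would have to write.

    Assumes a tab-separated record: user, country, bytes (illustrative layout).
    """
    _user, country, size = record.split("\t")
    emit("hits_by_country", country, 1)
    emit("bytes_by_country", country, int(size))


def map_phase(records: Iterable[str],
              program: Callable[[str, Emit], None]) -> List[Pair]:
    """Run the per-record program and collect ((table, key), value) pairs."""
    out: List[Pair] = []

    def emit(table: str, key: str, value: int) -> None:
        out.append(((table, key), value))

    for record in records:
        program(record, emit)
    return out


def reduce_phase(pairs: Iterable[Pair]) -> Dict[Tuple[str, str], int]:
    """Sum all values emitted under the same (table, key)."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(records: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Single-process stand-in for a MapReduce job."""
    return reduce_phase(map_phase(records, process_record))
```

Under these assumptions, the whole contract between the analysis program and the cluster is “read a record, emit to a table”, which is far less surface area than SQL’s joins, predicates, and grouping semantics.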

Challenge

This theoretical “ease of integration with MapReduce” turned out (surprise!) to be harder in practice than expected. First, Sawzall was not open-sourced with a MapReduce harness, only as a compilation/execution engine plus a command-line tool of little practical utility. Second, Sawzall runs best on protocol buffers, but Quantcast stores its data in different binary and text formats. Third, Quantcast’s MapReduce, although based on Hadoop, lacked streaming capability because it was branched off an old Hadoop version that predates it.

Big Picture

We integrated Sawzall execution with our cluster software by writing a generic MapReduce…...
