By the Google API Infrastructure Team


As we described in a previous post, earlier this week we experienced an outage in our API infrastructure. Today we’re providing an incident report that details the nature of the outage and our response.

The following is the incident report for the Google API infrastructure outage that occurred on April 30, 2013. We understand this service issue has impacted our valued developers and users, and we apologize to everyone who was affected.

Issue Summary

From 6:26 PM to 7:58 PM PT, requests to most Google APIs resulted in 500 error response messages. Google applications that rely on these APIs also returned errors or had reduced functionality. At its peak, the issue affected 100% of traffic to this API infrastructure. Users could continue to access certain APIs that run on separate infrastructures. The root cause of this outage was an invalid configuration change that exposed a bug in a widely used internal library.

Timeline (all times Pacific Time)
  • 6:19 PM: Configuration push begins
  • 6:26 PM: Outage begins
  • 6:26 PM: Pagers alerted teams
  • 6:54 PM: Failed configuration change rollback
  • 7:15 PM: Successful configuration change rollback
  • 7:19 PM: Server restarts begin
  • 7:58 PM: 100% of traffic back online

Root Cause

At 6:19 PM PT, a configuration change was inadvertently released to our production environment without first being released to the testing environment. The change specified an invalid address for the authentication servers in production. This exposed a bug in the authentication libraries that caused them to block permanently while attempting to resolve the invalid address to physical servers. In addition, the internal monitoring systems blocked permanently on this call to the authentication library. The combination of the bug and the configuration error quickly consumed all of the serving threads. Traffic queued indefinitely waiting for a serving thread to become available. The servers began repeatedly hanging and restarting as they attempted to recover, and at 6:26 PM PT the service outage began.
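
To make the failure mode concrete, here is a minimal sketch in Python (entirely hypothetical names; the report does not publish Google's actual code) of how one blocking call with no timeout can consume a fixed pool of serving threads, and how a bounded wait avoids it:

    import threading
    from concurrent.futures import ThreadPoolExecutor

    never_resolves = threading.Event()  # stands in for resolving the invalid address

    def resolve_auth_server_buggy():
        never_resolves.wait()  # the bug: no timeout, so the thread blocks forever

    def resolve_auth_server_fixed(timeout_s=1.0):
        if not never_resolves.wait(timeout_s):  # the fix: a bounded wait
            raise TimeoutError("authentication address did not resolve")

    def handle_request(_request):
        try:
            resolve_auth_server_fixed()
        except TimeoutError:
            return 503  # fail fast; the serving thread is freed for other traffic
        return 200

    pool = ThreadPoolExecutor(max_workers=4)  # a fixed pool of serving threads
    print([f.result() for f in [pool.submit(handle_request, i) for i in range(8)]])
    # With the buggy variant, the first 4 requests would consume all 4 threads
    # and every later request would queue forever -- the outage described above.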

Resolution and Recovery

At 6:26 PM PT, the monitoring systems alerted our engineers, who investigated and quickly escalated the issue. By 6:40 PM, the incident response team had identified that the monitoring system was exacerbating the problem caused by the bug.

At 6:54 PM, we attempted to roll back the problematic configuration change. This rollback failed because of complexity in the configuration system, which caused our security checks to reject it. These problems were addressed, and we successfully rolled back at 7:15 PM.

Some jobs started to recover slowly, and we determined that overall recovery would be faster if we restarted all of the API infrastructure servers globally. To help with the recovery, we turned off some of the monitoring systems that were triggering the bug, and we began restarting servers gradually at 7:19 PM to avoid possible cascading failures from a wide-scale restart. By 7:49 PM, 25% of traffic was restored, and 100% of traffic was routed to the API infrastructure at 7:58 PM.
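
A gradual, health-checked restart like the one described above might look like this sketch (hypothetical code, not Google's tooling): servers come back in growing waves, and the ramp halts if any wave is unhealthy, so one bad batch cannot cascade across the fleet.

    import time

    class Server:  # stand-in for a real fleet member
        def __init__(self, name):
            self.name = name
        def restart(self):
            print("restarting", self.name)
        def healthy(self):
            return True

    def restart_gradually(servers, waves=(0.05, 0.25, 1.0), settle_s=30):
        restarted = 0
        for fraction in waves:  # e.g. 5% of the fleet, then 25%, then all of it
            target = max(restarted + 1, int(len(servers) * fraction))
            for server in servers[restarted:target]:
                server.restart()
            restarted = target
            time.sleep(settle_s)  # let traffic ramp back up before growing the wave
            if not all(s.healthy() for s in servers[:restarted]):
                raise RuntimeError("wave unhealthy; halting the ramp-up")

    restart_gradually([Server(f"api-{i}") for i in range(20)], settle_s=0)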

Corrective and Preventative Measures

In the last two days, we’ve conducted an internal review and analysis of the outage. The following are the actions we are taking to address the underlying causes of the issue, help prevent recurrence, and improve response times:
  • Disable the current configuration release mechanism until safer measures are implemented. (Completed.)
  • Change rollback process to be quicker and more robust.
  • Fix the underlying authentication libraries and monitoring to correctly timeout/interrupt on errors.
  • Programmatically enforce staged rollouts of all configuration changes (see the sketch after this list).
  • Improve process for auditing all high-risk configuration options.
  • Add a faster rollback mechanism and improve the traffic ramp-up process, so any future problems of this type can be corrected quickly.
  • Develop a better mechanism for quickly delivering status notifications during incidents.
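
As a sketch of what the staged-rollout enforcement could look like (hypothetical code; the report does not describe the actual mechanism), a configuration version is refused at any stage until it has been verified in every earlier one:

    STAGES = ["testing", "canary", "production"]

    class StagedRollout:
        def __init__(self):
            self.verified = {stage: set() for stage in STAGES}

        def push(self, config_version, stage):
            # Refuse the shortcut behind this outage: nothing reaches
            # production without passing through every earlier stage.
            for earlier in STAGES[:STAGES.index(stage)]:
                if config_version not in self.verified[earlier]:
                    raise PermissionError(
                        f"{config_version} has not been verified in {earlier!r}")
            self.verified[stage].add(config_version)

    rollout = StagedRollout()
    rollout.push("cfg-0430-1", "testing")
    rollout.push("cfg-0430-1", "canary")
    rollout.push("cfg-0430-1", "production")  # allowed: all earlier stages passed
    try:
        rollout.push("cfg-0430-2", "production")
    except PermissionError as err:
        print(err)  # blocked: this version skipped testing and canary
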
Google is committed to continually and quickly improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.

Sincerely,

The Google API Infrastructure Team


Posted by Scott Knaster, Editor

By Ilya Grigorik, Web Performance Engineer

Open-source developers all over the world contribute to millions of projects every day: writing and reviewing code, filing and discussing bug reports, updating documentation and project wikis, and so forth. The data generated from this activity can reveal interesting trends across many industries, including popularity of programming languages over time, defect rates, contribution metrics, and popularity of specific frameworks and libraries.

The challenge in extracting these trends is gathering the data. Each project has its own distributed workflow, code repositories, and conventions. Having hosted dozens of my own projects on GitHub, I've long wanted to analyze the developer activity from the 2.6M+ public projects hosted on GitHub. Hence, earlier this year GitHub Archive was born!

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. Each day it archives over 120,000 public activities, ranging from new commits and fork events to opening and closing tickets, each with detailed metadata.

Once I collected the data, I needed a tool to analyze it, and that is when I found Google BigQuery. Based on the research behind Dremel, a popular internal tool at Google for analyzing web-scale datasets, BigQuery allowed me to easily import the entire dataset and use a familiar SQL-like syntax to comb through gigabytes of data in seconds. Plus, the tool will scale to terabyte datasets, so there is plenty of room to grow!

The best news is that thanks to collaboration between the GitHub and BigQuery teams, the GitHub dataset is now public and available for you to slice and dice in any way you like. No need to worry about data gathering or database schemas: BigQuery will do all the heavy lifting, and you can just compose your queries to be executed in real time.

Here's a real-world example. What are the most popular programming languages on GitHub over the past month?
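
The chart below came from a query along these lines. This is a sketch only: the table and column names follow what githubarchive.org documented for the public dataset at the time, and the Python client shown is simply one way to run it, so treat the schema and the call details as assumptions.

    # Count push events per language in the public GitHub Archive dataset.
    # Requires the google-cloud-bigquery package and a BigQuery account.
    from google.cloud import bigquery

    QUERY = """
    SELECT repository_language AS language, COUNT(*) AS pushes
    FROM [githubarchive:github.timeline]
    WHERE type = 'PushEvent' AND repository_language != ''
    GROUP BY language
    ORDER BY pushes DESC
    LIMIT 10
    """  # add a filter on created_at to restrict this to the past month

    client = bigquery.Client()
    job = client.query(QUERY, job_config=bigquery.QueryJobConfig(use_legacy_sql=True))
    for row in job.result():
        print(row["language"], row["pushes"])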


[Chart: number of commits by language]

If you are curious for more, sign up for BigQuery and follow the instructions on githubarchive.org to access the GitHub dataset. You can use the free 100GB query quota to run your analysis and perhaps even win some of the prizes from the GitHub Data Challenge!


Ilya Grigorik is a Web Performance Engineer and Advocate at Google, an open-source evangelist, and an analytics geek. You can find him on GitHub under igrigorik, and blogging about web performance at igvita.com.

Posted by Scott Knaster, Editor

By Bartholomew Furrow, Software Engineer

In Mountain View and in offices around the world, Googlers are spending their 20% time getting ready for Google Code Jam 2011, preparing algorithmic problems for the 10,000 or more contestants we expect to compete in our Qualification Round this Friday.

A good Code Jam problem has a story to ground it in some version of reality: soccer, ninja and messages from alien cultures have all served admirably. Cushioned by the story, the core of a Code Jam problem is an algorithmic puzzle whose solution needs anything from a few lines of code to a deep understanding of flow algorithms or number theory.

[Image: The ninja in the middle is solidly grounded in reality.]

Anyone at Google can create Code Jam problems, which means that our methods for inventing them vary wildly. One author might come across a real-life situation, think about what algorithm would solve it, and base a problem on that; another author might think about how to make a problem out of a video game. Sometimes a problem author will start with an algorithm and concoct a problem that it solves. We also really seem to like inventing weird situations on chess boards.

With the story and the problem chosen, our work is only partly done. The problem has to be stated in such a way that it will be clear, even for an audience from 125 countries. At least three engineers work on each problem’s statement: that group includes at least one native English speaker to make sure the grammar is all correct, and at least one non-native English speaker to make sure the language is clear enough.

The toughest part about setting up a problem like this is verifying that contestants got it right. In Code Jam, we do that by providing contestants with an input file full of test data. They send back their program’s output, which should be the answer to the input file’s question. The hard part is deciding what goes in that input file: we need edge cases, plenty of average cases, and a good number of cases that make sure the contestant’s code is fast enough. To create all of those, we generate some cases by hand and others pseudo-randomly. We’ve been known to generate a test case or two out of ASCII art, or as a creative-writing exercise.
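
As a sketch of that mix (the input format and limits here are invented, not from a real Code Jam problem), an input file might combine hand-written edge cases, seeded pseudo-random average cases, and one large case that punishes slow solutions:

    import random

    MAX_N = 10**5

    def make_cases(seed=2011, n_random=50):
        rng = random.Random(seed)  # fixed seed, so the file can be regenerated
        cases = [[0], [1], [MAX_N]]  # hand-written edge cases
        for _ in range(n_random):  # plenty of average cases
            size = rng.randint(1, 20)
            cases.append([rng.randint(0, MAX_N) for _ in range(size)])
        cases.append(list(range(MAX_N, 0, -1)))  # a big case to catch slow code
        return cases

    def write_input(cases, path="A-large.in"):
        with open(path, "w") as f:
            f.write(f"{len(cases)}\n")  # Code Jam inputs start with the case count
            for case in cases:
                f.write(f"{len(case)}\n" + " ".join(map(str, case)) + "\n")

    write_input(make_cases())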

Finally, we solve the problems ourselves. We require at least three solutions made by different engineers, and sometimes we have those engineers write solutions that we know to be wrong – just to make sure our test data catches them out.
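
A sketch of that check (a hypothetical harness, not the real Code Jam tooling): every trusted solution must agree on every case, and every deliberately wrong solution must disagree somewhere, otherwise the test data is too weak.

    def validate_test_data(cases, trusted, known_wrong):
        reference = [trusted[0](case) for case in cases]
        for solve in trusted[1:]:  # independent correct solutions must agree
            assert all(solve(c) == r for c, r in zip(cases, reference))
        for solve in known_wrong:  # known-wrong solutions must be caught
            assert any(solve(c) != r for c, r in zip(cases, reference)), \
                "test data failed to catch a known-wrong solution"
        return reference  # these answers become the official outputs

    # e.g. validate_test_data(cases, [solve_fast, solve_brute], [solve_off_by_one])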

The end result of this process is the kind of problem we’re proud to ask our contestants to solve. In 2011 more than any other year, we’re excited about the creativity of our colleagues and the problems we’re planning to pose. We hope you’ll enjoy the problems from the other side – and if you’re a great software engineer, maybe come help us write them in 2012.

You can register for Google Code Jam 2011 at http://code.google.com/codejam, and you’ll see the first problems of the year in the Qualification Round this Friday, May 6, starting at 23:00 UTC. For even more details about how we get problems ready for Code Jam, you can read our official problem-preparation guide.


Bartholomew Furrow spends 80% of his time at Google finding ways to eliminate bad search ads, and the rest on Code Jam. Programming contests introduced him to Computer Science, to Google, and to his wife.

Posted by Scott Knaster, Editor