Mango
web log storage and analysis
August 09, 2011
  • Daniel Einspanjer - deinspanjer
  • Anurag Phadke - aphadke
  • metrics@mozilla.com
  • #metrics on irc.mozilla.org

Data flow / architecture

  • Where does the data come from? (Extract)
  • What does it look like before and after? (Transform)
  • Where does it end up? (Load)

Where does the data come from?


What does it look like before and after?

  • ip address
  • site name
  • authenticated user name
  • date and time of request
  • HTTP request method, e.g. GET, POST, HEAD
  • Request URL
  • HTTP request version, e.g. HTTP/1.1
  • HTTP response code, e.g. 200, 302, 404, 500
  • Response size in bytes
  • Referrer URL -- the page containing a link the user clicked to get to this page
  • User Agent string e.g. "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0"
  • Cookies
127.0.0.1 addons.mozilla.org - [08/Aug/2011:00:00:04 -0700] "GET /blocklist/3/.../5.0/Firefox/.../ HTTP/1.1" 200 11676 "-" "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0" "-"
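
As a rough illustration of the extract step, the sample line above can be split into the listed fields with a regular expression. This is a minimal sketch, not the parser Mango actually uses: the pattern is inferred from the single sample line, and the names LOG_PATTERN and parse_log_line are hypothetical.

```python
import re

# Assumption: field order and quoting are inferred from the one sample line
# on the slide; real logs may contain variants this pattern does not cover.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<site>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<http_version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)" "(?P<cookies>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric fields; a "-" size is treated as zero bytes.
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


SAMPLE = (
    '127.0.0.1 addons.mozilla.org - [08/Aug/2011:00:00:04 -0700] '
    '"GET /blocklist/3/.../5.0/Firefox/.../ HTTP/1.1" 200 11676 "-" '
    '"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0" "-"'
)

if __name__ == "__main__":
    for name, value in parse_log_line(SAMPLE).items():
        print(f"{name}: {value}")
```

Lines that do not match return None, so a caller can count or set aside malformed input instead of failing the whole run.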

What does it look like before and after?

Partitioned by: site name / UTC date / UTC hour / Processing run time

Anonymized
  • data center / hostname
  • web log filename / line number in file
  • UTC and client local request timestamp
  • Country code of request
  • GeoHash location of request i.e. rough latitude/longitude
  • Domain name of IP if known
  • Organisation name of IP owner if known
  • ISP name of IP if known
  • HTTP request method
  • Request URL
  • HTTP request version
  • HTTP Response Code
  • Response size in bytes
  • Referrer URL
  • Browser category / name / version
  • User agent locale (if given)
  • User agent platform
  • User agent OS category / name
  • is IP address in blacklist i.e. a Mozilla proxy or internal host

Non-Anonymized
  • data center / hostname
  • web log filename / line number in file
  • ip address
  • authenticated user name
  • User Agent string
  • Cookies
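
Two of the transform steps above can be sketched in a few lines: deriving the site / UTC date / UTC hour / run partition from a request timestamp, and flagging blacklisted IP addresses. This is only an illustration under stated assumptions. The path layout, the run identifier format, and the networks in BLACKLISTED_NETWORKS are all hypothetical; the slides do not give the real values.

```python
from datetime import datetime, timezone
import ipaddress

# Hypothetical blacklist: the real list of Mozilla proxies and internal
# hosts is not given in the slides.
BLACKLISTED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("127.0.0.0/8"),
]


def to_utc(timestamp):
    """Parse a web log timestamp such as '08/Aug/2011:00:00:04 -0700' into UTC."""
    local = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")
    return local.astimezone(timezone.utc)


def partition_path(site, timestamp, run_id):
    """Build a site/UTC-date/UTC-hour/run path (the layout is an assumption)."""
    utc = to_utc(timestamp)
    return f"{site}/{utc:%Y-%m-%d}/{utc:%H}/{run_id}"


def is_blacklisted(ip):
    """Return True if the address falls inside any blacklisted network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLACKLISTED_NETWORKS)


if __name__ == "__main__":
    print(partition_path("addons.mozilla.org", "08/Aug/2011:00:00:04 -0700", "run-001"))
    print(is_blacklisted("127.0.0.1"))
```

Partitioning on the UTC timestamp rather than the local one keeps requests from data centers in different time zones in the same hourly bucket.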



Where does it end up?

  • Hadoop cluster of ~60 machines
  • Stored by site, date, and hour
  • Queryable via Hive, a SQL-like query engine
  • The Non-Anonymized data is deleted according to our data retention policy
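
Because the data is stored by date, a retention policy can be applied one partition at a time. The sketch below only picks out the date partitions that have aged past a retention window. The function name and the 90-day window in the usage example are hypothetical; the slides do not state the actual retention period or how the deletion is carried out.

```python
from datetime import date, timedelta


def expired_partitions(partition_dates, today, retention_days):
    """Return the partition dates strictly older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


if __name__ == "__main__":
    partitions = [date(2011, 8, 1), date(2011, 5, 1), date(2011, 2, 1)]
    # 90 days is an illustrative value, not Mozilla's actual policy.
    for expired in expired_partitions(partitions, date(2011, 8, 9), 90):
        print("would delete non-anonymized partition", expired.isoformat())
```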


What's the average distance travelled by files outside the US?

3802 miles
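
The slides do not say how this figure was computed. One plausible approach, sketched here purely as an assumption, is to take the great-circle (haversine) distance between the serving data center and the rough client location for each request, then average the results. The function names and the coordinates in the usage example are illustrative only.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def average_distance_miles(pairs):
    """Average distance over ((lat, lon), (lat, lon)) origin/destination pairs."""
    distances = [haversine_miles(*origin, *destination)
                 for origin, destination in pairs]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Illustrative coordinates only, not real request data.
    requests = [((37.4, -122.1), (48.9, 2.4)), ((37.4, -122.1), (35.7, 139.7))]
    print(round(average_distance_miles(requests)), "miles")
```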



How many log lines do we currently process per day?

985,939,023


What is Hive?

  • An interface that lets users run SQL-like queries on data stored in HDFS
  • Simple SQL support, e.g. SELECT * FROM weblogs LIMIT 20;

Total number of DNT (DoNotTrack) users

      SELECT ds, dnt_type, COUNT(DISTINCT ip_address)
      FROM aus_logs
      WHERE (request_url LIKE '%Firefox/4.0%' OR request_url LIKE '%Firefox/5.0%')
        AND dnt_type != 'DNT:1, 1'
        AND ds > '2011-04-17'
      GROUP BY ds, dnt_type
      ORDER BY ds DESC;
      

DNT User Adoption


Out of date URLs for Mozilla.com (bug: 675687)

-- LIKE without wildcards only matches the exact string, so each pattern is
-- wrapped in % to match URLs with a leading slash or trailing path.
SELECT COUNT(1), request_url
FROM research_logs
WHERE domain = 'www.mozilla.com'
AND (request_url LIKE '%en-US/launch%'
          OR request_url LIKE '%en-US/firefox/updated%'
          OR request_url LIKE '%en-US/firefox/loring%'
          OR request_url LIKE '%en-US/firefox/pganonffx%'
          OR request_url LIKE '%en-US/firefox/pgbffx%'
          OR request_url LIKE '%en-US/firefox/pgcshort%')
AND ds > '2011-07-20'
GROUP BY request_url;
      
Count   URL
43023   /firefox/switch.html
 6698   /firefox/fastest/
 6470   /en-US/firefox/updated
 1915   /en-US/add-ons/campus/
 1637   /firefox/addons
 1203   /en-US/press/images.html
 1100   /en-US/add-ons/kodak/
 1055   /en-US/firefox/updated/

Startup times for Firefox browser (courtesy fligtar) - AMO ping data


Some CSP reports are sent as 404 pages

SELECT COUNT(1) AS c, request_url, referrer_url
FROM research_logs
WHERE ds = '2011-05-23'
AND request_url LIKE '%csp%'
AND request_code = 404
-- every non-aggregated column in the SELECT list must appear in GROUP BY
GROUP BY request_url, referrer_url
ORDER BY c DESC;
This turned out to be a CRITICAL cross-site scripting bug (bug id: 664151). Thanks, Anthony Ricaud.

Thunderbird Data request (Bug: 669701)

  • Why are users still on an old Thunderbird version?
  • Need table dump with request_url and anonymized IP to do further analysis
INSERT OVERWRITE DIRECTORY 'temp/'
SELECT *
FROM aus_logs
WHERE (domain = 'aus2.mozilla.org'
       OR domain = 'aus3.mozilla.org')
AND ds >= '2011-07-20'
AND ds <= '2011-07-28'
AND LOWER(request_url) LIKE '%thunderbird/2%';

Future enhancements

  • Streamline the data flow so that remote webheads insert data directly into HDFS instead of staging it on the im-log02 NFS share
  • Provide Hue as a web interface to allow Mozilla employees to query directly
  • Provide reports / exports for commonly requested data