I’ve done a number of talks lately on Worldmaps and typically in side conversations/emails, people are curious about the databases and converting IP addresses to geographic locations. And, often when you dive into using the data, it seems there are a number of performance considerations and I thought I’d share my input on these topics. First up, the data. Worldmaps uses two databases for IP resolution. The primary/production database is Ip2Location. I’ve found this database to be very accurate. For development/demo purposes, I use IPinfoDB. I haven’t had too much time to play with this database yet, but so far seems accurate also. The latter is free, whereas Ip2Location is not. In either case, the schema is nearly identical: The BeginIp and EndIp columns are a clustered primary key. In the case of IPinfoDB, there is no EndIp field (and it’s not really needed). When performing a resolution, a string IP address is converted into a 64 bit integer and then used in searching the table. That’s why having a clustered key on the BeginIp (and optionally EndIp) is crucial to performance. But it doesn’t stop there. The examples posted in the database’s respective home pages are accurate and simple, but need to be refactored for performance. For example, to do a simple resolution on Ip2Location, according to their docs:
SELECT * FROM dbo.Ip2Location WHERE @IpNum BETWEEN BeginIp and EndIp
And for IPInfoDB:
SELECT TOP 1 * FROM IPInfoDB where BeginIp <= @IpNum ORDER BY BeginIp DESC
Both of these methods are perfectly fine, particularly for use as generic samples. The second one is on the right track, but it doesn’t work for joins so if you’re querying over a range, you’d need to refactor. And in the first example, using a BETWEEN operator forces a clustered index scan when joining, killing the performance. If we run the first example across my minified Ip2LocationSmall table, we’ll see something like this (and this is running off of SQL Azure – the perf is pretty great compared to localhost!): We can also look at the time: Ouch! Now, it doesn’t seem too bad, but imagine doing thousands of these requests per minute, or doing large joins. The goal then is to provide some hints that will optimize the query, particularly for joins. Our indexes are correct, so we can rework the query to get rid of the BETWEEN operator – we can sacrifice a little readability and do something like:
SELECT * FROM ( select ( select MAX(beginip) from dbo.Ip2LocationSmall where BeginIp <= @IpNum ) as IP_Begin ) as foo INNER JOIN dbo.Ip2LocationSmall iploc ON iploc.BeginIp = foo.IP_Begin
The result: And the time shows some improvement: But the REAL benefit comes when we need to join. Suppose I’d like to get a list of the countries for a given map (which is a parameter called MapId):
SELECT DISTINCT(ip.CountryCode) FROM MapHits hits INNER JOIN dbo.Ip2LocationSmall ip ON hits.IpNum BETWEEN ip.BeginIp AND ip.EndIp WHERE MapId = @MapId
The query returns 95 rows, and executes in 16 seconds: In this case, we can refactor this using the method above to something like:
SELECT DISTINCT(CountryCode) FROM ( select IpNum, ( select MAX(beginip) from Ip2LocationSmall where BeginIp <= IpNum ) as IP_Begin from dbo.MapHits as hits where MapId = @MapId ) as foo INNER JOIN Ip2LocationSmall iploc ON iploc.BeginIp = foo.IP_Begin
Again, not as pretty looking, but boy what a difference: We went from 16,500 milliseconds to 260 – over 60x the performance! Mike @AngryPets would be proud. The reason for the perf gain is we were able to eliminate the nested loop, which is (in this case) scanning the entire clustered index for the matching rows. The second benefit is the ability to switch schemas easily between IP2Location and IPinfoDB, and we can additionally lose the EndIp column which trims the size of the table.
If you’ve been playing with SQL Azure, you might have run into this error when opening a connection to the database: A transport-level error has occurred when sending the request to the server. (provider: TCP Provider, error: 0 - An established connection was aborted by the software in your host machine.) Ick! Anything at that transport level surely can’t be a problem with my code, can it? :) The goods news is that you aren’t going crazy and the problem isn’t with your code. There’s a difference in the way SQL Azure manages its connections compared to .NET connection pooling, and the result is that the connection pool is giving out connections that have already been closed. When you try to do something with that connection, you get this error. It’s sporadic in that only happens when you get an idle connection from the pool that has already been dropped by SQL Azure. There are a couple of workarounds until a fix is implemented by Microsoft (I’ve been told it’s coming soon). One method is to retry the connection (I hate this one, but it’s a viable option nonetheless). It’s just messy and I’d have to do this in a few dozen places. The amusing fix is to have a thread that continually pings the database, keeping every connection alive. The best fix that I’ve found to date is to simply turn off connection pooling temporarily by adding a pooling=false option to the connection string: I tested this on my webrole, leaving my workerroles as-is, and the webrole has been running for a week or two without a single error, whereas the workerrole (without disabling pooling) will get a couple errors every day. I haven’t done any performance tests but UA testing (which is me) sees no appreciable hit, so I’ll go with this option until the permanent fix is deployed.
If you have an interest in SQL 2008, be sure to check out this event happening tomorrow in Charlotte:
Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4
SQL Server 2008: Improvements in T-SQL
SQL Server 2008: Accelerating SSAS Design and Performance
SQL Server 2008: Policy-Based Management and Multi-Server Administration
SQL Server 2008: Enabling Consolidation for the Enterprise
Creating More Flexible and Richer Reports in SSRS 2008
This event is being sponsored by Microsoft and the Charlotte and Greenville chapters of PASS. Presenters will include, but not limited to SQL Server MVPs, giving you a great opportunity to learn more about SQL Server from the experts.
Room Mt. Kilimanjaro
Room Mt. Everest
Lunch and opening remarks by Peter Shire, Charlotte SQL Server User Group President, and Wayne Snyder PASS President
SQL Server 2008: Enabling Consolidation for the Enterprise.Rick Heiges (SQL Server MVP). Scalability Experts.
Introduction to Dimensional Modeling. Wayne Snyder (SQL Server MVP), Rafael Salas (SQL Server MVP). Mariner
T-SQL Improvements And Data Types.Alejandro Mesa (SQL Server MVP). Bank Of America
Accelerating SSAS Design and Performance. David Botzenhart. Mariner.
SQL Server 2008: Policy-Based Management and Auditing. Paul Waters. Ameco
Creating More Flexible and Richer Reports.John Welch (SQL Server MVP). Mariner
Prizes and networking