An Average Hour In the Life of an Integration Developer

Recently while migrating data into a new Magento build I came across a most familiar problem with most database data dumps: There’s more data then what the client wants on the website! The client provided a spreadsheet in CSV (comma seperated values) format of all the SKUs with S2 or U2 code.  So now I have a query returning all the products for the company, and a spreadsheet with only the products I should import.

Which approach should I take?  I can load the spreadsheet into a collection import run time, and check each SKU individually.  It’s an easy and quick solution, there would be an added 20 to 30 seconds of processing time, which isn’t that much for a one time process, but this adds more logic to the product import that I need to take care of and manage in case of changes.  Not a bad solution, but I can do better.  Exporting the CSV to a table in the SQL database given to us by the client is another approach.  A simple join and where clause automatically filters out all the products we don’t want.  Our product import works with a smaller collection over the entire product import, and there’s no code to manage.  A perfect solution.  I began happily planning out my strategy when I came across an unfortunate realization…

This client in particular uses all configurable products.  What is a configurable a product? Think a clothing website, you click on a shirt, and you choose a size for the shirt.  Each size is one product, and a configurable product is a container for those sizes.  In layman’s terms it groups together products that should be sold on one product page.

The data given in the database includes the SKUs for these configurable products, not a necessary thing, but it ensures the client gets to choose what the SKUs are for their configurable products.  It’s also a bit easier for us as we have a very easy way to group multiple products onto one product page.  Normally we’re given just the base products and rules to group them by, when that happens we auto-generate the SKUs for the configurable products. The unfortunate thing about this is that the spreadsheet CSV with the upload rules didn’t include the configurable SKUs!

So now I have to decide whether to write a complex SQL query just to handle this situation (showing the configurable SKUs, but only if child SKUs show), or go back to my first idea of loading the CSV into a collection.  I love me some SQL queries, but this situation will cause headaches, and even if I did get it working, I wouldn’t be convinced it’s 100% accurate.  So importing the data to a new SQL table is now out of the question.
At this point you might be wondering what the difference between doing a sql query and checking against a collection is.

With a SQL query, the data returned is all or nothing.  You execute a query, you get your data returned, and that’s the end of that.  This is always the starting point of a product import.  It’s easy to manipulate data with simple rules, but when you get into more complex queries it’s easy for things to go wrong. With a collection, we can loop through each product individually and throw rules at our products one by one, at any point in the product import.

In this situation, given the options and limitations I have, the clear solution is to load our CSV into a collection, and weed out the products we don’t want at a point in the product import.  It’s not a bad solution, but I’m always searching for the most elegant and efficient solution.  Unfortunately in this case, the clear winning solution is the one that takes a bit of code management and an extra 30 seconds of overhead.