mediahound
Dabbler
- Joined
- Mar 11, 2013
- Messages
- 15
Howdy, hopefully my title got some attention.
I'm here because of ZFS. : P
I currently have, or am involved in, several projects that are all driving me toward wanting to use FreeNAS and ZFS. But I'm more of a media librarian than a computer expert, so the whole process has so far been postponed: no idea what's involved, not wanting to become a sysadmin, not understanding Unix, worries about putting data into something I can't ever get it back OUT of again, and similar. The three main projects are:
1) Personal Media Library (sure, I collect data like lots of people here probably do, though primarily educational content like college lecture video recordings, and I want it for a home media server), but so far this is tolerable on Windows. It doesn't have to migrate over, but there would be advantages if FreeNAS works better.
2) Business Project (low-budget moviemaking and computer graphics, which will suck up terabyte after terabyte). For this one, performance actually does matter, but it can be a totally separate system.
3) Historical Archiving - this has the potential to grow quite a bit. For instance, there are thousands of foreign-language medical videotapes and thousands of books and journals from China that I'm hoping to slowly scan in, just by myself over the years, for digital preservation before they degrade any further. That project might take a few years by itself, since I'm not doing this for profit or full time, just for preservation. If I finish mine and get access to others' libraries, I hope to scan in theirs as well - future scanning/digitizing projects will be archives from Russia and Georgia - but I'm so overwhelmed with what I already have that I stopped even asking who else has material to put in the queue. For those curious, this is all part of an alternative medicine research project: the material from China is about Chinese medicine, and the material from Russia covers alternative medicine they have/practice there. I might also mirror other people's archives on other topics (which would grow the database massively, but which would generally be other educational material on everything imaginable, similar to the 'CD of the 3rd world' project or the Appropriate Technology Libraries but with video instead of text), but those two are the main collections I'm trying to archive and preserve as possibly within my grasp. If I work out "a system" for easily digitizing videotapes and physically scanning books/journals, and for reliably archiving, verifying integrity, and stopping further degradation of that material, I'm hoping to set up people in foreign countries as well, some with little computer knowledge, to help with that. They could just be mailed extra hard drives when needed and periodically mail a full one back when they have new material to share, since they're in places where internet is slow.
The whole archive has to be mirrored in several places, and organized so that 'chunks' can be easily distributed to those most in need - i.e. prioritizing the most worthy 4TB (or whatever the largest single drive size is) to mail to someone with a laptop in Guatemala, for instance. Since they won't have superfast internet to access it all online, it has to be mostly self-contained.
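The "chunks" idea above is basically bin packing: take a priority-sorted catalog and fill drive-sized bins. A minimal sketch of a greedy first-fit version, where the catalog entries, sizes, and the 4TB capacity are all invented examples:

```python
# Sketch: pack a priority-ordered file list into drive-sized "chunks" so each
# mailed drive is self-contained. The catalog entries and the 4 TB capacity
# are hypothetical examples, not from any real archive.

DRIVE_CAPACITY_GB = 4000  # e.g. one 4 TB drive per mailed chunk

def pack_chunks(files, capacity=DRIVE_CAPACITY_GB):
    """Greedy first-fit: files come sorted by priority (most worthy first);
    each chunk is a list of (name, size_gb) pairs that fits on one drive."""
    chunks = []
    for name, size_gb in files:
        for chunk in chunks:
            # place the file in the first existing chunk with room
            if sum(s for _, s in chunk) + size_gb <= capacity:
                chunk.append((name, size_gb))
                break
        else:
            # no chunk has room: start a new drive
            chunks.append([(name, size_gb)])
    return chunks

# Example: priority-sorted catalog entries (names/sizes made up)
catalog = [
    ("china_med_tapes_batch1", 2500),
    ("china_journals_scans", 1800),
    ("russia_tapes_batch1", 2200),
    ("misc_lectures", 900),
]
chunks = pack_chunks(catalog)
# -> 2 drives, each at or under 4000 GB
```

First-fit isn't optimal packing, but it preserves the priority ordering: the most worthy material always lands on the earliest drive with room.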
What I have so far is a snakes' nest of drives hooked up to a Windows PC until I ran out of drive letters. A couple of hard drive crashes and lost data later, I'm leaving certain things offlined entirely until I work out a better system. :- / The nightmare of sysadminning and maintenance - moving files from drive 12 to drive 7 and such, trying to organize around limited drive space or a dying, crashing drive - is already the #1 impediment to anything else getting done. It's become apparent that growing the archive is pointless when I'm already losing data and having to re-scan some material for the third time, despite having backups, because a few bits in the middle of a large file corrupted on both drives and it still needed to be done. I've created MD5s, which helps, but creating PARchive files is so incredibly labor intensive that it can take WEEKS to make parity files for a single drive, so I can't even keep up with that anymore.
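For reference, the manual checksum workflow described above amounts to something like the following manifest-based check (this is exactly the per-file hashing-and-reverifying that ZFS does transparently, block by block, during a scrub). The directory and file names are illustrative only:

```python
# Sketch of a manifest-based integrity check: walk a tree, record one
# SHA-256 per file, and later re-verify. This is the manual equivalent of
# what ZFS checksumming automates. Paths below are made-up examples.
import hashlib
import json
import os
import tempfile

def build_manifest(root):
    """Map each relative path under root to its SHA-256 hex digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for fn in files:
            full = os.path.join(dirpath, fn)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                # hash in 1 MiB blocks so huge video files don't fill RAM
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            manifest[os.path.relpath(full, root)] = h.hexdigest()
    return manifest

def verify(root, manifest):
    """Return the paths whose current hash no longer matches the manifest."""
    current = build_manifest(root)
    return sorted(p for p, digest in manifest.items()
                  if current.get(p) != digest)

# Demo on a throwaway directory:
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "tape1.mkv"), "wb") as f:
    f.write(b"video data")
m = build_manifest(tmp)
ok = verify(tmp, m)    # -> [] (nothing changed)
with open(os.path.join(tmp, "tape1.mkv"), "wb") as f:
    f.write(b"video dat#")   # simulate a few flipped bits
bad = verify(tmp, m)   # -> ["tape1.mkv"]
```

Note this only *detects* corruption, like MD5 lists do; repairing it needs redundancy (PAR files, a second copy, or ZFS redundancy), which is why a scrub on a redundant pool replaces both halves of the manual workflow.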
So at the moment I'm seeking advice on how to proceed, including migrating data and getting used to things.
Because of the thousands of hours of work that will go into either project 2 OR 3, backups are obviously waaaay important. From what I've researched, this may or may not be within ZFS's paradigm: the easy migration of data to offsite storage, and likewise the integration/importing of data. What I'd ideally want is a set of geographically distributed computers, which could add data either by importing from a drive (for when bandwidth to/from the country is still expensive) or by realtime FTP-style syncing (for when bandwidth is cheap and uncapped). ZFS seems to be the king of preventing, detecting, and repairing data corruption within a single computer, but I definitely need solutions beyond one single-site computer, since that can be wiped out by one bad power surge, one flood, or one house fire. Most syncs will be by sneakernet, mailing 4TB drives around, because neither I nor the people who may be willing to help have uncapped internet so far; but for certain very high-effort, important, or time-critical things, we might want a priority/"push" directory to force a sync/mirror right away.
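One way to think about the sneakernet sync: each site keeps a manifest (path -> checksum), and the payload for the next mailed drive is just the set of files that are new or changed relative to the remote site's last-known manifest. A minimal sketch, with entirely made-up manifest entries:

```python
# Sketch of "sync by mailed drive": diff the local manifest against the
# remote site's last-known one to decide what the next drive must carry.
# Manifests are plain dicts of path -> checksum; entries are invented.

def drive_payload(local, remote):
    """Files that are new or changed locally since the remote's manifest."""
    return sorted(p for p, digest in local.items()
                  if remote.get(p) != digest)

local_manifest = {
    "china/tape001.mkv": "aaa1",
    "china/tape002.mkv": "bbb2",   # re-digitized, so the checksum changed
    "russia/tape001.mkv": "ccc3",  # brand new file
}
remote_manifest = {
    "china/tape001.mkv": "aaa1",
    "china/tape002.mkv": "old9",
}
to_mail = drive_payload(local_manifest, remote_manifest)
# to_mail -> ["china/tape002.mkv", "russia/tape001.mkv"]
```

A "push" directory is then just a second payload list that gets sent over the network immediately instead of waiting for the next mailed drive; the diff logic is identical either way.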
If anyone has even the faintest clue how to do the above, please share, because I don't. : P But that's what the need is, and it might even break FreeNAS compatibility or require some radical workaround that makes something else easier to use. Cluster-level integrity (if that's the right term) matters more to me than the individual-machine survival that ZFS seems to excel at.
Budget is very important! :( Enterprise-level solutions without an enterprise-level budget. Especially for the historical archive: I'm a college student trying to prevent the loss of things some people don't seem to care much about. If I can save $600 by not buying a RAID card, that lets me buy 15TB more of storage right now. As close to 100% of the budget as possible needs to go to hard drives, due to the amount of data and the level of redundancy sought, both of which will grow over time. Expandability is important too: if I hit some hard limit on what one low-budget server can contain, how do I expand beyond that while still having it treated like a monolithic data set - simply add a second server with more drives? There has to be some known, easy-to-follow strategy so that no matter how fast the data grows I'll know what to do, and some kind of low-maintenance "I don't want to be a sysadmin" solution, so that if it just tells me to replace hard drive 4BB_X on rack 2, I can do it without worrying that much. My job should mostly be to upload data into the array, add drives, replace drives, or replace whole motherboard/server units if a system starts failing or showing some kind of problem that's taking down one node of a cluster - and not have to worry about things like whether some virus corrupted files, or whether I accidentally deleted that subdirectory at 3am in sleep deprivation, or whatever. Much of the data probably won't even be sorted or translated for 10 years, but it has to survive undamaged until things like machine-assisted translation make the process a lot easier, including what I assume will then be easy: convert a whole video to English, keyword-mark it, and make it searchable by time code.
Power use may become an issue. Right now most drives sit idle despite being on and connected. ZFS's desire to scrub the whole archive every week, spinning up 24 drives, would probably throw the breaker though. : P Does it have any 'reduced power usage' modes or ways to structure the archive? I mean, if I'm accessing one file, is it only accessing the drive that file is on, and not spinning multiple drives up and down just for that one file? Having powered-down archives, either drives or backup servers, is possible - like something that turns on, mirrors everything, then turns off every two weeks or so (in between, a backup drive writing changed-file deltas, which then starts over). Actually, the ideal may be to somehow compartmentalize lesser-used data: once it's uploaded, verified, and processed into some desired final state (which may take a while - a project may be open for months before I get a chance to do that), stick it on a 3-4TB drive at a time, mirror that to a backup, and shut both down - periodically rechecked to verify their condition, but generally not accessed much if at all, just shelved and kept. Yet we'd want to know exactly where that data is, what drive it's on, and when it was last checked, and have a separate backup of parity/restore-type data should anything have been corrupted on that drive in the meanwhile, even if both drives lost a few bits, etc. etc... if that makes sense. Total power use of everything IS an issue for a 24/7 file server with this many spindles. I'm already eating ramen and freegan too much just to afford more hard drives as it is.
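The "shelved drive" bookkeeping described above (where each chunk lives, when it was last verified, which drives are overdue for a recheck) can be as simple as a small catalog. A sketch, with the drive labels, contents, and dates all invented for illustration:

```python
# Sketch of a shelf catalog for powered-down archive drives: track where
# each chunk lives and when it was last integrity-checked, and flag drives
# that are overdue for a recheck. All entries below are made up.
from datetime import date, timedelta

shelf = [
    {"label": "ARC-001", "contents": "china tapes 1-300",
     "last_checked": date(2013, 1, 5)},
    {"label": "ARC-002", "contents": "china journals scans",
     "last_checked": date(2013, 3, 1)},
]

def overdue(catalog, today, max_age_days=90):
    """Labels of drives whose last verification is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [d["label"] for d in catalog if d["last_checked"] < cutoff]

stale = overdue(shelf, today=date(2013, 4, 15))
# stale -> ["ARC-001"]  (checked 100 days ago; ARC-002 is only 45 days old)
```

The point of keeping this separate from the drives themselves is that you can answer "what's on the shelf and what needs checking" without spinning anything up.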
Total ultimate size of the archive, with others helping, could well reach a petabyte or more; it will at least reach 200TB. Any business/movie projects would probably reach 100TB without too much difficulty. As for whether there's any advantage to merging all projects into one server (personal/business/charitable), or whether it makes more sense to keep them separate, I'm open to suggestions too. Just one server is more convenient and uses less power, though; having the ability to split that later, or merge it back, would be another plus.
Feel free to comment on any part of the above. I've read through a few getting-started guides and similar, but I'm already running into issues - for instance, there aren't even motherboards with enough RAM to put all the drives on just one FreeNAS array if they want 1GB of RAM per 1TB of drive storage, and the additional cost of those that could almost makes it cheaper to get two more consumer-level motherboards: maybe a primary machine (with fairly low data redundancy, equal to RAID6/two parity drives at most) and then a backup computer which is usually powered down, or which is scheduled to turn on, mirror, and turn off. Everything is still open for discussion, including going beyond the single-ZFS-running-machine level.
I particularly like using USB drives, because I don't have to worry about special server-class hard drive chassis, special hot-swap ability in the SATA connectors, expanding the power supplies, exceeding the designed drive count (i.e. 16, 20, 24 drives) of the whole system, and similar. Also the ease of just pulling a bad drive and replacing it without a screwdriver, or just having all the drives sitting on a bookshelf with a big box fan blowing at them. If I'm low on space and all I have to do is stick a 4-way or 7-way hub off one of the ports to keep adding drives until I can afford a second server, instead of hitting a wall of what a pair of RAID cards can do off the PCIe slots, that's a plus as well, since the performance needs are not excessive right now. I don't know if there's any ZFS provision for this (not from what I've read yet - maybe there should be?), but the ability to have a user connect or power up a given USB drive when told (mounted at a fixed position), to copy files handily off to any other connected drive, could well make sense. Something where the drive is still USB-connected, but the power to it stays off unless the system signals you to turn on that external drive so it can grab the files it needs. (This would normally be a single-user system; I'm aware that couldn't possibly work in any other scenario.)
So can anyone help me understand which parts of the above FreeNAS/ZFS will DO, which things I have to plan around (at least the difficulties and limitations, like needing so much RAM), and where I will still need to learn about other software? I want to get an idea of the whole picture before I even start shopping for hardware, because this is being driven by the need for so many drives before anything else.