Inside the Xbox 360


This is a reposted article; the translation is by sword and spear Blue.
While I am at it, a small survey: would you rather I post news with no emotion at all, completely deadpan, or with a little personal feeling? Thanks!

The first part starts from the three-core Xenon and introduces the Xbox's procedural synthesis technology (with a related patent), which is being applied for the first time on the Xbox 360. Put briefly, procedural synthesis starts from statically stored high-level scene data and generates the low-level geometry data dynamically, making the best use of system bandwidth and main memory. The first part also explains how real-time tessellation, real-time skinning and the three-core Xenon help make procedural synthesis possible. As for Xenon, let's take a look:
        
        (This is only a schematic illustration.)

        The Xenon inside the Xbox 360 has three identical PowerPC cores that share a 1 MB L2 cache. The author calls each core a PowerPC Processing Element (PPE). Each PPE has its own 64 KB L1 cache, split evenly between instructions and data. Through simultaneous multithreading (SMT), each PPE can process at most 2 threads at the same time, which makes 6 for the whole Xenon. Games that use procedural rendering will be able to profit from this. According to the Microsoft patent, a game that uses the technique above has two main parts:
        Host thread: contains the game's main thread of execution. It manages the higher-level parts of the 3D engine and controls the data generation thread described next.
        Data generation thread: carries out the actual procedural synthesis of the object geometry data. Its output is the vertex list for each kind of object in the game, which is handed to the GPU for processing.
        These two threads can run on the same PPE or on two different PPEs. Besides these two, a game can also have other independent threads that handle work such as AI and I/O.
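
The two-thread split described above (one thread running the game, one doing the procedural synthesis) can be sketched roughly as follows. This is only a minimal illustration in Python, not code from the patent or from any Xbox 360 SDK: the scene format, the ring-shaped objects and every function name are assumptions made up for the example.

```python
import math
import queue
import threading

# Hypothetical high-level scene data: each object is only a name, a radius
# and a segment count, far smaller than the vertex list it expands into.
SCENE = [("small_ring", 1.0, 8), ("large_ring", 2.0, 16)]


def synthesize_ring(radius, segments):
    """Expand a compact description into a low-level vertex list."""
    step = 2.0 * math.pi / segments
    return [(radius * math.cos(i * step), radius * math.sin(i * step), 0.0)
            for i in range(segments)]


def data_generation_thread(work, output):
    """Does the actual procedural synthesis; stops on a None sentinel."""
    while True:
        item = work.get()
        if item is None:
            break
        name, radius, segments = item
        output.put((name, synthesize_ring(radius, segments)))


def host_thread(scene):
    """Hands out work and collects the vertex lists meant for the GPU."""
    work, output = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=data_generation_thread,
                              args=(work, output))
    worker.start()
    for obj in scene:
        work.put(obj)
    work.put(None)   # tell the worker there is no more work
    worker.join()    # after this, every result is already in `output`
    vertex_lists = {}
    while not output.empty():
        name, vertices = output.get()
        vertex_lists[name] = vertices
    return vertex_lists


if __name__ == "__main__":
    for name, vertices in host_thread(SCENE).items():
        print(name, len(vertices), "vertices")
```

The point of the sketch is the data flow: only the small description is stored, and the bulky vertex data exists only after the generation thread has produced it.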
        Next comes the cache. Xenon gives the programmer more ability to control the cache.
       
        After the overview in the first part, the second part describes Xenon's design and implementation in detail. It first covers the difficulties that come with the traditional way of adding execution cores, and then the Xenon design:
        Cell and Xenon share a basic idea: remove the hardware that tries to optimize instruction scheduling at run time, and reduce the complexity of the execution core. Cell and Xenon have abandoned the instruction window and no longer emphasize instruction-level parallelism. Instructions pass through the CPU strictly in the order in which they were fetched, although adjacent instructions that do not depend on each other can still be executed in parallel.
        This kind of static execution looks very much like the approach used by obsolete old-timers such as the Pentium. Static execution is easy to implement and takes up less die area. With the instruction window and its related hardware gone, the space saved can hold more actual execution units.
        Of course, you cannot simply kick out the instruction window, swap in more execution units and call it a day. Something has to compensate for the tradeoff that removing the instruction window brings. The whole point of the instruction window is to raise instruction-level parallelism when there are many execution units, so a chip with many execution units and no instruction window has to rethink how the CPU is organized.
       
        Xenon no longer relies on an instruction window to find ILP at run time. Instead it has the programmer arrange the instruction stream at compile time, so that the code itself contains a high degree of thread-level parallelism (TLP). Xbox 360 programmers will have to work harder, and more is asked of them. The large number of execution units is organized into 3 cores, each containing a relatively small number of execution units of its own, and the programmer splits the instruction stream into multiple parallel threads that run on these independent cores. The end result is that the ILP of each thread may be fairly low, but taken together all the execution units in the 3 PPEs are kept busy, so efficiency is quite good. The TLP strategy is said to be effective for thread-parallel tasks such as procedural synthesis. For tasks that are inherently single-threaded, however, performance is not as good as on earlier architectures with many execution units and a huge instruction window. Three kinds of game-related tasks are likely to suffer from a chip like Xenon that lacks out-of-order execution: game control, AI and physics.
        The details of the Xenon PowerPC Processing Element are:
        L1 cache: 32K instructions/32K data
        Two-issue superscalar execution
        In-order execution
        Two-way simultaneous multithreading
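
What "two-issue" plus "in-order" means for code can be shown with a toy model. The sketch below is an illustration only and not a model of the real PPE: it assumes every instruction takes one cycle, ignores which execution unit an instruction needs, and only forbids an instruction from issuing in the same cycle as an earlier instruction whose result it reads. The register names and the four-instruction programs are made up.

```python
def in_order_cycles(program, width=2):
    """Count issue cycles for a toy in-order machine.

    program: list of (dest, sources) tuples in program order.
    Nothing is ever reordered; at most `width` instructions issue per cycle.
    """
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        written = set()   # registers written so far in this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, sources = program[i]
            if written & set(sources):
                break     # depends on this cycle's result: wait a cycle
            written.add(dest)
            issued += 1
            i += 1
    return cycles


# Four independent instructions pair up perfectly.
independent = [("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r0"]), ("r4", ["r0"])]
# A dependency chain cannot pair at all.
chain = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"]), ("r4", ["r3"])]
# Two short chains, written one after the other...
unscheduled = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r0"]), ("r4", ["r3"])]
# ...and the same work interleaved by the programmer or compiler.
scheduled = [("r1", ["r0"]), ("r3", ["r0"]), ("r2", ["r1"]), ("r4", ["r3"])]

if __name__ == "__main__":
    for name, prog in [("independent", independent), ("chain", chain),
                       ("unscheduled", unscheduled), ("scheduled", scheduled)]:
        print(name, in_order_cycles(prog), "cycles")
```

In this model the unscheduled order takes 3 cycles and the interleaved order takes 2. An instruction window would find that interleaving at run time; without one, the ordering has to be right in the code itself.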
       
        PPE pipeline
        
        
        About the PPE execution core: strictly speaking, the number and nature of the execution units that make up the PPE's execution core have not been officially disclosed. The inferred PPE execution units are:
        1 integer unit
        1 floating-point unit
        1 branch unit
        1 load-store unit
        2 VMX-128 units
        The most conspicuous and most puzzling part is Xenon's vector execution unit (or SIMD execution unit), which will do most of the procedural synthesis work mentioned earlier.
        Xenon's VMX provides 128 registers, each 128 bits wide. To be honest, for people like us who mostly live in 32 bits and sometimes drop back to 16 or even 8, this feature is very, very bleeding-edge. This is also where the name VMX-128 comes from.
        Each thread can use all 128 vector registers. At 128 registers * 2 threads, that makes 256 physical vector registers on the die.
        It is not yet clear how the 32-bit PPC instruction format can accommodate three operands that each have to address one of 128 registers. Whatever hack IBM used to pull this off is still a secret.
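
The encoding puzzle above is easy to put into numbers. The following is plain arithmetic under the article's figures (32-bit instruction word, 128 registers, three register operands, 2 threads); it says nothing about how IBM actually laid out the VMX-128 encoding, which the article notes is not public. The 32-register comparison uses the register count of standard VMX/AltiVec.

```python
INSTRUCTION_BITS = 32     # fixed-width PowerPC instruction word
OPERANDS = 3              # the three-operand case discussed above
THREADS_PER_CORE = 2


def register_field_bits(register_count, operands=OPERANDS):
    """Bits needed to name `operands` registers out of `register_count`."""
    bits_per_operand = (register_count - 1).bit_length()
    return operands * bits_per_operand


vmx128_fields = register_field_bits(128)   # 3 * 7 = 21 bits
classic_fields = register_field_bits(32)   # 3 * 5 = 15 bits

vmx128_left_over = INSTRUCTION_BITS - vmx128_fields    # 11 bits
classic_left_over = INSTRUCTION_BITS - classic_fields  # 17 bits

physical_registers = 128 * THREADS_PER_CORE            # 256

if __name__ == "__main__":
    print("VMX-128 register fields:", vmx128_fields, "bits,",
          vmx128_left_over, "bits left for everything else")
    print("32-register fields:", classic_fields, "bits,",
          classic_left_over, "bits left for everything else")
    print("physical vector registers per core:", physical_registers)
```

Going from 32 to 128 registers costs 6 extra bits for three operands, which is why fitting opcodes into the remaining 11 bits looks tight.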
        Another piece of news about VMX-128: to improve performance, some of the 128 registers are set aside as a register pool. The registers in the pool are shared by multiple function calls, as long as those calls are in the same thread.

        Simultaneous multithreading (I am not sure whether this is better translated as "concurrent" or "parallel"; "parallel" seems fine).
        Compared with Hyper-Threading on the P4, Xenon's SMT implementation is very simple. Even so, the overall design of the PPE seems to have quite a lot in common with the P4's NetBurst architecture (great minds think alike...). Both follow a narrow-and-deep design philosophy, although the PPE has some very important differences.
        The PPE pipeline has 21 stages, the same as the Northwood P4 (or Prescott? I am not sure...). The Xenon PPE has a 32K instruction cache and a 32K data cache, which seems a little small for two threads and a deep pipeline. The PowerPC 970, with its 23-stage pipeline, has a 64K instruction/data cache, while the PowerPC G4, with its 7-stage pipeline, has a 32K instruction/data cache. However, this is not a disaster for Xenon. A 32K instruction/data L1 cache is actually very common, and in a game console chip the performance impact is smaller still, because in game applications like those on the Xbox 360, developers can control the hardware and the cache at a very fine level. If Xenon were used for general-purpose applications, for instance in an Apple machine, that might indeed be a bit of a problem.
        Similarly, Xenon's 1 MB L2 cache also looks rather small, perhaps surprisingly so, especially for a three-core CPU. In theory, the more cores share a cache, the bigger that cache should be. More precisely, the more concurrently executing threads share a cache, the bigger it should be. Rather awkwardly, by that measure Xenon's cache looks even less adequate, because each Xenon core supports two-way multithreading, which means up to 6 threads share the 1 MB L2 cache. Of course, the people at IBM are not fools. They had 2 substantial reasons for settling on such a small L2 cache:
        First, a three-core CPU on a limited die simply does not have that much free area left over for cache. (I hope the Shanghai municipal government understands a similar truth: approve fewer upscale buildings of the cache kind and leave some building land for the rest of us, because the die that is Shanghai is only so big.) Xenon is already large enough to need water cooling to work; making it any bigger would be unreasonable.
        Second, and more importantly, the way streaming media applications (which is what Xenon specializes in) use the cache means that a cache of this size is enough. In the applications Xenon targets, the cache will not be used very effectively. Media applications of this kind usually stream data through quickly, and what is in the cache is rarely reused. A real-world case is the Celeron: when playing Quake, the Celeron with no cache ran almost as fast as the Pentium II with its cache.
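
The claim that streaming data gets little out of a cache can be checked with a toy simulation. The sketch below assumes a fully associative cache with least-recently-used replacement and made-up access patterns; real caches (Xenon's included) are organized differently, so only the trend is meaningful.

```python
from collections import OrderedDict


def hit_rate(accesses, capacity):
    """Fraction of accesses that hit in a toy LRU cache of `capacity` blocks."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(accesses)


# Streaming: every block is touched exactly once and never again.
streaming = list(range(10_000))
# Reuse: a working set of 64 blocks visited over and over.
looping = [i % 64 for i in range(10_000)]

if __name__ == "__main__":
    for capacity in (32, 128, 4096):
        print("capacity", capacity,
              "streaming", hit_rate(streaming, capacity),
              "looping", hit_rate(looping, capacity))
```

For the streaming pattern the hit rate is 0 at every capacity, so a larger cache buys nothing. For the looping pattern it jumps from 0 to about 0.99 once the cache holds the whole working set, which is the kind of code (AI, game control) that does care about cache size.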
       
        Another problem with a deep pipeline is branch prediction. The PPE does of course have it. For lack of information, all that can be inferred at present is that the PPE's branch misprediction rate will probably be higher than the PowerPC 970's. The PPE's strategy seems to expect software to provide branch hints. The hardware does less of the branch prediction work, which means software branch hints matter more. Xenon programmers, and Cell programmers too: keep at it and get the branch hints right, because better game performance depends on you.
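
Why a software branch hint matters on a deep pipeline can be illustrated with a small counting model. This is not how the PPE's predictor works (the article says the details are not public); it simply assumes the hardware always follows a fixed hint, and the penalty figure is an assumed placeholder rather than a measured number.

```python
# Assumption for illustration only: cycles lost per mispredicted branch.
# The article gives a 21-stage pipeline but no misprediction penalty.
ASSUMED_PENALTY_CYCLES = 20


def count_mispredictions(outcomes, hint_taken):
    """Static prediction: always follow the programmer-supplied hint."""
    return sum(1 for taken in outcomes if taken != hint_taken)


# Hypothetical loop branch: taken 99 times, then falls through once.
loop_branch = [True] * 99 + [False]

good_hint = count_mispredictions(loop_branch, hint_taken=True)
bad_hint = count_mispredictions(loop_branch, hint_taken=False)

if __name__ == "__main__":
    print("good hint:", good_hint, "mispredictions,",
          good_hint * ASSUMED_PENALTY_CYCLES, "cycles lost")
    print("bad hint:", bad_hint, "mispredictions,",
          bad_hint * ASSUMED_PENALTY_CYCLES, "cycles lost")
```

With the right hint the loop mispredicts once; with the wrong one it mispredicts 99 times. In C, GCC and Clang expose a comparable programmer hint through `__builtin_expect`.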

        Like the PS3's Cell, the Xbox 360's Xenon differs hugely from its predecessors and follows a different strategy. By using procedural synthesis and multithreading, the Xbox 360 can build visual environments more compelling than those of any current game console or PC.
        For branch-sensitive code such as AI and game control, Xenon's performance may be mediocre or even poor. Xenon is without question a streaming media monster, but setting aside the work that makes a game look good, the elements of the engine that make a game more fun to play will suffer on Xenon. The small L2 cache will hurt the performance of AI and control code, which has to share that meager 1 MB with procedural synthesis and other graphics code. Programmers will have to put in real effort for this non-graphics code to perform well.
        Although Xenon can run 6 threads at once, branch-sensitive code such as AI and control does not have the high thread-level parallelism of graphics code, so the 6 threads get little chance to shine. Likewise, code that runs happily on an out-of-order CPU will not benefit from Xenon's in-order execution strategy.
        So the final result is that the Xbox 360 gives developers excellent graphics resources but asks them to spend more effort on the non-graphics elements of a game. Looked at from another angle: in the PC market, developers have to support many kinds of CPU and cannot tune and optimize specifically for any one of them. That heterogeneity undermines optimization techniques such as branch hints. By comparison, Xenon is not so bad: the hardware platform is fixed, so developers at least have the chance to go all out on profiling, optimization and so on, and the weakness in control and AI code described above may be offset to some degree. As Xenon comes into use, developers will always find ways to deal with problems such as the execution efficiency of branch-sensitive code; that is what human ingenuity is for. Of course, gamers should not count on the first generation of games being that good ;-P
        Having looked at Xenon, we can also make a prediction about its rival, the PS3's Cell, which may be even worse off. Cell has only 1 PPE where Xenon has 3, which means the programmer has to squeeze control, AI and physics code into at most 2 threads that share a narrow execution core with no instruction window. The PS3's SPEs do not even support branch prediction. In addition, the PS3's L2 cache is smaller still, only 512K, half of Xenon's. In short, the PS3 will be worse than the Xbox 360 at non-graphics code, but with the help of its 7 SPEs, its game visuals will be considerably stronger than the Xbox 360's.

        Obviously, even an idiot knows that whether a game console succeeds is decided by many factors beyond processor architecture. But since the Xbox 360's architecture and the Cell that the PS3 will use have so much in common, they seem to be starting from nearly the same line, which lets us all sit back at home and enjoy watching the tigers fight.

Original text address
